TensorFlow implementation for tabular data
My open source product: a neural network for tabular data.
A simple implementation of a deep neural network architecture for tabular data, with automatic generation of layers and a layer-by-layer reduction in the number of neurons. It aims for the same ease of use as classic machine learning methods.
In this article, we look at the reasons for creating this library, walk through a short tutorial, and compare the prediction accuracy of DatRetClassifier and DatRetRegressor with classical machine learning methods.
Introduction
Tabular data is most often predicted with classical machine learning methods, usually as implemented in scikit-learn. One of the advantages of that library is ease of use: we prepare the data, call fit and predict, and we are done.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
print(clf.predict([[0, 0, 0, 0]]))
Using neural networks, in particular with libraries such as TensorFlow or PyTorch, involves designing the architecture of the model before training and predicting. This raises the entry threshold.
Many ready-made neural network architectures have been implemented for images, text, and sound. Far fewer exist for tabular data; TabNet is one example.
The main purpose of DatRet is to lower the entry threshold for working with neural networks. Training and prediction work as in classical methods such as RandomForestClassifier or CatBoostClassifier. To achieve this, I implemented automatic generation of the neural network architecture, based on the number of neurons chosen for the first fully connected layer. The second goal was to approach the classical methods in prediction accuracy on structured tabular data.
The library has three classes:
DatRetClassifier for classification problems.
DatRetRegressor for regression problems.
DatRetMultilabelClassifier for multilabel classification.
Advantages
simplicity and ease of use: fit and predict, et voilà!
automatic generation of neural network architecture
quick adjustment of model parameters
GPU support
high prediction accuracy
support for multilabel classification
TensorFlow under the hood 😉
Installation
The source code is currently hosted on GitHub at AbdualimovTP/datret (Tensorflow Implementation for Structured Tabular Data). Binary installers for the latest released version are available on the Python Package Index (PyPI).
# PyPI
pip install datret
Dependencies
Quick start
Model training and prediction are implemented as in scikit-learn. Prepare the train and test sets and start training the model. Input data is normalized automatically for the neural network.
NB! Don’t forget to install the dependencies before using the model. You will need TensorFlow, NumPy, pandas, and scikit-learn.
NB! You do not need to one-hot encode the target values for classification tasks. The model does this automatically.
# load library
from sklearn.model_selection import train_test_split
from datret.datret import DatRetClassifier, DatRetRegressor, DatRetMultilabelClassifier
# prepare the train/test split, as in sklearn
# for example (X and y are your features and target; test_size=0.2 is an illustrative value)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Call the regressor or classifier and train the model.
DR = DatRetClassifier()  # DatRetRegressor works on the same principle
DR.fit(X_train, y_train)
# predict the actual label (or class) over a new set of data.
DR_predict = DR.predict(X_test)
# predict the class probabilities for each data point.
DR_predict_proba = DR.predict_proba(X_test)  # not available in DatRetRegressor or DatRetMultilabelClassifier
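To make the note about one-hot encoding concrete, here is a minimal NumPy sketch of what such an encoding looks like and how predicted labels relate to predicted class probabilities. It is an illustration only, not DatRet's internal code; the helper names one_hot and labels_from_proba and the toy arrays are assumptions.

```python
import numpy as np

def one_hot(y):
    """Encode class labels as a one-hot matrix (hypothetical helper)."""
    # classes holds the sorted unique labels, idx maps each sample to its class index
    classes, idx = np.unique(y, return_inverse=True)
    return classes, np.eye(len(classes))[idx]

def labels_from_proba(proba, classes):
    """Pick the most probable class for each row of a probability matrix."""
    return classes[np.argmax(proba, axis=1)]

# Toy target with two classes
y = np.array([0, 1, 1, 0, 1])
classes, y_encoded = one_hot(y)

# Toy class probabilities, shaped like the output of predict_proba
proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])
labels = labels_from_proba(proba, classes)

print(y_encoded)
print(labels)
```

Each row of y_encoded has a single 1 in the column of its class, which is the target format that a categorical cross-entropy loss expects.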
Custom Model Options
Options:
epoch: int, default = 30. The number of epochs to train the model.
optimizer: str (name of the optimizer) or optimizer instance, see tf.keras.optimizers. Default = Adam(learning_rate=0.001). In DatRetRegressor the default learning rate is 0.01.
loss: str or loss function, see tf.keras.losses. Default for DatRetClassifier = CategoricalCrossentropy(), for DatRetRegressor = MeanSquaredError().
verbose: ‘auto’, 0, 1, or 2, default = 0. Controls how training progress is printed for each epoch.
number_neurons: int, default = 500. The number of neurons in the first fully connected layer. Each subsequent layer is generated automatically with half the neurons of the previous one.
validation_split: float from 0 to 1, default = 0. Fraction of the training data to be used as validation data. The model sets this part of the training data aside, does not train on it, and evaluates the loss and any model metrics on it at the end of each epoch.
batch_size: int, default = 1. Number of samples per gradient update. steps_per_epoch is calculated automatically as X_train.shape[0] // batch_size.
shuffle: True or False, default = True. Whether to shuffle the training data.
callback: list, default = [EarlyStopping(monitor="loss", mode="auto", patience=7, verbose=1), ReduceLROnPlateau(monitor="loss", factor=0.2, patience=3, min_lr=0.00001, verbose=1)]. Callbacks are utilities called at specific points during model training.
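As a worked example of two of the options above, the sketch below computes steps_per_epoch from the formula given for batch_size and traces how the default ReduceLROnPlateau settings (factor=0.2, min_lr=0.00001) would shrink the default learning rate of 0.001. It assumes the usual Keras rule new_lr = max(old_lr * factor, min_lr); the function names and the row count of 1000 are hypothetical.

```python
def steps_per_epoch(n_rows, batch_size):
    """Number of gradient updates per epoch: X_train.shape[0] // batch_size."""
    return n_rows // batch_size

def reduce_lr_schedule(initial_lr, factor, min_lr, n_plateaus):
    """Learning rate after each detected plateau, never going below min_lr."""
    rates = [initial_lr]
    for _ in range(n_plateaus):
        rates.append(max(rates[-1] * factor, min_lr))
    return rates

n_rows = 1000  # hypothetical training set size

# With the default batch_size=1 every row is its own gradient update
steps_default = steps_per_epoch(n_rows, 1)
# A larger batch cuts the number of updates per epoch
steps_batched = steps_per_epoch(n_rows, 100)

# Default optimizer and callback values from the options list
rates = reduce_lr_schedule(0.001, 0.2, 0.00001, 4)

print(steps_default, steps_batched)
print(rates)
```

With these defaults the learning rate reaches its floor of 0.00001 after the third plateau, and the default batch_size of 1 means one update per training row, so raising batch_size is the first thing to try when training is slow.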
Custom parameters of the fit method
Options:
normalize: True or False, default = True. Automatic normalization of the input data using MinMaxScaler.
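For readers unfamiliar with min-max scaling, the following NumPy sketch shows the transformation that a MinMaxScaler-style normalization performs: each feature is mapped to the range [0, 1] using the minimum and maximum seen in the training data. This is an illustration with hypothetical helper names and toy data, not the code DatRet runs; in particular, fitting on the training set only and reusing those statistics for the test set is an assumption about good practice, not a statement about the library.

```python
import numpy as np

def fit_min_max(X_train):
    """Learn per-feature minimum and range from the training data."""
    lo = X_train.min(axis=0)
    hi = X_train.max(axis=0)
    # Constant columns get a span of 1 to avoid division by zero
    span = np.where(hi > lo, hi - lo, 1.0)
    return lo, span

def apply_min_max(X, lo, span):
    """Scale features with statistics learned from the training data."""
    return (X - lo) / span

# Toy data: two features on very different scales
X_train = np.array([[1.0, 200.0], [3.0, 400.0], [5.0, 600.0]])
X_test = np.array([[2.0, 500.0]])

lo, span = fit_min_max(X_train)
X_train_scaled = apply_min_max(X_train, lo, span)
X_test_scaled = apply_min_max(X_test, lo, span)

print(X_train_scaled)
print(X_test_scaled)
```

Bringing all features to a common scale matters for neural networks because gradient-based training is sensitive to the magnitude of the inputs, unlike tree-based methods such as random forests.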
Model setup example:
# load library
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam, Nadam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.losses import CategoricalCrossentropy, MeanSquaredError, BinaryCrossentropy
from datret.datret import DatRetClassifier, DatRetRegressor, DatRetMultilabelClassifier
# prepare the train/test split, as in sklearn
# for example (X and y are your features and target; test_size=0.2 is an illustrative value)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Call the regressor or classifier and train the model.
DR = DatRetClassifier(epoch=50,
                      optimizer=Nadam(learning_rate=0.001),
                      loss=BinaryCrossentropy(),
                      verbose=1,
                      number_neurons=1000,
                      validation_split=0.1,
                      batch_size=100,
                      shuffle=True,
                      callback=[])
DR.fit(X_train, y_train, normalize=True)
# predict the actual label (or class) over a new set of data.
DR_predict = DR.predict(X_test)
# predict the class probabilities for each data point.
DR_predict_proba = DR.predict_proba(X_test)
Model architecture
The model automatically generates the architecture based on the number of neurons in the first fully connected layer. For example, with number_neurons = 500 in the first fully connected layer and 2 predicted classes (0, 1), the neural network automatically gets the following architecture.
Model: "DatRet with number_neurons = 500"
_________________________________________________________________
Layer (type)                 Output Shape                 Param #
=================================================================
input_1 (InputLayer)         [(None, X_train.shape[1])]   0
dense (Dense)                (None, 500)                  150500
dense_1 (Dense)              (None, 250)                  125250
dense_2 (Dense)              (None, 125)                  31375
dense_3 (Dense)              (None, 62)                   7812
dense_4 (Dense)              (None, 31)                   1953
dense_5 (Dense)              (None, 15)                   480
dense_6 (Dense)              (None, 7)                    112
dense_7 (Dense)              (None, 3)                    24
dense_8 (Dense)              (None, 2)                    8
(2 predictable classes)
=================================================================
Total params: 317,514
Trainable params: 317,514
Non-trainable params: 0
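The layer widths and parameter counts in this summary can be reproduced with a few lines of plain Python. The sketch below assumes a halving rule inferred from the summary (integer-halve the width while it stays above the number of outputs, then add the output layer); the library's actual stopping rule may differ. The input width of 300 features is also inferred, from 150500 = 500 × (300 + 1). The function names are hypothetical.

```python
def layer_widths(number_neurons, n_outputs):
    """Halve the width while it stays above n_outputs, then add the output layer."""
    widths = []
    width = number_neurons
    while width > n_outputs:
        widths.append(width)
        width //= 2  # integer halving: 125 -> 62, 31 -> 15, 7 -> 3
    widths.append(n_outputs)
    return widths

def dense_param_counts(n_features, widths):
    """Parameters of each dense layer: weights (fan_in * width) plus biases (width)."""
    counts = []
    fan_in = n_features
    for width in widths:
        counts.append(fan_in * width + width)
        fan_in = width
    return counts

n_features = 300  # inferred from the first layer's 150500 parameters
widths = layer_widths(500, 2)
params = dense_param_counts(n_features, widths)

print(widths)
print(params)
print(sum(params))
```

Under these assumptions the computed widths and counts match the summary line by line, and the sum matches the reported total of 317,514 trainable parameters.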
Comparison of accuracy with classical machine learning methods
DatRetClassifier for classification tasks
To evaluate the accuracy of the classifier, we will use the Pima Indians Diabetes Database from Kaggle. The metric is ROC AUC (roc_auc_score). I compare DatRet with RandomForest and CatBoost out of the box. The full notebook is available on GitHub.
import numpy as np
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import Adam
from datret.datret import DatRetClassifier
# data (the dataset) and dataFrameRocAuc (the results table) are prepared in the full notebook
for i in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]:
    X_train, X_test, y_train, y_test = train_test_split(data.drop(["Outcome"], axis=1), data["Outcome"],
                                                        random_state=10, test_size=i)
    # RandomForest
    RF = RandomForestClassifier(random_state=0)
    RF.fit(X_train, y_train)
    RF_pred = RF.predict_proba(X_test)
    dataFrameRocAuc.loc['RandomForest', f'{int(i*100)}%'] = np.round(roc_auc_score(y_test, RF_pred[:,1]), 2)
    # CatBoost
    CB = CatBoostClassifier(random_state=0, verbose=0)
    CB.fit(X_train, y_train)
    CB_pred = CB.predict_proba(X_test)
    dataFrameRocAuc.loc['CatBoost', f'{int(i*100)}%'] = np.round(roc_auc_score(y_test, CB_pred[:,1]), 2)
    # DatRet
    DR = DatRetClassifier(optimizer=Adam(learning_rate=0.001))
    DR.fit(X_train, y_train)
    DR_pred = DR.predict_proba(X_test)
    dataFrameRocAuc.loc['DatRet', f'{int(i*100)}%'] = np.round(roc_auc_score(y_test, DR_pred[:,1]), 2)
Model | 10% | 20% | 30% | 40% | 50% | 60% |
---|---|---|---|---|---|---|
RandomForest | 0.79 | 0.81 | 0.81 | 0.79 | 0.82 | 0.82 |
CatBoost | 0.78 | 0.82 | 0.82 | 0.8 | 0.81 | 0.82 |
DatRet | 0.79 | 0.84 | 0.82 | 0.81 | 0.84 | 0.81 |
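For readers who want to see what the ROC AUC values in this table measure, here is a small pure-Python sketch of the metric in its pairwise form: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name and the toy labels and scores are illustrative; the benchmark itself uses scikit-learn's roc_auc_score.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the probability that a positive is ranked above a negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy example: four samples, scores are predicted probabilities of class 1
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the differences of one or two hundredths between the models in the table are small.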
DatRetRegressor for regression problems
To evaluate the accuracy of the regressor, we will use the Medical Cost Personal Datasets from Kaggle. The metric is root mean squared error (RMSE). I compare DatRet with RandomForest and CatBoost out of the box. The full notebook is available on GitHub.
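import numpy as np
from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import Adam
from datret.datret import DatRetRegressor
# data (the dataset) and dataFrameRMSE (the results table) are prepared in the full notebook
# squared=False returns the RMSE (newer scikit-learn versions offer root_mean_squared_error for this)
for i in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]:
    X_train, X_test, y_train, y_test = train_test_split(data.drop(["charges"], axis=1), data["charges"],
                                                        random_state=10, test_size=i)
    # RandomForest
    RF = RandomForestRegressor(random_state=0)
    RF.fit(X_train, y_train)
    RF_pred = RF.predict(X_test)
    dataFrameRMSE.loc['RandomForest', f'{int(i*100)}%'] = np.round(mean_squared_error(y_test, RF_pred, squared=False), 2)
    # CatBoost
    CB = CatBoostRegressor(random_state=0, verbose=0)
    CB.fit(X_train, y_train)
    CB_pred = CB.predict(X_test)
    dataFrameRMSE.loc['CatBoost', f'{int(i*100)}%'] = np.round(mean_squared_error(y_test, CB_pred, squared=False), 2)
    # DatRet
    DR = DatRetRegressor(optimizer=Adam(learning_rate=0.01))
    DR.fit(X_train, y_train)
    DR_pred = DR.predict(X_test)
    dataFrameRMSE.loc['DatRet', f'{int(i*100)}%'] = np.round(mean_squared_error(y_test, DR_pred, squared=False), 2)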
Model | 10% | 20% | 30% | 40% | 50% | 60% |
---|---|---|---|---|---|---|
RandomForest | 5736 | 5295 | 4777 | 4956 | 4904 | 4793 |
CatBoost | 5732 | 5251 | 4664 | 4986 | 5044 | 4989 |
DatRet | 5860 | 5173 | 4610 | 4927 | 5047 | 5780 |
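The RMSE values in this table are in the units of the target column (charges), and lower is better. As a reminder of how the metric is computed, here is a minimal pure-Python sketch; the function name and the toy numbers are illustrative and unrelated to the dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean of squared differences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy example: errors of 1, 0 and -2 give squared errors 1, 0 and 4
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
error = rmse(y_true, y_pred)
print(round(error, 3))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.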
Not bad results for an out-of-the-box model.
In the classification problem, DatRet achieved the best ROC AUC, alone or tied with a baseline, when the test set was 10%, 20%, 30%, 40%, or 50% of the dataset.
In the regression problem, DatRet achieved the lowest RMSE when the test set was 20%, 30%, or 40% of the dataset.
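These two statements can be checked directly against the results tables. The sketch below copies the reported values into dictionaries and lists, for each test size, the model or models with the best score (highest ROC AUC, lowest RMSE). The helper name best_models is hypothetical; the numbers are taken verbatim from the tables above.

```python
test_sizes = ["10%", "20%", "30%", "40%", "50%", "60%"]

# Values copied from the two results tables
roc_auc = {
    "RandomForest": [0.79, 0.81, 0.81, 0.79, 0.82, 0.82],
    "CatBoost": [0.78, 0.82, 0.82, 0.80, 0.81, 0.82],
    "DatRet": [0.79, 0.84, 0.82, 0.81, 0.84, 0.81],
}
rmse = {
    "RandomForest": [5736, 5295, 4777, 4956, 4904, 4793],
    "CatBoost": [5732, 5251, 4664, 4986, 5044, 4989],
    "DatRet": [5860, 5173, 4610, 4927, 5047, 5780],
}

def best_models(table, higher_is_better):
    """For each test size, return every model that reaches the best score (ties included)."""
    pick = max if higher_is_better else min
    winners = {}
    for col, size in enumerate(test_sizes):
        best = pick(scores[col] for scores in table.values())
        winners[size] = [name for name, scores in table.items() if scores[col] == best]
    return winners

auc_winners = best_models(roc_auc, higher_is_better=True)
rmse_winners = best_models(rmse, higher_is_better=False)

# Test sizes at which DatRet is best or tied for best
datret_auc = [size for size, names in auc_winners.items() if "DatRet" in names]
datret_rmse = [size for size, names in rmse_winners.items() if "DatRet" in names]
print(datret_auc)
print(datret_rmse)
```

The check also shows where the results are ties: at the 10% and 30% test sizes DatRet shares the best ROC AUC with one of the baselines.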
In the future, I plan to evaluate the accuracy of the model on other datasets. I also see opportunities to improve prediction quality, which I plan to implement in upcoming versions of the library.