A TensorFlow implementation for tabular data



My open source product: a neural network for tabular data.

A simple implementation of a deep neural network architecture for tabular data, with automatic generation of layers and a layer-by-layer halving of the number of neurons. It is as easy to use as classic machine learning methods.

In this article, we will look at the reasons for creating this library, walk through a short tutorial, and compare the prediction accuracy of DatRetClassifier and DatRetRegressor with classical machine learning methods.


Introduction

Classical machine learning methods are most often used to predict tabular data, usually via scikit-learn. One of the advantages of this library is its ease of use: we prepare the data, call fit and predict, and we are done.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# generate a toy dataset, fit, predict: three calls and done
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
print(clf.predict([[0, 0, 0, 0]]))

Using neural networks, in particular the TensorFlow or PyTorch libraries, involves building the architecture of a neural network model and then training it and making predictions. This requires a higher entry threshold.

Many ready-made neural network architectures have been implemented for working with images, text, and sound. Few exist for tabular data; TabNet is one example.

The main purpose of creating DatRet was to lower the entry threshold for working with neural networks. Training and prediction are implemented as in classical methods, for example RandomForestClassifier or CatBoostClassifier. To achieve this, I implemented automatic generation of the neural network architecture based on the number of neurons chosen for the first fully connected layer. The second goal was to approach the classical methods in prediction accuracy on structured tabular data.

The library provides three classes:

  • DatRetClassifier for classification problems

  • DatRetRegressor for regression problems

  • DatRetMultilabelClassifier for multilabel classification

Advantages

  • simplicity and ease of use: fit and predict, et voilà!

  • automatic generation of neural network architecture

  • quick adjustment of model parameters

  • GPU support

  • high prediction accuracy

  • support for multilabel classification

  • Tensorflow under the hood 😉

Installation

The source code is currently hosted on GitHub at GitHub – AbdualimovTP/datret: Tensorflow Implementation for Structured Tabular Data. Binary installers of the latest released version are available on the Python Package Index (PyPI).

# PyPI
pip install datret

Dependencies

DatRet requires Tensorflow, NumPy, Pandas, and Scikit-Learn.

Fast start

Model training and prediction work as in scikit-learn: prepare the train and test sets and start training the model. Automatic data normalization for neural networks is supported.

NB! Don’t forget to install the dependencies before using the model. You will need Tensorflow, Numpy, Pandas and Scikit-Learn installed.

NB! It is not necessary to one-hot encode the target values for the classification task. The model will do this automatically.

# load the library
from datret.datret import DatRetClassifier, DatRetRegressor, DatRetMultilabelClassifier
from sklearn.model_selection import train_test_split

# prepare the train/test split, as in sklearn
# for example
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# call the regressor or classifier and train the model
DR = DatRetClassifier()  # DatRetRegressor works on the same principle
DR.fit(X_train, y_train)
# predict the label (or class) for a new set of data
DR_predict = DR.predict(X_test)
# predict the class probabilities for each data point
DR_predict_proba = DR.predict_proba(X_test)  # not available in DatRetRegressor and DatRetMultilabelClassifier
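
Under the hood, as the options below describe, DatRet scales the features with MinMaxScaler and one-hot encodes the class labels. A rough sketch of that preprocessing (an illustration, not the library's actual code):

# illustration only: roughly what the automatic preprocessing does
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.utils import to_categorical

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
y = np.array([0, 1, 0])

X_scaled = MinMaxScaler().fit_transform(X)  # features scaled to [0, 1]
y_onehot = to_categorical(y)                # labels one-hot encoded, shape (3, 2)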

Custom Model Options

Options:

  • epoch: int, default = 30. The number of epochs to train the model.

  • optimizer: str (the name of an optimizer) or an optimizer instance. See tf.keras.optimizers. Default = Adam(learning_rate=0.001). In DatRetRegressor the default learning rate is 0.01.

  • loss: str or a loss function. See tf.keras.losses. The default for DatRetClassifier is CategoricalCrossentropy(), for DatRetRegressor it is MeanSquaredError().

  • verbose: ‘auto’, 0, 1, or 2, default = 0. Controls the training output printed per epoch.

  • number_neurons: int, default = 500. The number of neurons in the first fully connected layer. Subsequent layers are generated automatically, each with half the number of neurons.

  • validation_split: float from 0 to 1, default = 0. Fraction of training data to be used as validation data. The model will extract this part of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch.

  • batch_size: int, default = 1. Number of samples per gradient update. steps_per_epoch is calculated automatically as X_train.shape[0] // batch_size.

  • shuffle: True or False, default = True. Whether to shuffle the training data.

  • callback: list, default = [EarlyStopping(monitor="loss", mode="auto", patience=7, verbose=1), ReduceLROnPlateau(monitor="loss", factor=0.2, patience=3, min_lr=0.00001, verbose=1)]. Callbacks: utilities called at specific points during model training.

Custom Parameters of the fit Method

Options:

  • normalize: True or False, default = True. Automatic normalization of the input data using MinMaxScaler; see the sketch below for handling scaling manually.
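
If you prefer to scale the data yourself, one possible pattern is shown below; it assumes that normalize=False simply skips the internal MinMaxScaler:

# sketch: manual scaling instead of the built-in normalization
# (assumes normalize=False disables DatRet's internal MinMaxScaler)
from sklearn.preprocessing import MinMaxScaler
from datret.datret import DatRetClassifier

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training-set statistics

DR = DatRetClassifier()
DR.fit(X_train_scaled, y_train, normalize=False)
DR_predict = DR.predict(X_test_scaled)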

Model setup example:

# load library
import tensorflow as tf
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam, Nadam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.losses import CategoricalCrossentropy, MeanSquaredError, BinaryCrossentropy
from datret.datret import DatRetClassifier, DatRetRegressor, DatRetMultilabelClassifier
from sklearn.model_selection import train_test_split

# prepare train, test split. As in sklearn.
# for example
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Call the regressor or classifier and train the model.
DR = DatRetClassifier(epoch=50,
                      optimizer=Nadam(learning_rate=0.001),
                      loss=BinaryCrossentropy(),
                      verbose=1,
                      number_neurons=1000,
                      validation_split = 0.1,
                      batch_size=100,
                      shuffle=True,
                      callback=[])
DR.fit(X_train, y_train, normalize=True)
# predict the actual label (or class) over a new set of data.
DR_predict = DR.predict(X_test)
# predict the class probabilities for each data point.
DR_predict_proba = DR.predict_proba(X_test)

Model architecture

The model automatically generates the architecture based on the number of neurons in the first fully connected layer. For example, with number_neurons = 500 in the first fully connected layer and 2 predicted classes (0, 1), the neural network automatically gets the following architecture.

Model: "DatRet with number_neurons = 500"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, X_train.shape[1])]     0         

 dense (Dense)               (None, 500)               150500    

 dense_1 (Dense)             (None, 250)               125250    

 dense_2 (Dense)             (None, 125)               31375     

 dense_3 (Dense)             (None, 62)                7812      

 dense_4 (Dense)             (None, 31)                1953      

 dense_5 (Dense)             (None, 15)                480       

 dense_6 (Dense)             (None, 7)                 112       

 dense_7 (Dense)             (None, 3)                 24        

 dense_8 (Dense)             (None, 2)                 8         
                       (2 predictable classes)                               
=================================================================
Total params: 317,514
Trainable params: 317,514
Non-trainable params: 0
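
As a rough sketch, such a halving stack can be generated with the Keras functional API. The layer widths and parameter counts below match the summary above for an input of 300 features, but the actual DatRet implementation (activations, output wiring) may differ:

# sketch of the halving-layer idea; not DatRet's actual code,
# and the relu/softmax activations are an assumption
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

def build_halving_model(n_features, n_classes, number_neurons=500):
    inputs = Input(shape=(n_features,))
    x = inputs
    neurons = number_neurons
    while neurons > n_classes:          # halve the width layer by layer
        x = Dense(neurons, activation="relu")(x)
        neurons //= 2
    outputs = Dense(n_classes, activation="softmax")(x)
    return Model(inputs, outputs)

build_halving_model(300, 2).summary()   # 500 -> 250 -> ... -> 3 -> 2, 317,514 params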

Comparison of accuracy with classical machine learning methods

DatRetClassifier for classification tasks

To evaluate the accuracy of the classifier, we will use the Pima Indians Diabetes Database from Kaggle and the ROC AUC metric. I will compare DatRet with out-of-the-box RandomForest and CatBoost. The full notebook is available on GitHub.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier
from tensorflow.keras.optimizers import Adam
from datret.datret import DatRetClassifier

# data: the Pima Indians Diabetes dataframe, loaded beforehand
test_sizes = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
# results table: one row per model, one column per test-set share
dataFrameRocAuc = pd.DataFrame(index=['RandomForest', 'CatBoost', 'DatRet'],
                               columns=[f'{int(i*100)}%' for i in test_sizes])

for i in test_sizes:
    X_train, X_test, y_train, y_test = train_test_split(data.drop(["Outcome"], axis=1), data["Outcome"],
                                                        random_state=10, test_size=i)
    # RandomForest
    RF = RandomForestClassifier(random_state=0)
    RF.fit(X_train, y_train)
    RF_pred = RF.predict_proba(X_test)
    dataFrameRocAuc.loc['RandomForest', f'{int(i*100)}%'] = np.round(roc_auc_score(y_test, RF_pred[:, 1]), 2)

    # CatBoost
    CB = CatBoostClassifier(random_state=0, verbose=0)
    CB.fit(X_train, y_train)
    CB_pred = CB.predict_proba(X_test)
    dataFrameRocAuc.loc['CatBoost', f'{int(i*100)}%'] = np.round(roc_auc_score(y_test, CB_pred[:, 1]), 2)

    # DatRet
    DR = DatRetClassifier(optimizer=Adam(learning_rate=0.001))
    DR.fit(X_train, y_train)
    DR_pred = DR.predict_proba(X_test)
    dataFrameRocAuc.loc['DatRet', f'{int(i*100)}%'] = np.round(roc_auc_score(y_test, DR_pred[:, 1]), 2)

ROC AUC by test-set share:

Model          10%     20%     30%     40%     50%     60%
RandomForest   0.79    0.81    0.81    0.79    0.82    0.82
CatBoost       0.78    0.82    0.82    0.80    0.81    0.82
DatRet         0.79    0.84    0.82    0.81    0.84    0.81

DatRetRegressor for regression problems

To evaluate the accuracy of the regressor, we will use the Medical Cost Personal Datasets from Kaggle and the root mean squared error (RMSE) metric. I will compare DatRet with out-of-the-box RandomForest and CatBoost. The full notebook is available on GitHub.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor
from tensorflow.keras.optimizers import Adam
from datret.datret import DatRetRegressor

# data: the Medical Cost Personal dataframe, loaded and encoded beforehand
test_sizes = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
# results table: one row per model, one column per test-set share
dataFrameRMSE = pd.DataFrame(index=['RandomForest', 'CatBoost', 'DatRet'],
                             columns=[f'{int(i*100)}%' for i in test_sizes])

for i in test_sizes:
    X_train, X_test, y_train, y_test = train_test_split(data.drop(["charges"], axis=1), data["charges"],
                                                        random_state=10, test_size=i)
    # RandomForest
    RF = RandomForestRegressor(random_state=0)
    RF.fit(X_train, y_train)
    RF_pred = RF.predict(X_test)
    dataFrameRMSE.loc['RandomForest', f'{int(i*100)}%'] = np.round(mean_squared_error(y_test, RF_pred, squared=False), 2)

    # CatBoost
    CB = CatBoostRegressor(random_state=0, verbose=0)
    CB.fit(X_train, y_train)
    CB_pred = CB.predict(X_test)
    dataFrameRMSE.loc['CatBoost', f'{int(i*100)}%'] = np.round(mean_squared_error(y_test, CB_pred, squared=False), 2)

    # DatRet
    DR = DatRetRegressor(optimizer=Adam(learning_rate=0.01))
    DR.fit(X_train, y_train)
    DR_pred = DR.predict(X_test)
    dataFrameRMSE.loc['DatRet', f'{int(i*100)}%'] = np.round(mean_squared_error(y_test, DR_pred, squared=False), 2)

RMSE by test-set share:

Model          10%     20%     30%     40%     50%     60%
RandomForest   5736    5295    4777    4956    4904    4793
CatBoost       5732    5251    4664    4986    5044    4989
DatRet         5860    5173    4610    4927    5047    5780

Not bad results for an out-of-the-box model.

In the classification task, DatRet showed the best (or tied for best) results for test-set sizes of 10%, 20%, 30%, 40%, and 50% of the full dataset.

In the regression task, DatRet gives the best accuracy for test-set sizes of 20%, 30%, and 40% of the full dataset.

In the future, I plan to evaluate the model's accuracy on other datasets. I also see opportunities to improve the quality of forecasting and plan to implement them in upcoming versions of the library.