How I Stopped Worrying and Loved Absolute Activation


A bit of history

It all started with lectures. To illustrate how a neural network operates, you need simple examples. It is well known that a single neuron forms a separating hyperplane, so problems like "find the straight line that separates the two colors on the flag of Monaco (two horizontal stripes)" are solved by a single neuron. Problems start later, for example with the flag of Japan (a red circle on a white background): one neuron does not solve it well. The standard approach is brute force: increase the number of neurons, add a decision layer, and the problem is solved. And here problem number one arises: how many neurons to put in the hidden layer? The traditional answer from the training literature is to choose empirically. On the one hand, there should not be too many, because there will be many unknown parameters; on the other hand, too few is also not great, because we already got burned with one neuron. So the standard question is: how many neurons do you really need?

It turns out that the answer to this question has been known for a long time: in this problem, exactly five. There is the Kolmogorov-Arnold theorem, which proves that if we take five neurons, there exist some smooth activation functions for which a two-layer network will solve almost any simple problem with two-dimensional input data. This was proven back in the late 1950s and settled one of the most important mathematical problems of the 20th century, Hilbert's 13th problem. The key phrase here is "some smooth activation functions". Nobody said exactly what they are, so they have to be searched for.
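
For reference, here is one standard textbook form of that representation (not quoted from the article): any continuous function of n variables can be written as

f(x_1,\dots,x_n)=\sum_{q=0}^{2n} \Phi_q\!\left(\sum_{p=1}^{n}\phi_{q,p}(x_p)\right)

with continuous one-dimensional functions \Phi_q and \phi_{q,p}. For two-dimensional input (n=2) the outer sum has 2n+1=5 terms, which is where the "exactly five" comes from.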

Many neuron activation functions have been invented, and for many of them it has been proven that a multilayer network based on them can correctly approximate our unknown coloring. But we ask the question: which activation function is better?

If you dig a little deeper into the literature on neural networks, you can find the answer to this question. The fact is that modern multilayer neural networks are trained by gradient descent methods and their variations (Adam, RMSProp, and so on), and this is where the catch lies.

A fully connected neural network (for example, a four-layer one consisting only of classical neurons) implements an operation something like this:

y=f_1(f_2(f_3(f_4(x))))

where f_i=\phi(A_i \cdot x + B_i) is what the i-th layer of neurons does mathematically with its input vector, \phi are the activation functions of these neurons, and A_i, B_i are some coefficients (matrices, in fact, but that is not important now) that must be found by gradient descent.

How do we look for them? We need to compute derivatives, and here it turns out that the derivative needed to find, for example, A_4 looks like:

\frac {dy}{dA_4} \approx (\phi')^4
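
To see where the fourth power comes from, it helps to write out the chain rule (a sketch added here for clarity; tensor shapes and transposes are ignored, only the number of \phi' factors matters):

\frac{dy}{dA_4} = \phi'(z_1)\, A_1\, \phi'(z_2)\, A_2\, \phi'(z_3)\, A_3\, \phi'(z_4)\, x \sim (\phi')^4

where z_i denotes the pre-activation input of the i-th layer. With N layers, the same product contains N factors of \phi'.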

That is the problem. If there are many layers in the neural network (N), then to find the coefficients of the very first layer you need the derivative of the activation function, \phi', raised to the N-th power. And numbers raised to high powers have a very unpleasant feature: if the number is greater than one, its high power tends to infinity, and if it is less than one, it drops to zero. These are called gradient explosion and gradient vanishing, respectively, and they make it very difficult to train networks deeper than 10-11 layers. It is precisely to circumvent this problem that different activation functions and residual blocks are used in modern neural networks.

It is clear that it is best to use activation functions whose derivative equals one. The simplest such function is the linear activation \phi(x)=x. But it is almost never used, because several layers with linear activation are equivalent to a single layer, and the deep network suddenly becomes only as expressive as a single layer. Therefore, linear activation is most often used only on the output layers of neural networks.
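
A quick numerical illustration of that collapse (a toy sketch with made-up weights, not code from the article):

import numpy as np

# Two "linear layers" applied one after the other...
rng = np.random.default_rng(0)
A1, B1 = rng.normal(size=(4, 3)), rng.normal(size=4)
A2, B2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = A2 @ (A1 @ x + B1) + B2
# ...equal a single layer with A = A2·A1 and B = A2·B1 + B2
one_layer = (A2 @ A1) @ x + (A2 @ B1 + B2)
print(np.allclose(two_layers, one_layer))      # True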

For the well-known sigmoid activation:

\sigma(x)=\frac {1}{1+e^{-x}}

everything is bad: its maximum derivative is 1/4, and raised to the 10th power that is about one millionth, so the first layers of a ten-layer network will hardly learn at all, only the last ones. Therefore, with sigmoids in the hidden layers, there is little point in training deep networks.
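
The bound on the derivative is easy to check (a standard identity, shown here for completeness):

\sigma'(x)=\sigma(x)\bigl(1-\sigma(x)\bigr)\le\frac{1}{4}, \qquad \left(\frac{1}{4}\right)^{10}=\frac{1}{1048576}\approx 10^{-6}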

Tanh, the hyperbolic tangent, is much better. Its derivative reaches 1 at its maximum (near zero), so it is slightly better than the sigmoid, and deeper networks learn better with it.

ReLU is quite a different matter: its derivative is 1 on the entire positive part of the axis and zero on the negative part. So things are much better with it: the derivative is of order 1 as long as you stay on the light side of the force, the positive part of the axis.

But even that is good, yet not quite good enough, because on the dark side there are barrels of jam and baskets of cookies. If you look closely at the formula above with the fourth power of the derivative, it becomes obvious that the derivative can be not only +1 but also -1, and such functions also do not cause the gradient to fade.

And the simplest such nonlinear function is the modulus (absolute value of a number):

\phi(x)=|x|

Quite a lot has been known about it for a long time, but somehow it has not taken root in machine learning. Its derivative is 1 on the positive semiaxis and -1 on the negative one. Therefore the N-th power of this derivative neither grows nor decays and is always equal to either 1 or -1.
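
A quick sanity check of that ±1 gradient (an illustrative snippet, not part of the article's code; it only inspects the gradient of tf.math.abs):

import tensorflow as tf

# The gradient of |x| is -1 on the negative side and +1 on the positive side.
x = tf.constant([-2.0, -0.5, 0.5, 2.0])
with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.math.abs(x)
print(tape.gradient(y, x).numpy())   # [-1. -1.  1.  1.]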

Therefore, let’s see what it will give us in some standard task, for example, the MNIST task.

A few magic passes over the MNIST problem

The MNIST task is the recognition of handwritten digits. Each digit is a 28×28 grayscale image. Ask anyone – an ML programmer or even ChatGPT – and they will immediately explain that the MNIST problem should be solved with convolutional networks. One of the first successful solutions is the convolutional network LeNet-5, and, as you can see from the previous link, without any preprocessing, ensembles or augmentations, the accuracy of such a solution is on the order of 99.05%. Let's try to get better accuracy.

The task is simple, so we can use the TensorFlow library. Let's create a file lenet.py with the code:

import tensorflow as tf
from tensorflow.keras import datasets, layers, models, losses

def lenet(x_train_shape,lr=1e-3):
  # Standard LeNet-5: conv/pooling feature extractor plus dense layers,
  # with the tanh activation applied explicitly after every linear layer.
  input = layers.Input(shape=x_train_shape)
  data = input
  data = layers.Conv2D(6, 5, padding='same', activation='linear')(data)  # 6 feature maps, 5x5 kernels
  data = tf.nn.tanh(data)
  data = layers.AveragePooling2D(2)(data)
  data = tf.nn.tanh(data)
  data = layers.Conv2D(16, 5, activation='linear')(data)                 # 16 feature maps, 5x5 kernels
  data = tf.nn.tanh(data)
  data = layers.AveragePooling2D(2)(data)
  data = tf.nn.tanh(data)
  data = layers.Conv2D(120,1, activation='linear')(data)                 # 120 feature maps, 1x1 kernels
  data = tf.nn.tanh(data)
  data = layers.Flatten()(data)
  data = layers.Dense(84, activation='linear')(data)
  data = tf.nn.tanh(data)
  data_out = layers.Dense(10, activation='softmax')(data)                # 10 digit classes
  model = tf.keras.models.Model(inputs=input,outputs=data_out)
  return model
 

This will be our original model – the standard LeNet-5.

The next step is to train it. Let's create a file custom_callback.py and put into it the code for reading the MNIST dataset and the auxiliary functions required for training:

import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras import datasets, layers, models, losses
import tensorflow.keras as keras
import os
import numpy as np
# f=open('./log_params.dat','wt')
# f.close()

(x_all,y_all),(x_test,y_test) = datasets.mnist.load_data()
# pad the 28x28 images to 32x32, the input size LeNet-5 expects
x_all = tf.pad(x_all, [[0, 0], [2,2], [2,2]])
x_test = tf.pad(x_test, [[0, 0], [2,2], [2,2]])

# split the labelled data 80/20 into a training part and a validation part
x_valid = x_all[x_all.shape[0]*80//100:,:,:]
x_train = x_all[:x_all.shape[0]*80//100,:,:]
y_valid = y_all[y_all.shape[0]*80//100:]
y_train = y_all[:y_all.shape[0]*80//100]

# for conv2D expand dims - add color level
x_train = tf.expand_dims(x_train, axis=3, name=None)
x_valid = tf.expand_dims(x_valid, axis=3, name=None)
x_test = tf.expand_dims(x_test, axis=3, name=None)

LR=0.1
class CustomCallback(keras.callbacks.Callback):
    def __init__(self, patience=5, name="model"):
        super(CustomCallback, self).__init__()
        self.patience=patience
        self.max_expected_acc=-1e10
        self.save_model_name=name
        self.best_epoch=-1
        
    def on_epoch_end(self, epoch, logs=None):
        keys = list(logs.keys())
        # evaluate separately on the two halves of the validation set
        res1=self.model.evaluate(x=x_valid[:y_valid.shape[0]//2],y=y_valid[:y_valid.shape[0]//2],verbose=0)
        res2=self.model.evaluate(x=x_valid[y_valid.shape[0]//2:],y=y_valid[y_valid.shape[0]//2:],verbose=0)

        self.full_acc=logs['accuracy']
        self.val_acc=logs['val_accuracy']
        
        # pessimistic accuracy estimate: the worse of the two halves
        self.expected_acc=np.minimum(res1[1],res2[1])

        if self.expected_acc>self.max_expected_acc:
          self.best_epoch=epoch
          self.max_expected_acc=self.expected_acc
          print('save as optimal model')
          tf.keras.models.save_model(self.model, './'+self.save_model_name)
          f=open('./'+self.save_model_name+'/params.dat','wt')
          f.write('learning rate:'+str(LR)+'\n')
          f.write('best epoch:'+str(epoch+1)+'\n')
          f.write('train acc:'+str(self.full_acc)+'\n')
          f.write('val acc:'+str(self.val_acc)+'\n')
          f.write('expected acc:'+str(self.expected_acc)+'\n')
          f.close()
        # stop if the expected accuracy has not improved for 10 epochs
        if epoch>self.best_epoch+10:
          self.model.stop_training=True

Here we read the standard MNIST dataset, which is initially divided into training (x_all, y_all) and test (x_test, y_test) samples.

The test sample must not be touched; it is intended only for determining the final accuracy of our solution. All training operations will be carried out with x_all, y_all. The first array contains 28×28 images of handwritten digits, and the second contains the values of those digits. We split x_all into two subsets: x_train, on which we will train the neural network, and x_valid, on which we will determine how well we are learning and whether it is time to stop.

The CustomCallback class is responsible for stopping training. The problem is that our task is not to obtain maximum accuracy on the training or validation dataset. Our task is to obtain maximum accuracy on a test dataset that is unknown to us during training. And all we know about it is that it is in some way similar to the previous two.

This means that the accuracy on the test dataset will differ both from the accuracy on the training dataset and from the accuracy on the validation dataset. Therefore, we should not stop when the accuracy on the training or validation dataset is at its maximum, but when we believe that the accuracy of our network on some unknown test dataset will be at its maximum.

How do we do that? The simplest approach is to remember that accuracy is computed from random values (after all, the validation dataset was chosen randomly from the overall dataset), which means the accuracy is itself a random value. It therefore has some probability distribution, and the accuracy on the validation dataset is simply the mean of that distribution. If we want to predict the accuracy on the test dataset, we need something else: a lower bound of that distribution, below which the accuracy on the test dataset is unlikely to fall. The CustomCallback class is dedicated to this guess. The idea is quite simple: instead of the accuracy on the whole validation dataset, we use the minimum of the accuracies achieved on its two halves. One of them will likely be below the mean, and how far below is determined by the width of the accuracy distribution; it turns out to be below the mean by roughly the width of the distribution (as measured by the standard deviation). self.expected_acc holds exactly that value.
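
A toy simulation of that idea (purely illustrative numbers, not taken from the article):

import numpy as np

# Treat the accuracy measured on each half of the validation set as a noisy
# estimate around some true accuracy; the minimum of the two halves then
# lands below the mean by a fraction of the spread.
rng = np.random.default_rng(0)
true_acc, sigma = 0.99, 0.002                  # hypothetical mean and spread
half1 = rng.normal(true_acc, sigma, 100_000)
half2 = rng.normal(true_acc, sigma, 100_000)
expected_acc = np.minimum(half1, half2)
print(expected_acc.mean())                     # about true_acc - sigma/sqrt(pi) ~ 0.9889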

Well, as the old joke goes, now let's try to take off with all of this. To do so, we create the main file train.py:

import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras import datasets, layers, models, losses
import tensorflow.keras as keras
import os
import gc
import numpy as np

import custom_callback as cc
from lenet import lenet

def train(model_func,src,y,src_v,y_v,model_name="model",saveOpt=False,epochs=10,custom_callback=''):
    model=model_func
    ycat=y #tf.keras.utils.to_categorical(y,num_classes=10)
#    model.summary()
    history=model.fit(src,ycat,epochs=epochs,
                      validation_data=[src_v,y_v],
                      verbose=1,
                      callbacks=[custom_callback])
    model=tf.keras.models.load_model('./'+model_name)        
    return model,history


# LeNet
# tf.keras.utils.plot_model(model_lenet,show_layer_activations=True,show_shapes=True,to_file="./model.png")
import matplotlib.pyplot as pp
import numpy as np
        
model=lenet(cc.x_train.shape[1:],lr=0.01)
prev_lr=0
# step the learning rate down: 1e-3, 1e-4, 1e-5, 1e-6
for cc.LR in [0.001,0.0001,0.00001,0.000001]:
 print("train learning rate",cc.LR)
 if (prev_lr>0):
  # continue from the best model saved at the previous learning rate
  model=tf.keras.models.load_model('model')
 model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=cc.LR), 
                loss=losses.sparse_categorical_crossentropy, metrics=['accuracy'])
 my_ccb=cc.CustomCallback(patience=1,name="model")
 model_lenet,history_lenet = train(model,cc.x_train,cc.y_train,cc.x_valid,cc.y_valid,
                                  model_name="model",
                                  epochs=1000, #stop after not finding best results during 10 epochs, see callback
                                  saveOpt=True,
                                  custom_callback=my_ccb)
 prev_lr=cc.LR
 gc.collect()
model_lenet=tf.keras.models.load_model('model')
print('loss,accuracy:',model_lenet.evaluate(cc.x_test,cc.y_test,verbose=0),
                              model_lenet.evaluate(cc.x_train,cc.y_train,verbose=0),
                              model_lenet.evaluate(cc.x_valid,cc.y_valid,verbose=0))

Another training issue is choosing the learning rate (LR). To keep it simple, we will first move with huge steps, then large, then normal, and then slowly creep. First we train the network with a large LR, and when it stops learning (it cannot improve the expected minimum accuracy on an unknown test dataset for 10 epochs), we reduce the LR by a factor of 10 (take smaller steps), and so on, from LR=1e-3 down to LR=1e-6. And don't be afraid of the large number of epochs (1000) – CustomCallback will stop us when necessary.
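
For comparison, Keras ships a built-in callback with a similar "reduce the step when progress stalls" strategy; it monitors the validation loss rather than our custom expected accuracy, so it only roughly approximates the loop above (a sketch, not the article's method):

import tensorflow as tf

# rough built-in alternative: lower the learning rate when validation loss stalls
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                                 factor=0.1,   # LR -> LR/10
                                                 patience=10,  # after 10 epochs without improvement
                                                 min_lr=1e-6)
# model.fit(..., callbacks=[reduce_lr])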

We launch:

python train.py

and look at the result.

While training is running, you can look at the file model/params.dat, where the main parameters of the best network found so far are stored.

My result:

loss,accuracy:

[0.04112757369875908, 0.9886999726295471]

[0.003718348452821374, 0.9994791746139526]

[0.04750677943229675, 0.9881666898727417]

The second value in the first pair is the accuracy on the test dataset: 98.87%, which is quite acceptable and typical for LeNet-5; a similar value is listed on Wikipedia.

Changing activations and increasing accuracy

Let's start with the file lenet.py and change all calls to tf.nn.tanh to tf.nn.relu.

Let’s check how good ReLU is.

Learning outcome:

loss,accuracy:

[0.11699816584587097, 0.9884999990463257]

[1.4511227846014663e-06, 1.0]

[0.1058315858244896, 0.9909999966621399]

The accuracy essentially did not change – 98.85% – which is probably to be expected, because the network is very shallow.

To check the effectiveness of absolute activation, we replace all tf.nn.relu activations in lenet.py with the absolute value tf.math.abs, run it, and get:

loss,accuracy:

[0.058930616825819016, 0.9930999875068665]

[4.2039624759127037e-07, 1.0]

[0.07546615600585938, 0.9920833110809326]

And here is the surprise. We got 99.31% accuracy against 98.87% for the baseline solution, reducing the number of errors by about 1.5 times. This is a very good result. At the very least, it is clear that using Abs is sometimes more profitable than Tanh or ReLU, and one can easily compete for a line on Wikipedia.

Thus, we have shown that using absolute activation instead of Tanh or ReLU improves the prediction accuracy, albeit at the cost of a more complex learning process.

In conclusion

More accurate network variants (with fewer free parameters) can be found on GitHub:

https://github.com/berng/LeNetImproving

and a description of some training details and comparison statistics can be found in the preprint:

https://arxiv.org/abs/2304.11758

In a nutshell: the number of coefficients in the LeNet network can be reduced by slightly modifying it, while simultaneously increasing the final accuracy on the test dataset – but only if you train more carefully, taking the distribution of the validation accuracy into account.

For example, like this:

import tensorflow as tf
from tensorflow.keras import datasets, layers, models, losses

def lenet(x_train_shape,deep=0,lr=1e-3):
  input = layers.Input(shape=x_train_shape)
  data = input
  data = layers.Conv2D(6, 5, padding='same', activation='linear')(data)
  data = tf.math.abs(data)
  data = layers.AveragePooling2D(2)(data)
  data = tf.math.abs(data)
  data = layers.Conv2D(16, 5, activation='linear')(data)
  data = tf.math.abs(data)
  data = layers.AveragePooling2D(2)(data)
  data = tf.math.abs(data)
  # extra conv/pool stage compared to the baseline model: it shrinks the flattened
  # feature vector and so drastically reduces the number of dense-layer parameters
  data = layers.Conv2D(120, 5, activation='linear')(data)
  data = tf.math.abs(data)
  data = layers.AveragePooling2D(2)(data)
  data = tf.math.abs(data)
  data = layers.Conv2D(120,1, activation='linear')(data)
  data = tf.math.abs(data)
  data = layers.Flatten()(data)
  data = layers.Dense(84, activation='linear')(data)
  data = tf.math.abs(data)
  data_out = layers.Dense(10, activation='softmax')(data)
  model = tf.keras.models.Model(inputs=input,outputs=data_out)
  return model

with the result:

loss,accuracy:

[0.029204312711954117, 0.9952999949455261]

[0.00018993842240888625, 1.0]

[0.04176783189177513, 0.9944166541099548]

That is 99.53% on the test dataset – 3 times fewer errors than reported on Wikipedia for an undistorted, non-augmented dataset without ensembles, and 2 times fewer than in the article about LeNet-5.

At the same time, this network has fewer than 77 thousand parameters, versus more than 360 thousand parameters in the original LeNet-5. It turned out to be not only more accurate, but also more compact.
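
The parameter counts are easy to verify directly (a quick check, assuming the modified model above is saved as lenet.py):

from lenet import lenet   # the Abs-based variant defined above

model = lenet((32, 32, 1))        # 32x32 grayscale input, as after padding
print(model.count_params())       # a bit under 77 thousand parameters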

Thus, if you replace the activation functions in a network with Abs and train it carefully, you can get higher accuracy, and if you are lucky, also significantly reduce the size of the neural network.

Good luck!

PS Thanks to Kandinsky 2.1 for the illustration.
