Neural ODE

Lorenz attractor

This article describes the idea of neural ordinary differential equations (Neural ODEs), an approach in deep learning that combines numerical methods for solving differential equations with neural networks. Neural ODEs make it possible to model continuous changes in hidden states, which opens up new possibilities for time series analysis, signal processing, and dynamical systems.

Taking this opportunity, I will leave a link to my channel – notmagicneuralnetworks

1. Background

1.1. ResNet

Early neural network architectures consisted primarily of linear layers and activation functions. Even in networks that were not very deep, vanishing gradients were a serious problem: as gradients are propagated back through many layers, they can gradually shrink to negligible values. As a result, the weights in the early layers are barely updated and do not learn effectively, which slows down training or makes it impossible.

Example

For example, suppose a linear layer is followed by a sigmoid activation function

\sigma(x) = \frac{1}{1 + e^{-x}}

During backpropagation we need the derivative of this activation function,

\sigma'(x) = \sigma(x)(1 - \sigma(x))

The figure below shows the sigmoid function and its derivative.

Sigmoid function and its derivative

When computing the gradient, it has to pass through the derivative of the activation function, whose maximum value is only 0.25. If there are several such layers, the gradients at the early layers can end up being almost zero, so the weights of the first layers are updated very slowly. This is called the vanishing gradients problem.
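As a minimal numerical sketch (considering only the sigmoid factor in the best case and ignoring the weight matrices), here is how quickly repeated factors of 0.25 shrink the gradient as the number of layers grows:

import numpy as np

def sigmoid(x):
  return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
  s = sigmoid(x)
  return s * (1 - s)

# The derivative reaches its maximum of 0.25 at x = 0
print(sigmoid_derivative(0.0))         # 0.25

# Best case: each layer multiplies the gradient by at most 0.25
for n_layers in [5, 10, 20]:
  print(n_layers, 0.25 ** n_layers)    # 0.00098, 9.5e-07, 9.1e-13

Even in this best case, twenty layers shrink the gradient by about twelve orders of magnitude.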

This problem has been tackled in different ways: by using other activation functions, by special training schemes in VGG, and by adding auxiliary loss functions in GoogLeNet. You can read a little more about this here.

In 2015, researchers from Microsoft proposed an architecture called the Residual Neural Network (ResNet). It is built from residual blocks (residual connections, or skip connections), in which the input data bypasses the block's layers unchanged and is added to the block's output:

Residual Block in recurrent notation:

x_t = x_{t-1} + f(x_{t-1})

where x_{t-1} is the data from the previous layer, f is a neural network layer (block), and x_t is the current data.

def f(x, t, theta):
  # One residual block: a small neural network nnet with the parameters for layer t
  return nnet(x, theta[t])

def resnet(x, theta):
  # Apply T residual blocks: x_t = x_{t-1} + f(x_{t-1}, t, theta)
  # (nnet and T are assumed to be defined elsewhere)
  for t in range(T):
    x = x + f(x, t, theta)
  return x

By passing data around the layers in this way, ResNet largely avoids the vanishing gradient problem.

When the gradient of the loss function passes through a residual block, it is equal to

\frac{dL}{dx} = \frac{dL}{d\varphi} \frac{d\varphi}{dx} = \frac{dL}{d\varphi} \left(1 + f'(x) \right)

where \varphi = x + f(x). Thanks to the additive 1, the gradient does not fade as it flows back through the blocks.
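A small numerical check (with a toy block f(x) = \tanh(wx), chosen here purely for illustration) confirms that the derivative of \varphi = x + f(x) picks up this extra 1:

import numpy as np

w = 0.5
f = lambda x: np.tanh(w * x)                 # toy residual block
f_prime = lambda x: w * (1 - np.tanh(w * x) ** 2)

def phi(x):
  # Residual block: phi = x + f(x)
  return x + f(x)

x = 1.3
analytic = 1 + f_prime(x)                    # 1 + f'(x) from the formula above

eps = 1e-6                                   # central finite-difference check
numeric = (phi(x + eps) - phi(x - eps)) / (2 * eps)

print(analytic, numeric)                     # both are about 1.34

Even when f'(x) itself is small, the factor stays close to 1, so the gradient can pass through many blocks without fading.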

1.2. Euler method

The recurrent notation of ResNet is very similar to Euler's method, a numerical method for solving ordinary differential equations (ODE) with given initial conditions.

Let a differential equation with an initial condition be given:

\frac{dz(t)}{dt} = f(z(t), t, \theta), \qquad z(t_0) = z_0

Euler's method approximates its solution by taking small steps of size h:

z(t + h) = z(t) + h \, f(z(t), t, \theta)

With a step size of h = 1, this update coincides with the ResNet recurrence from the previous section.
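The same update written as a small code sketch in the spirit of the resnet pseudocode above (the name euler_solve and its arguments are placeholders, not a fixed API):

def euler_solve(f, z0, t0, t1, h, theta):
  # Integrate dz/dt = f(z, t, theta) from t0 to t1 with step size h
  z, t = z0, t0
  while t < t1:
    z = z + h * f(z, t, theta)   # with h = 1 this is exactly the ResNet update
    t = t + h
  return z

Smaller steps h give a more accurate solution of the ODE, at the cost of more evaluations of f.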

We want the true value z(t_1) to match the neural network prediction \hat z(t_1).

Let's introduce the loss function, for example the squared error between the true value and the prediction:

L = \left\| z(t_1) - \hat z(t_1) \right\|^2
