Perceptron in numpy

I am of the opinion that if you want to understand something, you should implement it yourself. When I first started doing data science, I figured out how to compute gradients on a piece of paper, skipped the stage of implementing networks in numpy, and went straight to training them. However, when, after a long time, I finally decided to do it, I found that I couldn't: my dimensions didn't match.

After going through a lot of materials, I settled on the book Deep Learning from Scratch. Now that I have figured it out, I want to write my own tutorial.

To the question "Why another tutorial on a neural network in numpy?" I will answer:

  • The tutorial emphasizes the non-obvious places where dimensions may not match;

  • There are no abstractions in the code (layer classes) so as not to distract from the essence.

The code is available here. To practice computing gradients on paper, you can also watch this video from the "Deep Learning on Fingers" course.

In order to train a neural network, we need to understand the chain rule (the rule for differentiating a composite function). It describes how to take the derivative of a composition of functions. If we have an expression y = f(g(x)), then the derivative of y with respect to x is:

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}, \quad u = g(x)
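
As a quick sanity check of the rule, here is a tiny numeric example; the functions f(u) = u^2 and g(x) = sin(x) are my own choice, purely for illustration:

import numpy as np

# y = f(g(x)) with f(u) = u**2 and g(x) = sin(x)
x = 1.3
u = np.sin(x)             # u = g(x)
dy_du = 2 * u             # dy/du = 2u
du_dx = np.cos(x)         # du/dx = cos(x)
analytic = dy_du * du_dx  # chain rule: dy/dx = dy/du * du/dx

eps = 1e-6
numeric = (np.sin(x + eps) ** 2 - np.sin(x - eps) ** 2) / (2 * eps)
print(analytic, numeric)  # the two values agree to ~1e-10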

Since a neural network is a large composition of functions (layers), to compute the derivative (gradient) of the error we multiply together the derivatives of all the layers.

We will train a two-layer perceptron (two fully connected layers) with a sigmoid activation between them on a regression problem: predicting house prices on the California housing dataset.
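
The full code is in the linked repository; as a minimal sketch, the data could be loaded like this, assuming scikit-learn's fetch_california_housing (the loader and hidden_dim = 16 are my choices and may differ from the author's repo):

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

data = fetch_california_housing()
X = StandardScaler().fit_transform(data.data)  # (n_samples, in_dim), scaled features
Y = data.target.reshape(-1, 1)                 # (n_samples, 1), house prices

in_dim, hidden_dim, out_dim = X.shape[1], 16, 1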

w1 = np.random.randn(in_dim, hidden_dim)   # first-layer weights, (in_dim, hidden_dim)
b1 = np.zeros((1, hidden_dim))             # first-layer bias, (1, hidden_dim)
w2 = np.random.randn(hidden_dim, out_dim)  # second-layer weights, (hidden_dim, out_dim)
b2 = np.zeros((1, out_dim))                # second-layer bias, (1, out_dim)

The neural network can be represented as the following computational graph:

Computational network graph

y = B(C(D(E(F(X, w1), b1)), w2), b2)
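
Written out step by step, the forward pass looks like this (a sketch that follows the graph; averaging A^2 over the batch to get a scalar loss is my addition):

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

F = np.dot(X, w1)    # (bs, hidden_dim)
E = F + b1           # (bs, hidden_dim)
D = sigmoid(E)       # (bs, hidden_dim)
C = np.dot(D, w2)    # (bs, out_dim)
B = C + b2           # (bs, out_dim), this is Y_pred
A = Y - B            # (bs, out_dim), Y_true - Y_pred
L = np.mean(A ** 2)  # scalar error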

This is just a composition of functions, and we need to know the derivatives of the error function with respect to all the weights. They look like this:

\frac{dL}{db_2} = \frac{dL}{dA} \cdot \frac{dA}{dB} \cdot \frac{dB}{db_2}

\frac{dL}{dw_2} = \frac{dL}{dA} \cdot \frac{dA}{dB} \cdot \frac{dB}{dC} \cdot \frac{dC}{dw_2}

\frac{dL}{db_1} = \frac{dL}{dA} \cdot \frac{dA}{dB} \cdot \frac{dB}{dC} \cdot \frac{dC}{dD} \cdot \frac{dD}{dE} \cdot \frac{dE}{db_1}

\frac{dL}{dw_1} = \frac{dL}{dA} \cdot \frac{dA}{dB} \cdot \frac{dB}{dC} \cdot \frac{dC}{dD} \cdot \frac{dD}{dE} \cdot \frac{dE}{dF} \cdot \frac{dF}{dw_1}

Let’s calculate all intermediate derivatives:

L = A^2;  \frac{dL}{dA} = 2A

A = Y_{true} - Y_{pred};  \frac{dA}{dB} = -1

The following derivative can cause dimension problems. The one here is actually a matrix of ones with the same shape as C: np.ones_like(C).

B=C+b_2;  \frac{dB}{dC} = 1

Similarly np.ones_like(b2).

B=C+b_2;  \frac{dB}{db2} = 1

Here, too, you have to be careful. Since we take the derivative with respect to D, which stands on the left of the matrix product, we need to transpose w2 and put it on the right when multiplying it with the incoming gradient: np.dot(prev_grad, w2.T).

C=D@w2;  \frac{dC}{dD} = w2^T
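
A quick shape check of this step (toy sizes and a toy w2_demo matrix, purely my own illustration):

bs, hidden_dim, out_dim = 4, 3, 1
prev_grad = np.ones((bs, out_dim))             # incoming gradient, same shape as C
w2_demo = np.random.randn(hidden_dim, out_dim)
print(np.dot(prev_grad, w2_demo.T).shape)      # (4, 3): the shape of D, as required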

Similarly, we need to transpose D and put it on the left when multiplying it with the incoming gradient: np.dot(D.T, prev_grad).

C=D@w2;  \frac{dC}{dw2} = D^T

The sigmoid has a cool derivative.

D = sigmoid(E);  \frac{dD}{dE} = sigmoid(E) \cdot (1 - sigmoid(E))
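
A small numeric check of this identity (my own, not from the article). Note that because D = sigmoid(E) is already computed in the forward pass, the backward pass can simply reuse D * (1 - D), which is exactly what the code below does:

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

e = 0.7
eps = 1e-6
analytic = sigmoid(e) * (1 - sigmoid(e))
numeric = (sigmoid(e + eps) - sigmoid(e - eps)) / (2 * eps)
print(analytic, numeric)  # the two values coincide up to ~1e-11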

Here the ones matrix is np.ones_like(F).

E = F + b_1;  \frac{dE}{dF} = 1

Here np.ones_like(b1).

E = F + b_1;  \frac{dE}{db1} = 1

Here we need to transpose X and put it on the left when multiplying it with the incoming gradient: np.dot(X.T, prev_grad).

F=X@w_1;  \frac{dF}{dw_1} = X^T

In code it looks like this:

dLdA = 2 * A  # (bs, out_dim)
dAdB = -1  # scalar, broadcasts over (bs, out_dim)
dBdC = np.ones_like(C)  # (bs, out_dim)
dBdb2 = np.ones_like(self.B2)  # (1, out_dim)
dCdD = self.W2.T  # (out_dim, hidden_dim)
dCdw2 = D.T  # (hidden_dim, bs)
dDdE = D * (1 - D)  # (bs, hidden_dim)
dEdF = np.ones_like(F)  # (bs, hidden_dim)
dEdb1 = np.ones_like(self.B1)  # (1, hidden_dim)
dFdw1 = X.T  # (in_dim, bs)

dLdb2 = np.mean(dLdA * dAdB * dBdb2, axis=0, keepdims=True)  # (1, out_dim)
dLdw2 = np.dot(dCdw2, dLdA * dAdB * dBdC)  # (hidden_dim, out_dim)
dLdb1 = np.mean(
  np.dot(dLdA * dAdB * dBdC, dCdD) * dDdE * dEdb1, axis=0, keepdims=True
)  # (1, hidden_dim)
dLdw1 = np.dot(
  dFdw1, np.dot(dLdA * dAdB * dBdC, dCdD) * dDdE * dEdF
)  # (in_dim, hidden_dim)
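
As an optional sanity check (my addition, using the standalone names from the forward-pass sketch above rather than the self.* attributes, and assuming out_dim = 1 and that X, Y are the same batch the gradients were computed on), dLdb2 can be compared against a finite-difference estimate:

def loss_fn(b2_):
    pred = np.dot(sigmoid(np.dot(X, w1) + b1), w2) + b2_
    return np.mean((Y - pred) ** 2)

eps = 1e-6
b2_shifted = b2.copy()
b2_shifted[0, 0] += eps
numeric = (loss_fn(b2_shifted) - loss_fn(b2)) / eps
print(numeric, dLdb2[0, 0])  # the two numbers should nearly coincide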

It remains only to update the weights according to the computed gradients. Since the gradient points in the direction in which the error function grows, we subtract it; that is, we take a step in the direction that makes the error decrease.

b2 -= self.lr * dLdb2
w2 -= self.lr * dLdw2
b1 -= self.lr * dLdb1
w1 -= self.lr * dLdw1
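
Putting everything together, a minimal training loop could look like the sketch below. The batch size, learning rate, and epoch count are my own choices; I also divide the weight gradients by the batch size so that they are averaged the same way as the bias gradients (the code above sums them over the batch instead).

lr, epochs, bs = 0.05, 100, 64
n = X.shape[0]

for epoch in range(epochs):
    idx = np.random.permutation(n)
    for start in range(0, n, bs):
        xb = X[idx[start:start + bs]]
        yb = Y[idx[start:start + bs]]

        # forward pass
        F = np.dot(xb, w1)
        E = F + b1
        D = sigmoid(E)
        C = np.dot(D, w2)
        B = C + b2
        A = yb - B

        # backward pass: the same chain of derivatives as above
        dLdB = 2 * A * -1                        # dL/dA * dA/dB
        dLdb2 = np.mean(dLdB, axis=0, keepdims=True)
        dLdw2 = np.dot(D.T, dLdB) / xb.shape[0]
        dLdE = np.dot(dLdB, w2.T) * D * (1 - D)  # through w2 and the sigmoid
        dLdb1 = np.mean(dLdE, axis=0, keepdims=True)
        dLdw1 = np.dot(xb.T, dLdE) / xb.shape[0]

        # gradient step
        b2 -= lr * dLdb2
        w2 -= lr * dLdw2
        b1 -= lr * dLdb1
        w1 -= lr * dLdw1

    # full-dataset MSE after each epoch
    mse = np.mean((Y - (np.dot(sigmoid(np.dot(X, w1) + b1), w2) + b2)) ** 2)
    print(epoch, mse)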

As they say, "toss in a pile of firewood, and the perceptron is ready." I hope this tutorial helps those who ran into the same problem as I did.

I also have a Telegram channel where I write about neural networks with an emphasis on inference.
