Python Killer and the Future of AI?

Hi everyone! My name is Vadim, I am a Data Scientist at Raft, and today we are going to dive into Mojo. I have already written an overview of this programming language, looked at its advantages and usage examples, and compared it with Python.

Now let's look at how to train a simple convolutional neural network, as well as one of the classic machine learning methods: linear regression. As example tasks, we'll take two standard machine learning competitions: predicting housing prices and classifying the MNIST handwritten digits. For the experiments in Python, we'll use the PyTorch machine learning framework, and on Mojo we'll use the Basalt machine learning framework.

A little bit about datasets

MNIST (Modified National Institute of Standards and Technology) is a dataset for the task of recognizing handwritten digits from 0 to 9. It consists of 70 thousand grayscale images with a resolution of 28×28: a white digit on a black background. The task is to recognize the digit depicted in the image.

Example MNIST data
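The exact data-loading code lives in the experiment repository; as a rough sketch, loading MNIST on the PyTorch side usually looks something like this (the batch size here is purely illustrative):

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.ToTensor()   # 28×28 grayscale images -> float tensors in [0, 1]
train_data = datasets.MNIST(root="data", train=True, download=True, transform=transform)
test_data = datasets.MNIST(root="data", train=False, download=True, transform=transform)

loaders = {
    "train": DataLoader(train_data, batch_size=64, shuffle=True),
    "test": DataLoader(test_data, batch_size=64, shuffle=False),
}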

The Housing Prices dataset is a dataset for predicting the cost of a house based on a set of features: for example, lot area, housing type, the presence of a garage, the number of rooms, and so on.

House Price Dataset Example
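Again as a sketch only: with pandas, the competition's tabular data can be turned into tensors roughly like this (the preprocessing is deliberately naive and may differ from what the repository does):

import pandas as pd
import torch

df = pd.read_csv("train.csv")                                   # Kaggle "House Prices" training file
features = df.select_dtypes("number").drop(columns=["SalePrice"]).fillna(0)
X = torch.tensor(features.values, dtype=torch.float32)          # numeric features only
y = torch.tensor(df["SalePrice"].values, dtype=torch.float32).unsqueeze(1)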

Diving into the code

Experiment on MNIST

To solve the handwritten digit classification problem, we will write a simple CNN (convolutional neural network) consisting of two parts:

  • feature map construction, implemented with two convolutional layers;

  • a classifier consisting of three fully connected layers.

The architecture is presented in more detail in Table 1.

Table 1. CNN architecture

| Layer  | Type            | Feature maps | Size  | Kernel size | Stride | Padding | Activation |
|--------|-----------------|--------------|-------|-------------|--------|---------|------------|
| Input  | Image           | 1            | 28×28 |             |        |         |            |
| 1      | Convolution     | 16           | 28×28 | 5×5         | 1      | 2       | ReLU       |
| 2      | Max pooling     | 16           | 14×14 | 2×2         | 2      | 0       |            |
| 3      | Convolution     | 32           | 14×14 | 5×5         | 1      | 2       | ReLU       |
| 4      | Max pooling     | 32           | 7×7   | 2×2         | 2      | 0       |            |
| 5      | Fully connected |              | 120   |             |        |         | ReLU       |
| 6      | Fully connected |              | 84    |             |        |         | ReLU       |
| Output | Fully connected |              | 10    |             |        |         |            |

Training hyperparameters:

The implementation of the network architecture differs slightly between Python and Mojo. In the first case, using PyTorch, we can define the architecture as a sequence of blocks.

import torch.nn as nn

class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        # feature extractor: two convolution + ReLU + max-pooling blocks
        self.block1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
        )
        self.block2 = nn.Sequential(
            nn.Conv2d(in_channels=16, out_channels=32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
        )
        # classifier: three fully connected layers
        self.fc1 = nn.Linear(in_features=32 * 7 * 7, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=84)
        self.out = nn.Linear(in_features=84, out_features=10)

    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = x.view(x.size(0), -1)  # flatten the feature maps into a vector
        x = nn.ReLU()(self.fc1(x))
        x = nn.ReLU()(self.fc2(x))
        return self.out(x)
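A quick sanity check, not part of the original code: push a dummy batch through the network and confirm that we get one logit per digit class.

import torch

cnn = CNN()
dummy = torch.randn(4, 1, 28, 28)   # a batch of 4 grayscale 28×28 images
print(cnn(dummy).shape)             # expected: torch.Size([4, 10])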

In the case of Mojo, we need to define a Graph structure, which implements the computation graph used for both the forward pass and backpropagation.

fn create_CNN(batch_size: Int) -> Graph:
    # initialize the graph and our input
    var g = Graph()
    var x = g.input(TensorShape(batch_size, 1, 28, 28))

    # initialize and apply the convolutional layers
    var conv1 = nn.Conv2d(g, x, out_channels=16, kernel_size=5, padding=2)
    var act_conv1 = nn.ReLU(g, conv1)
    var max_pool1 = nn.MaxPool2d(g, act_conv1, kernel_size=2)

    var conv2 = nn.Conv2d(g, max_pool1, out_channels=32, kernel_size=5, padding=2)
    var act_conv2 = nn.ReLU(g, conv2)
    var max_pool2 = nn.MaxPool2d(g, act_conv2, kernel_size=2)

    # flatten the output tensor into a vector
    var x_reshape = g.op(
        OP.RESHAPE,
        max_pool2,
        attributes=AttributeVector(
            Attribute(
                "shape",
                TensorShape(max_pool2.shape[0], max_pool2.shape[1] * max_pool2.shape[2] * max_pool2.shape[3]),
            )
        ),
    )

    # classify the extracted features with the fully connected layers
    var fc1 = nn.Linear(g, x_reshape, n_outputs=120)
    var act_fc1 = nn.ReLU(g, fc1)
    var fc2 = nn.Linear(g, act_fc1, n_outputs=84)
    var act_fc2 = nn.ReLU(g, fc2)
    var out = nn.Linear(g, act_fc2, n_outputs=10)
    g.out(out)

    # compute the loss using CrossEntropyLoss
    var y_true = g.input(TensorShape(batch_size, 10))
    var loss = nn.CrossEntropyLoss(g, out, y_true)
    g.loss(loss)

    return g

Initializing the model together with the optimizer, and its training loop, is quite standard for PyTorch: we iterate over the entire dataset a set number of times (epochs) in batches, extracting the features (images) and their labels. Then we predict the class label, compute the loss, and update the weights from the gradients.

from torch.autograd import Variable
import torch.optim as optim

cnn = CNN()
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(cnn.parameters(), lr=learning_rate)

cnn.train()
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(loaders["train"]):
        b_x = Variable(images)
        b_y = Variable(labels)
        output = cnn(b_x)
        loss = loss_func(output, b_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
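The evaluation code is not shown in this excerpt; a minimal accuracy check might look like this, assuming a "test" loader exists alongside the "train" one:

import torch

cnn.eval()
correct, total = 0, 0
with torch.no_grad():
    for images, labels in loaders["test"]:
        predictions = cnn(images).argmax(dim=1)   # index of the largest logit = predicted digit
        correct += (predictions == labels).sum().item()
        total += labels.size(0)
print(f"Test accuracy: {correct / total:.4f}")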

There are a few minor differences on Mojo:

  1. we need to define a main function in which the code will be executed;

  2. the model and the optimizer have to be defined through the graph structure;

  3. before feeding images to the network, the labels must be one-hot encoded (a short PyTorch illustration of this encoding is given after the training loop below).

Otherwise, the training process is similar to the PyTorch one, apart from the language's syntax.

fn main():
    alias graph = create_CNN(batch_size)
    var model = nn.Model[graph]()
    var optim = nn.optim.Adam[graph](Reference(model.parameters), lr=learning_rate)

    for epoch in range(num_epochs):
        var num_batches: Int = 0
        var epoch_loss: Float32 = 0.0
        for batch in training_loader:
            # one-hot encode the integer labels into a (batch_size, 10) tensor
            var labels_one_hot = Tensor[dtype](batch.labels.dim(0), 10)
            for bb in range(batch.labels.dim(0)):
                labels_one_hot[int((bb * 10 + batch.labels[bb]))] = 1.0

            var loss = model.forward(batch.data, labels_one_hot)
            optim.zero_grad()
            model.backward()
            optim.step()

            epoch_loss += loss[0]
            num_batches += 1
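The inner loop that fills labels_one_hot is just a manual one-hot encoding. For comparison, here is roughly the same step in PyTorch (an illustration, not code from the article):

import torch
import torch.nn.functional as F

labels = torch.tensor([3, 0, 7])                          # example integer class labels
labels_one_hot = F.one_hot(labels, num_classes=10).float()
# The Mojo loop above does the same thing by writing 1.0 at flat index bb * 10 + label
# in a zeroed (batch_size, 10) tensor.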

House price prediction

To solve this problem we will apply standard linear regression, implemented as a single fully connected layer.
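In other words, a single fully connected layer fits the model

\hat{y} = Xw + b, \qquad \text{MSE} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2

where the weights w and bias b are learned by minimizing the mean squared error (the MSELoss used explicitly in the Mojo graph below, and presumably on the PyTorch side as well).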

The training hyperparameters are as follows:

In Python, the code using PyTorch would look like this.

class LinearRegression(nn.Module):
    def __init__(self, input_dim):
        super(LinearRegression, self).__init__()
        self.linear = nn.Linear(in_features=input_dim, out_features=1)

    def forward(self, x):
        return self.linear(x)
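As with the CNN, a dummy forward pass (again, not from the article) confirms the output shape: one predicted price per sample.

import torch

model = LinearRegression(input_dim=10)   # e.g. 10 numeric features
dummy = torch.randn(4, 10)
print(model(dummy).shape)                # expected: torch.Size([4, 1])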

On Mojo, we again need to define a Graph structure with the layer and the loss function through which the computation will run.

fn linear_regression(batch_size: Int, n_inputs: Int, n_outputs: Int) -> Graph:
    var g = Graph()
    var x = g.input(TensorShape(batch_size, n_inputs))
    var y_true = g.input(TensorShape(batch_size, n_outputs))

    var y_pred = nn.Linear(g, x, n_outputs)
    g.out(y_pred)

    var loss = nn.MSELoss(g, y_pred, y_true)
    g.loss(loss)

    return g

The training loop is the same as the one shown for MNIST, except that there is no need for one-hot encoding, since the target values are used directly.
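The PyTorch training loop for the regression is not shown in this excerpt; a sketch of it, with placeholder data and illustrative hyperparameter values, might look like this:

import torch
import torch.nn as nn
import torch.optim as optim

X = torch.randn(1000, 10)    # placeholder features (real code would use the prepared dataset)
y = torch.randn(1000, 1)     # placeholder targets

model = LinearRegression(input_dim=X.shape[1])
loss_func = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)   # illustrative learning rate

model.train()
for epoch in range(100):                              # illustrative epoch count
    optimizer.zero_grad()
    loss = loss_func(model(X), y)                     # full-batch gradient step
    loss.backward()
    optimizer.step()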

Results

While training the convolutional network on MNIST and the linear regression on house price prediction, many experiments were run with different hyperparameter settings. The training times for the best configurations are presented in Table 2.

Table 2. Training time

| Language | MNIST    | House Price |
|----------|----------|-------------|
| Python   | 1.58 sec | 23.18 sec   |
| Mojo     | 4.89 sec | 0.15 sec    |

On the MNIST classification task, Python demonstrated the better performance, while Mojo showed weaker results, which can be explained by the lack of convolution optimizations in the current Basalt framework for Mojo.

For house price prediction, Python loses to Mojo on linear regression. Mojo substantially outperforms Python here, which confirms the high performance of the language, especially in tasks built on linear computations.
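The extracted text does not specify how the wall-clock times were measured; on the Python side, a straightforward (assumed) way is simply:

import time

start = time.perf_counter()
# ... run the training loop from the previous sections here ...
print(f"training time: {time.perf_counter() - start:.2f} sec")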

Conclusion

Mojo's potential shows in tasks where speed is important. At the moment it is not as good at working with neural networks as Python, since its ecosystem is still limited, but as frameworks mature and more optimizations land, this will likely improve.

What do you think? Write in the comments!

Links:

  1. Repository with experiment code

  2. Kaggle Competition on MNIST

  3. House Price Prediction Kaggle Competition

  4. PyTorch Documentation

  5. Mojo Documentation

  6. Article about convolutions
