TinyML. Compressing the neural network

Programmers now face a difficult task: how do you fit something as cumbersome as a neural network into, say, a fitness bracelet? How do you optimize the model's power consumption? What is the price of such optimizations, when is it justified to push models onto small devices, and why is it sometimes impossible to do without them?

So what is it good for?

Let's imagine an expensive industrial sensor: 1,000 measurements per second, temperature and vibration sensing, data transmission over 10 km, a powerful processor doing 20 million operations per second! Its job is to send temperature, vibration, and other readings to a server to prevent equipment breakdowns. But here's the problem: 99% of the data it sends is useless, a net waste of electricity. And a plant can have tens or hundreds of such sensors.

In reality, we are not interested in the raw data from this device, but in the insights it carries: is everything working as usual? Are there any emergencies? Will repairs be needed soon? So why not deploy the neural network on the sensor itself and, instead of an endless stream of data, only occasionally send a signal like "Everything is fine" or "Performance anomaly!"? This is exactly what TinyML is about.

Do you see the very strange peak in the middle? Even a person monitoring the instrument readings could easily miss it, but an ML model will catch it without blinking.

Everything revolves around squeezing the model as much as possible so that it fits into a small device. Almost anything can serve as the "device": a kettle, an industrial sensor, an iron, a phone, a fitness bracelet, and so on.

Unlike cloud-based AI, embedded ("built-in") AI runs on the device itself

Advantages of the approach

First, resource savings. Since constant communication with a server is not required, the device saves a lot of energy: it can do without a permanent WiFi or Bluetooth connection.

Second, speed. Transferring data to a server takes too long when the result is needed right here and now.

Third, savings on cloud computing. In the cloud approach, data has to be sent to the server not only to train the model but also to get predictions. Imagine a face-swapping app on your phone that constantly required an internet connection, the way a navigation app does. Very inconvenient and costly. That is exactly why such features are already built into our phones (and this is TinyML at work).

Fourth, security. Sending data anywhere is always a risk; it is much safer to compute the result on the device and send only the prediction.

Fifth, the neural networks themselves run faster, because working with int inside a network is faster than working with float. More on this below.

Quantization

The trick that helps shove a fat network into a slender sensor is quantization. The essence of the method is simple: reduce the space numbers occupy in memory. Conventional networks use a data type such as the hefty 32-bit float. What happens if we replace it with a lean 8-bit int? The weights will take up four times less space, but some quality will be lost too.
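To make the idea concrete, here is a minimal sketch of asymmetric int8 quantization in plain NumPy. It only illustrates the scale/zero-point mapping; it is not the exact code TensorFlow Lite runs internally, and the sample weights are made up.

import numpy as np

# Hypothetical layer weights in float32
w = np.array([-1.2, -0.3, 0.0, 0.7, 1.5], dtype=np.float32)

# Pick a scale and zero point so [w.min(), w.max()] maps onto the int8 range
qmin, qmax = -128, 127
scale = (w.max() - w.min()) / (qmax - qmin)
zero_point = int(round(qmin - w.min() / scale))

# Quantize: float32 -> int8
q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)

# Dequantize: int8 -> approximate float32
w_restored = (q.astype(np.float32) - zero_point) * scale

print(q)           # [-128  -43  -15   51  127]
print(w_restored)  # close to w, but not bit-exact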

An unattainable dream is to use just 1 bit. That would give a gigantic gain in size. It's a pity it's impossible.

Or is it? Binarized convolutional neural networks at your service. You can read more about them here.

Matrix multiplication in this case looks a little different
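Why the multiplication changes: when both weights and activations are constrained to ±1 and packed into bits, a dot product reduces to an XNOR plus a popcount. A tiny NumPy sketch of the equivalence (illustrative only, not code from any particular binarized-network library):

import numpy as np

# Binarized activations and weights, values in {-1, +1}
a = np.array([ 1, -1,  1,  1, -1, -1,  1,  1])
w = np.array([ 1,  1, -1,  1, -1,  1, -1,  1])

# Ordinary dot product
dot = int(np.dot(a, w))

# Bit trick: encode +1 as 1 and -1 as 0, then
# dot = n - 2 * popcount(a_bits XOR w_bits)
a_bits = (a > 0).astype(np.uint8)
w_bits = (w > 0).astype(np.uint8)
dot_bits = len(a) - 2 * int(np.count_nonzero(a_bits ^ w_bits))

print(dot, dot_bits)  # the two values are equal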

From theory to practice

And now, to liven things up, a little code. Let's build the simplest possible model that predicts the sine of a number.

import math

import numpy as np
import matplotlib.pyplot as plt

# Generate data for our network
x_values = np.random.uniform(
    low=0, high=2*math.pi, size=1000).astype(np.float32)

# Shuffle it
np.random.shuffle(x_values)

y_values = np.sin(x_values).astype(np.float32)

# Add some noise to make it "like real life"
y_values += 0.1 * np.random.randn(*y_values.shape)

plt.plot(x_values, y_values, 'b.')
plt.show()

The data is ready. Now we split it into training, validation, and test sets (I will skip this and a few other parts of the code to save your time; a complete notebook with the code is here). A possible version of the split is sketched below.
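For completeness, a minimal sketch of the skipped split. The 60/20/20 proportions are an assumption, not necessarily what the original notebook uses; the names x_train, x_validate, x_test are the ones the code below expects:

# 60% for training, 20% for validation, 20% for testing
TRAIN_SPLIT = int(0.6 * len(x_values))
TEST_SPLIT = int(0.2 * len(x_values)) + TRAIN_SPLIT

x_train, x_validate, x_test = np.split(x_values, [TRAIN_SPLIT, TEST_SPLIT])
y_train, y_validate, y_test = np.split(y_values, [TRAIN_SPLIT, TEST_SPLIT])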

Time to build our network

import tensorflow as tf
from tensorflow import keras

# Create the neural network
model = tf.keras.Sequential()

model.add(keras.layers.Dense(16, activation='relu', input_shape=(1,)))

model.add(keras.layers.Dense(16, activation='relu'))

model.add(keras.layers.Dense(1))

model.compile(optimizer="adam", loss="mse", metrics=["mae"])

# Train the model
history = model.fit(x_train, y_train, epochs=500, batch_size=64,
                    validation_data=(x_validate, y_validate))

# Save it (MODEL_TF is a save path defined in the skipped part of the notebook)
model.save(MODEL_TF)

Now let's see what came out (one way to eyeball the fit is sketched below)
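A quick way to visualize the quality (a sketch, assuming the x_test / y_test arrays from the split above; the original notebook may plot this differently):

# Compare the model's predictions with the noisy test data
y_test_pred = model.predict(x_test)

plt.clf()
plt.plot(x_test, y_test, 'b.', label='Actual values')
plt.plot(x_test, y_test_pred, 'r.', label='Predicted')
plt.legend()
plt.show()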

The resulting quality is fine for our experiments.

So, the basic model is ready. Time to "squeeze" it in various ways. TFLiteConverter, created specifically to slim down your networks, will help us with this.

# Convert our network to TensorFlow Lite format WITHOUT quantization
converter = tf.lite.TFLiteConverter.from_saved_model(MODEL_TF)
model_no_quant_tflite = converter.convert()

# Save it
open(MODEL_NO_QUANT_TFLITE, "wb").write(model_no_quant_tflite)

# Convert our network to TensorFlow Lite format WITH quantization
def representative_dataset():
  for i in range(500):
    yield([x_train[i].reshape(1, 1)])

converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Set the options that make the converter turn everything into int8
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# Provide a representative dataset so the quantization ranges are calibrated correctly
converter.representative_dataset = representative_dataset
model_tflite = converter.convert()

open(MODEL_TFLITE, "wb").write(model_tflite)

In total, we now have three models: the regular one, a TensorFlow Lite conversion without quantization, and a TensorFlow Lite conversion with quantization. Time to compare how much space they take up (one way to obtain the size_* variables used below is sketched right after this paragraph).
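The size_* variables are not defined in the snippets above; a plausible way to obtain them (an assumption about how the original notebook does it) is simply to measure the saved artifacts on disk:

import os

# The SavedModel is a directory, so sum up the files inside it
size_tf = sum(
    os.path.getsize(os.path.join(root, f))
    for root, _, files in os.walk(MODEL_TF) for f in files)

# The .tflite models are single files
size_no_quant_tflite = os.path.getsize(MODEL_NO_QUANT_TFLITE)
size_tflite = os.path.getsize(MODEL_TFLITE)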

import pandas as pd

pd.DataFrame.from_records(
    [["TensorFlow", f"{size_tf} bytes", ""],
     ["TensorFlow Lite", f"{size_no_quant_tflite} bytes", f"(reduced by {size_tf - size_no_quant_tflite} bytes)"],
     ["TensorFlow Lite Quantized", f"{size_tflite} bytes", f"(reduced by {size_no_quant_tflite - size_tflite} bytes)"]],
    columns=["Model", "Size", ""], index="Model")

So, compared to the original model, converting to TensorFlow Lite gives roughly a 32% reduction in size, and quantizing gives as much as 40%! An impressive result, but at what cost did we achieve it?

The loss of quality is almost negligible, within the noise. But keep in mind that we are testing on a very simple model; on larger models the result may not be so optimistic! One way to check the quantized model's quality from Python is sketched below.
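A sketch of how the quantized model can be evaluated from Python with the TFLite interpreter. The scale and zero point come from the model's input and output tensors; x_test / y_test are assumed from the split above:

# Run the quantized model on the test set and measure the error
interpreter = tf.lite.Interpreter(model_content=model_tflite)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]
input_scale, input_zero_point = input_details["quantization"]
output_scale, output_zero_point = output_details["quantization"]

y_pred_quant = []
for x in x_test:
    # Quantize the input, run inference, dequantize the output
    x_q = np.int8(np.round(x / input_scale + input_zero_point))
    interpreter.set_tensor(input_details["index"], np.array([[x_q]], dtype=np.int8))
    interpreter.invoke()
    y_q = interpreter.get_tensor(output_details["index"])[0][0]
    y_pred_quant.append((float(y_q) - output_zero_point) * output_scale)

print(np.mean(np.abs(np.array(y_pred_quant) - y_test)))  # mean absolute error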

Conversion to C code

We did all the previous steps in Python in a notebook, but what we really care about is how to deploy the model onto a microcontroller, right? For that, the resulting model has to be converted into code that microcontrollers are used to working with: a plain C array. IMPORTANT: I ran the code below on Ubuntu; if you want to do this on Windows, you will have to look for workarounds.

# Install xxd
!apt-get update && apt-get -qq install xxd
# Now convert the model
!xxd -i {MODEL_TFLITE} > {MODEL_TFLITE_MICRO}
# Rename the variable inside the generated file
REPLACE_TEXT = MODEL_TFLITE.replace('/', '_').replace('.', '_')
!sed -i 's/'{REPLACE_TEXT}'/g_model/g' {MODEL_TFLITE_MICRO}
# Let's take a look at what the model converted to C code looks like
!cat {MODEL_TFLITE_MICRO}

I only inserted the first few lines of the converted model, but in general there are about 400 of them.

Microcontroller code

We have a model; now let's see what our C code looks like on an actual microcontroller. Before deploying to the microcontroller, the code can (and should) first be run on a computer. I'll say right away that I pulled out only the most interesting pieces of code; if you want to run it all on your machine, look here.


// Define the input value and the expected output value

  float x = 0.0f;
  float y_true = sin(x);

  // Logging
  tflite::MicroErrorReporter micro_error_reporter;

  // Load the model we saved earlier
  const tflite::Model* model = ::tflite::GetModel(g_model);

Next, we allocate working memory on our device. How much should we take? The question is solved head-on: pick some reasonable number; if the model fits and everything works, try reducing it, and keep going until the system stops working, then step back.

And finally, this is what the sine prediction looks like on the microcontroller.

  x = 5.f;
  y_true = sin(x);
  // Quantize the float input to int8 using the input tensor's scale and zero point
  input->data.int8[0] = x / input_scale + input_zero_point;
  // Run inference
  interpreter.Invoke();
  // Dequantize the int8 output back to float
  y_pred = (output->data.int8[0] - output_zero_point) * output_scale;
  // Check that the prediction is close enough to the true value
  TF_LITE_MICRO_EXPECT_NEAR(y_true, y_pred, epsilon);

Edge Impulse

We should also mention the Edge Impulse platform from the folks who are deeply involved with TinyML.

It takes on a lot of the work of deploying models directly to microcontrollers: just connect some kind of Arduino to your computer and roll a model onto it in a couple of clicks. I haven't used it myself, and I doubt anything very serious can be built on top of it, but for those who want to play around a little, it's definitely worth a look.

Well, instead of a conclusion: TinyML is gaining momentum. In some areas (bracelets that monitor people with heart disease, detecting tongue cancer from a photo with a built-in neural network, and so on) it simply has no alternative. The number of such devices is predicted to grow by about 20% per year, which means we will be hearing about this technology more and more often.

If you want to know more on the topic, then join our NoML Community – https://t.me/noml_community
