A short guide to quantizing neural networks

We have written plenty of articles about optimizing your neural networks; today it is time to move on to chopping, shrinking and outright trimming of data, also known as quantization. The process itself is not conceptually complicated, but it does have pitfalls.

We literally reduce the bit width of the data, which cuts the computational resources needed and the amount of memory required to store the model.

An Nvidia card can use cheap 8-bit cores, for example, to compute convolution and matrix multiplication operations, and we get a cheap model to run. Of course, such trimming of floating-point numbers can also lead to a drop in accuracy. Sometimes a catastrophic one.

Different quantization methods have been developed, each with its own characteristics, approaches and applications.

They are divided along three axes: uniform vs. non-uniform quantization, symmetric vs. asymmetric quantization, and static vs. dynamic quantization. We won't go deeper here. The main thing is that quantization does not have to stop at 8 bits – 16-bit formats are also an option…

Where the data is mostly concentrated between -1 and 1, a value is likely to fall within that range, which makes the mapping manageable. The most important thing is that quantization is always an approximation, and it can cost you dearly – especially if you decide to shrink memory several times over and literally convert float32 to int8, that is, go from floating-point to integer values.
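
To make the approximation concrete, here is a minimal NumPy sketch of affine int8 quantization and the error it introduces; the names scale and zero_point are our own illustration, not tied to any particular framework:

import numpy as np

# Illustrative affine (asymmetric) int8 quantization of a small tensor
x = np.array([-0.92, -0.15, 0.03, 0.48, 0.99], dtype=np.float32)

qmin, qmax = -128, 127                       # representable int8 range
scale = (x.max() - x.min()) / (qmax - qmin)  # width of one quantization step
zero_point = int(round(qmin - x.min() / scale))

q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
x_hat = (q.astype(np.float32) - zero_point) * scale  # dequantized approximation

print(q)          # integer codes actually stored
print(x_hat - x)  # rounding error introduced by quantization

The error per value is bounded by half a quantization step, but across millions of weights those small errors can add up.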

If a naive conversion hurts accuracy too badly, static post-training quantization with calibration is one option; the alternative is to train the model on "quantized" data from the very start.

Two principles of quantization

Practical implementation starts, for example, with post-training quantization (PTQ), which is based on transforming a model that has already finished training on high-precision numbers, usually 32-bit floating point. This is the case where we expect our supermodel to survive such a cut in data precision.

Therefore, the main goal of PTQ is to minimize the consumption of memory and computing resources during the inference phase without the need to retrain the model.

In PTQ, model weights and, in some cases, activations are converted to smaller integer values, most often 8-bit integers (int8), which can significantly reduce the size of the model and speed up calculations through the use of SIMD (Single Instruction, Multiple Data) instructions at the hardware level.

SIMD operations process multiple data elements with a single instruction; this is what distinguishes them from traditional scalar operations.

In PTQ, there is no change in the neural network architecture and the quantization algorithm is executed separately from the training process.

The main steps involve converting the floating-point weights to int8 by calculating scales and zero points that preserve the range of values. This is done using statistical information collected from a small amount of training data.
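
As a hedged sketch of that calibration step (the observe_minmax helper and the random batches here are purely illustrative, not part of any framework's API):

import numpy as np

def observe_minmax(batches):
    # Collect a running min/max over a few calibration batches
    running_min, running_max = np.inf, -np.inf
    for batch in batches:
        running_min = min(running_min, float(batch.min()))
        running_max = max(running_max, float(batch.max()))
    return running_min, running_max

# A handful of batches stands in for "a small amount of training data"
calibration_batches = [np.random.randn(32, 16).astype(np.float32) for _ in range(8)]

xmin, xmax = observe_minmax(calibration_batches)
scale = (xmax - xmin) / 255.0                 # 256 int8 levels
zero_point = int(round(-128 - xmin / scale))
print(scale, zero_point)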

It is important that with PTQ, activations and weights can be quantized differently: activations can be quantized dynamically at the inference stage, depending on the input data, while weights are quantized statically based on a priori (pre-computed) statistics.

If you are working with a really deep network, accuracy problems may appear; the same applies to tasks that are highly sensitive to small numerical errors in the data.

Quantization-Aware Training (QAT) is more complicated – here quantization is taken into account already at the model training stage.

Unlike PTQ, in QAT the weights and activations of the model are represented in a low-bit format (int8 or int16) throughout the training process, which allows the model to adapt to limited numerical precision.

In QAT, quantized versions of the weights are not used directly; instead, the quantization process is emulated during the forward and backward passes of the model.

In the forward pass, weights and activations are modeled as quantized integers, which effectively emulates the inference process in a quantized environment. The backward pass, however, still works with floating-point weights, which preserves the accuracy of gradient descent and lets the model adjust itself to the limited precision.
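
A minimal PyTorch sketch of this trick, often called the straight-through estimator; the FakeQuant class below is our own illustration rather than the framework's built-in implementation:

import torch

class FakeQuant(torch.autograd.Function):
    # Round to an int8 grid in the forward pass, pass gradients through untouched

    @staticmethod
    def forward(ctx, x, scale):
        q = torch.clamp(torch.round(x / scale), -128, 127)
        return q * scale  # dequantized value seen by the rest of the network

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: act as if quantization were the identity
        return grad_output, None

w = torch.randn(4, 4, requires_grad=True)
w_q = FakeQuant.apply(w, torch.tensor(0.05))  # forward pass sees quantized weights
w_q.sum().backward()                          # backward pass updates the float weights
print(w.grad)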

During training, the model “learns” to compensate for errors caused by quantization, which reduces the loss of accuracy observed with PTQ. QAT requires significantly more computational resources during the training phase, since training must take into account the quantization of all intermediate activations and weights.

In this case, it is necessary to perform quantization simulation not only for the weights, but also for the input data at each layer, which increases the computational complexity of the model during the training stage.

To implement QAT, it is necessary to modify the standard layers of the neural network so that they support low-bit calculations, as well as correctly configure the error backpropagation mechanisms.

The use of QAT is often associated with tasks such as deployment on mobile devices, where computing resources and memory are limited. Therefore, such a quantization architecture is often used for CV tasks, when we install a camera with a microprocessor and wait for a detection miracle…

How does all this work in practice?

In TensorFlow, quantization is implemented through TensorFlow Lite – a lightweight version of TensorFlow designed specifically for deploying models on devices with limited resources, like a Raspberry Pi.

PTQ in TensorFlow Lite is performed with the post-training quantization workflow, where a model trained in floating point is automatically converted to a quantized version by setting converter.optimizations = [tf.lite.Optimize.DEFAULT].

By default this statically quantizes the weights; to quantize activations as well, a small set of calibration data is supplied so the converter can compute their scales and zero points. Example TensorFlow Lite code for PTQ might look like this:

import tensorflow as tf

# Load the trained model
model = tf.keras.models.load_model('model.h5')

# Convert the model using PTQ
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Save the quantized model
with open('model_quantized.tflite', 'wb') as f:
    f.write(tflite_model)

For the more complex QAT scenario, TensorFlow provides built-in quantization functionality that takes quantized representations of weights and activations into account during the training process.

Everything goes through tf.quantization.fake_quant_with_min_max_vars, which simulates quantization in the forward pass while still letting gradients flow in the backward pass.
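
A small illustration of the op itself; the min/max range here is arbitrary, and in real QAT these bounds are tracked per layer (typically via the tensorflow_model_optimization toolkit):

import tensorflow as tf

x = tf.constant([-1.2, -0.3, 0.0, 0.7, 1.5], dtype=tf.float32)

# Simulate 8-bit quantization of x into the range [-1, 1] in the forward pass
x_fq = tf.quantization.fake_quant_with_min_max_vars(x, min=-1.0, max=1.0, num_bits=8)

print(x_fq)  # values snapped to one of 256 levels and clipped to [-1, 1]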

However, this requires more detailed network configuration and specific changes during the training process.

In PyTorch, quantization is supported through the torch.quantization package, which allows both post-training quantization and QAT.

PyTorch takes a modular approach, allowing developers to choose between symmetric and asymmetric quantization and to implement both dynamic activation quantization and static weight quantization.

For PTQ, PyTorch's workflow consists of preparing the model with torch.quantization.prepare() and then converting it with torch.quantization.convert().

Example code for PTQ in PyTorch:

import torch
import torch.quantization

# Load the trained model (MyPretrainedModel is a placeholder for your own network)
model = MyPretrainedModel()

# Prepare the model for static quantization
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)

# Calibration: run a few batches through the prepared model so the observers
# can collect activation statistics (input_data stands in for calibration data)
model(input_data)

# Apply PTQ: replace observed modules with quantized ones
torch.quantization.convert(model, inplace=True)

# Test the quantized model
output = model(input_data)

QAT in PyTorch uses a similar structure, but with the addition of a quantization-aware training procedure.

The model is first prepared via torch.quantization.prepare_qat(), and then training continues, where quantization simulation occurs.

It is important to note that during training the model works with floating point weights, but during the inference stage it is converted to int8.
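
A hedged sketch of that flow; MyModel and train_loader are placeholders, and the eager-mode torch.quantization API is shown (newer PyTorch versions expose the same functions under torch.ao.quantization):

import torch
import torch.quantization

model = MyModel()   # placeholder for your own float32 network
model.train()

# Attach a QAT config and insert fake-quantization modules for weights and activations
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

for data, labels in train_loader:   # quantization is simulated during these passes
    optimizer.zero_grad()
    loss = criterion(model(data), labels)
    loss.backward()
    optimizer.step()

# After training, convert the fake-quantized model into a real int8 model
model.eval()
quantized_model = torch.quantization.convert(model)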

ONNX (Open Neural Network Exchange) is an open format for representing deep learning models that makes models portable between different frameworks.

For quantization, ONNX uses onnxruntime, which supports both static and dynamic quantization.

Static quantization in ONNX works through calibration data, which is used to compute the quantization parameters before inference begins, while dynamic quantization quantizes only the weights ahead of time and handles activations on the fly, which simplifies the process since no calibration step is required.

An example of quantizing a model in ONNX might look like this:

import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

model_fp32 = 'model.onnx'
model_quant = 'model_quant.onnx'
quantize_dynamic(model_fp32, model_quant, weight_type=QuantType.QInt8)
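
For static quantization, onnxruntime additionally expects a calibration data reader. A rough sketch, where MyCalibrationReader is a hypothetical class you would implement yourself and calibration_batches is assumed to be a list of NumPy arrays shaped like the model's input:

from onnxruntime.quantization import CalibrationDataReader, quantize_static

class MyCalibrationReader(CalibrationDataReader):
    # Feeds a few pre-collected input batches to the calibrator
    def __init__(self, batches, input_name):
        self._iter = iter([{input_name: b} for b in batches])

    def get_next(self):
        return next(self._iter, None)

reader = MyCalibrationReader(calibration_batches, input_name='input')
quantize_static('model.onnx', 'model_static_quant.onnx', reader)

The name 'input' must match the actual input tensor name of your ONNX graph.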

Quantization with distillation and pruning

Often the quantization process is carried out in conjunction with other optimization methods, such as pruning or distillation.

Pruning is a method in which unnecessary or inactive neurons and connections are removed from the network without significantly affecting its performance.

Pruning can be performed based on various criteria, such as weight magnitude pruning, where weights with the smallest absolute values are removed, or sensitivity analysis pruning, where the contribution of each neuron to the overall error of the model is estimated.
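
A toy illustration of magnitude pruning with a plain threshold (the cut-off value is arbitrary here; real implementations usually target a sparsity level instead):

import torch

w = torch.randn(4, 4)          # a layer's weight matrix
threshold = 0.5                # arbitrary cut-off for the illustration
mask = w.abs() >= threshold    # keep only weights with sufficient magnitude
w_pruned = w * mask

print(f"sparsity: {(~mask).float().mean().item():.0%}")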

In combination with quantization, pruning can significantly reduce the number of calculations, since after thinning only the active neurons remain, and quantization is then applied to them.

For example, after pruning, the model can be converted to a quantized version with fewer parameters, further reducing its computational complexity.

A practical implementation of pruning followed by quantization in TensorFlow might look like this:

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Load the trained model
model = tf.keras.models.load_model('model.h5')

# Wrap the model with low-magnitude pruning
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude
pruned_model = prune_low_magnitude(model)

# Fine-tune the pruned model; the UpdatePruningStep callback is required
# so the pruning wrappers can update their masks during training
pruned_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
pruned_model.fit(train_data, train_labels, epochs=2,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before conversion
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)

# Post-training quantization of the pruned model
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quantized_model = converter.convert()

# Save the pruned and quantized model
with open('model_pruned_quantized.tflite', 'wb') as f:
    f.write(tflite_quantized_model)

Let's go through it in order.

First we load the model using tf.keras.models.load_model('model.h5').

This model can be a pre-trained neural network, for example, for an image classification or speech recognition task.

To thin out the model, the prune_low_magnitude method is used, which is part of the tensorflow_model_optimization library.

The method removes connections (weights) whose values are close to zero, thereby reducing the size of the model and the amount of computation it requires.

As a result, a pruned_model version of the model is created, in which some of the parameters are zeroed out. This helps reduce model complexity and speed up execution without a significant loss of accuracy.

After applying the thinning method, the model is compiled and retrained on training data using the fit() method. This is necessary for the model to adapt to the changed structure, where some of the neural connections have been removed.

After training is completed, the model undergoes post-training quantization using TFLiteConverter.

This process involves converting model weights from a 32-bit representation (FP32) to an 8-bit integer representation (int8), which significantly reduces the memory footprint of the model and speeds up inference.

In this case, the optimizations specified via converter.optimizations = [tf.lite.Optimize.DEFAULT] are used. The model is then saved in TFLite format, making it easy to deploy on devices with limited computing resources, such as microcontrollers and mobile devices.

The principle of distillation is that the “larger” model (teacher model) trains the “smaller” model (student model), transferring its knowledge to it in the form of predictions.

During the distillation process, the teacher model generates probability distributions of classes, which are then used to train the student model.

These distributions, also called soft labels, contain more complete information than hard labels because they reflect the teacher model's confidence in each class.

Let's go straight to an example of distillation combined with quantization.

An example distillation process in PyTorch might look like this:

import torch
import torch.nn.functional as F

def distillation_loss(student_output, teacher_output, labels, T, alpha):
    # Soft loss: KL divergence between temperature-softened distributions
    soft_loss = F.kl_div(F.log_softmax(student_output / T, dim=1),
                         F.softmax(teacher_output / T, dim=1),
                         reduction='batchmean') * (T * T)
    # Hard loss: ordinary cross-entropy against the ground-truth labels
    hard_loss = F.cross_entropy(student_output, labels)
    return soft_loss * alpha + hard_loss * (1. - alpha)

# Training loop (student_model, teacher_model, optimizer and train_loader
# are assumed to be defined elsewhere)
for data, labels in train_loader:
    optimizer.zero_grad()
    student_output = student_model(data)
    with torch.no_grad():
        teacher_output = teacher_model(data)
    loss = distillation_loss(student_output, teacher_output, labels, T=4.0, alpha=0.7)
    loss.backward()
    optimizer.step()

The distillation_loss function combines two components:

  1. soft loss, which is calculated from the prediction distributions of the teacher and student models softened by the temperature T.

This conveys more granular information about class probabilities to the student model than the correct class alone (hard labels), making the learning process more informative.

  2. hard loss, a standard cross-entropy function that measures the distance between the student model's predictions and the actual class labels.

The combination of these two components (depending on the value of the parameter α) allows the student model to learn better from the predictions of the teacher model.

The student model's training loop in this code works like this:

For each piece of data (mini-batch), inference is performed on both the student model and the teacher model.

The predictions of the teacher model are passed to the loss function, where they are used to calculate the soft loss.

The loss function combines this with traditional cross-entropy (hard loss), training the student model more efficiently.

An optimization step is then performed, and the student model updates its weights based on the resulting combined loss.

After training a student model using the distillation method, it can be quantized using standard methods such as dynamic or static quantization, further reducing its size and resource consumption.
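
For instance, a distilled student built mainly from nn.Linear layers could be dynamically quantized in one call; the layer set and dtype below are just one reasonable choice:

import torch

quantized_student = torch.quantization.quantize_dynamic(
    student_model,          # the trained student from the loop above
    {torch.nn.Linear},      # which module types to quantize
    dtype=torch.qint8)      # store their weights as 8-bit integers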

Integrating all three methods—quantization, pruning, and distillation—represents a powerful approach to model compression.

In real-world scenarios such as mobile devices or embedded systems, this can achieve significant improvements in model execution speed and energy efficiency.

For example, if a model is first thinned to remove unimportant relationships, then distilled to create a lightweight version, and finally quantized, significant reductions in computational costs can be achieved without losing critical accuracy.

This was a short guide to quantization. We hope it was useful for some, especially beginners.
