Quantization allows you to run Llama 3.2 on mobile phones
Quantization made it possible to bring the latest Llama 3.2 LLMs to mobile platforms – iOS and Android. For this purpose, the developers released quantized versions of Llama 3.2 1B and 3B which, when tested on ARM processors, showed much faster inference than the uncompressed weights in BF16 format.
How is it that Llama runs on mobile processors at all? After all, running it normally requires a particular software stack – most often PyTorch and CUDA on a Linux operating system.
The point is that Meta* (recognized as an extremist organization in Russia) uses ExecuTorch, a framework that is part of the PyTorch platform and is designed to run PyTorch programs on mobile devices. ExecuTorch is supported by the Llama Stack framework for running Llama models – namely the lightweight Llama 3.2 1B and 3B – on iOS and Android. To develop mobile applications for these platforms, Llama Stack provides client SDKs in Swift for iOS and Kotlin for Android, both built on the ExecuTorch backend.
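To give a feel for how a PyTorch model ends up on a phone, here is a minimal sketch of the general ExecuTorch export flow. It is not the actual Llama 3.2 export script – the module, shapes, and file name are placeholders, and the exact API may differ between ExecuTorch releases – but the shape of the pipeline is the same: capture the graph, lower it, and serialize a `.pte` file that the mobile runtime loads.

```python
# Sketch of the ExecuTorch export flow (placeholder model, not Llama itself):
# capture a PyTorch module with torch.export, lower it to an ExecuTorch
# program, and save a .pte file for the iOS/Android runtimes.
import torch
from executorch.exir import to_edge


class TinyModel(torch.nn.Module):
    """Stand-in for a real model; the actual Llama export is more involved."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(128, 128)

    def forward(self, x):
        return torch.relu(self.linear(x))


model = TinyModel().eval()
example_inputs = (torch.randn(1, 128),)

# 1. Capture the computation graph.
exported_program = torch.export.export(model, example_inputs)

# 2. Lower to the Edge dialect, then to an ExecuTorch program.
edge_program = to_edge(exported_program)
et_program = edge_program.to_executorch()

# 3. Serialize to a .pte file consumed by the mobile runtime.
with open("tiny_model.pte", "wb") as f:
    f.write(et_program.buffer)
```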
So what level of performance can the new quantized Llama models actually achieve?
On average, inference is two to four times faster than with weights in BF16 format, while quality remains almost on par. Model size shrinks by 56% – important for a mobile application, so that it takes up less space on the phone – and memory consumption drops by 41%. All of this is according to the benchmark results published on the Llama website.
It’s worth noting an important detail right away: this is not ordinary post-training quantization, where you take the weights in FP16 and quantize them to GGUF or GPTQ. While such weights certainly have practical uses for a variety of tasks, they suffer from a drop in quality, which is clearly visible in the benchmarks below.
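For context, here is a toy round-to-nearest 4-bit quantizer – not GGUF or GPTQ themselves, which are considerably smarter about grouping and error compensation, but enough to show where the precision loss in "ordinary" post-training quantization comes from. All names and group sizes here are illustrative.

```python
# Toy symmetric per-group round-to-nearest 4-bit quantization of a weight
# tensor; the reconstruction error it introduces is what shows up as a
# quality drop in benchmarks.
import torch


def quantize_rtn_4bit(w: torch.Tensor, group_size: int = 64):
    """Quantize to 4-bit integers with one scale per group of weights."""
    w_groups = w.reshape(-1, group_size)
    # Map the largest magnitude in each group to the int4 range [-8, 7].
    scales = w_groups.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w_groups / scales), -8, 7)
    return q.to(torch.int8), scales


def dequantize(q: torch.Tensor, scales: torch.Tensor, shape):
    return (q.float() * scales).reshape(shape)


w = torch.randn(256, 256)
q, scales = quantize_rtn_4bit(w)
w_hat = dequantize(q, scales, w.shape)

print("mean abs error:", (w - w_hat).abs().mean().item())
```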
The traditional way: QLoRA
Two different approaches were used for quantization. The first is QLoRA – a familiar technique in which, after the weight matrices are quantized to 4-bit, low-rank adaptation is applied to them. This approach is still very effective and performed better in the benchmarks.
QLoRA is fairly easy to do yourself, but it requires a suitable GPU, a base model with 16-bit weights, and a dataset. The model’s weights are compressed to 4-bit and then fine-tuned on the data using LoRA, low-rank adaptation. In other words, only the parameters of the LoRA adapters – low-rank matrices – are trained. This kind of smart fine-tuning gives what you will see in the benchmarks below: a QLoRA model very close in quality to the weights at the original precision.
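A minimal sketch of the recipe using Hugging Face transformers, peft, and bitsandbytes is shown below. The checkpoint name, target modules, and hyperparameters are placeholders – the point is the shape of the approach: load the base weights in 4-bit NF4 and train only the low-rank adapters on top of the frozen quantized weights.

```python
# QLoRA sketch: 4-bit base model + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.2-1B"  # placeholder checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                  # rank of the adapter matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # illustrative choice of layers
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Only the LoRA adapter parameters are trainable; the 4-bit base is frozen.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here the model goes into a regular fine-tuning loop on your dataset,
# e.g. the transformers Trainer or trl's SFTTrainer.
```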
However, this method still requires a GPU – it must have enough memory to accommodate the weights in four-bit format. I rent a GPU in the cloud for QLoRA fine-tuning, and since I mostly work with 8B models, the required compute is not that huge.
But is it still possible to achieve similar results without training, using post-training quantization alone?
Advanced post-training quantization: SpinQuant
An alternative to QLoRA is SpinQuant, which quantizes models after training. Unlike QLoRA, it needs no dataset and no GPU for training – or rather, you still need a GPU, but no real training has to be run on it. SpinQuant involves two manipulations of the model weights: rotating the activation and weight matrices, and then regular PTQ quantization.
The problem with quantization is outlier values, which deviate greatly from the typical range of the data. They can cause a loss of prediction accuracy after quantization. To combat them, various methods are used to reduce the spread of values in the activation matrix X – such as normalization or multiplying X by a rotation matrix. You can read more about this in the SpinQuant paper.
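The core trick is easy to demonstrate on a toy example (this is not the SpinQuant code itself, and SpinQuant learns its rotations rather than picking them at random): an orthogonal rotation R can be folded into the weights so that the layer output is unchanged, since W·x = (W·R)(Rᵀ·x), while the rotated activations have their outliers spread across dimensions and become easier to quantize.

```python
# Toy demonstration of rotation-based outlier mitigation (random R, not the
# learned rotations SpinQuant uses).
import torch

torch.manual_seed(0)

d = 512
W = torch.randn(d, d)

# Activations with a few strong outlier channels, as often seen in LLMs.
x = torch.randn(d)
x[:4] *= 50.0

# Random orthogonal matrix via QR decomposition.
R, _ = torch.linalg.qr(torch.randn(d, d))

x_rot = R.T @ x   # rotated activations
W_rot = W @ R     # rotation folded into the weights

# The layer output is (numerically) identical with and without the rotation.
print("max output diff:", (W @ x - W_rot @ x_rot).abs().max().item())

# But the rotated activations are far less "spiky", so a quantization scale
# wastes less precision on a handful of outlier values.
print("max/std before:", (x.abs().max() / x.std()).item())
print("max/std after: ", (x_rot.abs().max() / x_rot.std()).item())
```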
There is an open repository in Python and PyTorch that offers a SpinQuant implementation compatible with ExecuTorch and Llama Stack. With it, it is convenient to quantize weights for different platforms, including mobile ones.
For an example of using SpinQuant in Google Colab, see my video:
Here are the results of a detailed comparison of models with different types of quantization. The metrics cover the uncompressed BF16 weights, weights after conventional post-training quantization – where there is a noticeable drop in quality – and weights after SpinQuant and QLoRA. The latter two, especially QLoRA, show benchmark results very close to the original model.
SpinQuant and QLoRA have approximately the same inference speed, although QLoRA consumes slightly more memory. Both are more than twice as fast as the uncompressed BF16 weights.
Quantization gives us cross-platform support and makes large language models more accessible to developers. Some people may never have looked at Llama and other models simply because they didn’t run on the platform they were used to – not everyone wants to write programs that run in the cloud.
But now even mobile developers have the tools to start exploring the possibilities of generative AI.