Towards 1-bit machine learning models

Recently, extremely low-bit quantization methods such as BitNet and 1.58-bit models have been actively developed and have attracted considerable interest in the machine learning community. The key idea behind this approach is that matrix multiplication with such quantized weights can be implemented without actual multiplications, which is potentially a complete game changer for the computational speed and efficiency of large machine learning models.

This article is written in a similar vein, but we are primarily interested in whether pretrained models can be quantized directly at extreme settings, including binary weights (0 and 1). Existing works focus on training such models from scratch. However, there are now many excellent pretrained models in the public domain, such as Llama2. Moreover, training from scratch is resource-intensive in terms of both compute and data, which puts such approaches out of reach for much of the community.

In this article, we take a closer look at extremely low-bit (2-bit and 1-bit) quantization of pretrained models using HQQ+. HQQ+ is an adaptation of HQQ (half-quadratic quantization) that uses a low-rank adapter to improve performance. Our results show that by training only a small fraction of the weights on top of an HQQ-quantized model (even a 1-bit one), the output quality improves significantly; such a model can even outperform small full-precision models.

Models are on Hugging Face: 1-bit, 2-bit.

Introduction

Quantizing smaller pretrained models to extremely low bit widths is not an easy task. We have already demonstrated that relatively large models such as Mixtral cope well with 2-bit quantization. Smaller models, however, such as the popular Llama2-7B, struggle at these extreme quantization levels, and the quality degrades severely with 1-bit quantization.

The purpose of this experiment is to show the community what can be achieved by fine-tuning such models under the most extreme quantization settings. To our surprise, fine-tuning only a tiny fraction of the parameters (approximately 0.65%) greatly improves the output quality. In particular, we observed:

  • 1-bit case: Directly applying 1-bit quantization to small models such as Llama2-7B yields suboptimal results. However, once the model is fine-tuned, its output quality improves significantly. Notably, the fine-tuned 1-bit base model outperforms even the 2-bit Quip#, despite being trained on only ~2.8K samples with a context window of 1024.

  • 2-bit case: Given more specialized data, the 2-bit model performs very well. In fact, the base 2-bit Llama2-7B with HQQ+ outperforms the full-precision model on Wikitext perplexity. The chat model outperforms its full-precision counterpart on the GSM8K dataset when given enough math and reasoning data.

Efficient matrix multiplication with low-bit quantization

The HQQ dequantization step is a linear operation that relies on a scale and a zero-point parameter. This section shows how to rewrite the dequantization step so that low-bit matrix multiplication can be exploited directly.

A new look at dequantization

The dequantization step in HQQ can be expressed as $W_r = (W_q - z)s$, where $W_r$ denotes the dequantized weights, $W_q$ the quantized weights, and the meta-parameters $z$ and $s$ the zero-point and scale vectors, respectively. To keep the explanation simple, we skip the reshaping steps required when grouping is used.

During the forward pass, the matrix multiplication (ignoring the bias term) takes the form:

$xW_r = x\left((W_q - z)s\right)$

To leverage low-bit matrix multiplication, we need to isolate $xW_q$ from the rest of the expression. We can rewrite the operation as follows:

$xW_r = x\left(W_q s + u\right),$

where $u = -z \odot s$ and $\odot$ denotes element-wise (Hadamard) multiplication. Note that, since $u$ is a vector, a direct matrix multiplication between $x$ and $u$ is not applicable. It can, however, be formulated as a rank-1 matrix multiplication:

$xW_r = x(W_q)s + x\mathbf{1}^T u. \qquad (1)$
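
As a sanity check, the identity in Equation 1 can be verified numerically. The following is a minimal PyTorch sketch (an illustration, not the library's actual kernel); the shapes are illustrative, a per-column zero-point and scale layout is assumed, and group-wise reshaping is skipped as above.

```python
import torch

n, m, batch = 64, 32, 4
x   = torch.randn(batch, n)
W_q = torch.randint(0, 2, (n, m)).float()   # quantized weights (binary here)
z   = torch.randn(m)                        # zero-point, one value per column (assumed layout)
s   = torch.randn(m)                        # scale, one value per column (assumed layout)

# Dequantize-then-multiply: x @ ((W_q - z) * s)
ref = x @ ((W_q - z) * s)

# Equation 1: quantized matmul, a column-wise rescale, and a rank-1 correction
u   = -z * s
out = (x @ W_q) * s + x @ (torch.ones(n, 1) * u)

assert torch.allclose(ref, out, atol=1e-4)
```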

In both the 1-bit and 2-bit cases, matrix multiplication with the quantized weights can be implemented with additions only; no multiplications are required:

  1. In the binary case, $W_q$ contains only zeros and ones, so only additions are required.

  2. In the 2-bit case, $W_q$ can be rewritten as the sum of a binary and a ternary matrix, so both terms can take full advantage of multiplication-free matrix multiplication within a fused kernel (a small sketch of this decomposition follows the equation below). The only change required is to use the range [−1, 0, 1, 2] instead of the original [0, 1, 2, 3]:

$\begin{bmatrix} 2 & 0 & 1 & -1 \end{bmatrix}_{2\text{-bit}} = \begin{bmatrix} 1 & 0 & 1 & 0 \end{bmatrix}_{\text{binary}} + \begin{bmatrix} 1 & 0 & 0 & -1 \end{bmatrix}_{\text{ternary}}$
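
Below is a minimal PyTorch sketch of this decomposition (an illustration only, not the fused kernel itself): the shifted 2-bit weights are split into a binary and a ternary part that sum back to the original matrix.

```python
import torch

W_2bit = torch.randint(-1, 3, (64, 32)).float()   # values in {-1, 0, 1, 2}

W_bin = (W_2bit > 0).float()                      # binary part,  values in {0, 1}
W_ter = W_2bit - W_bin                            # ternary part, values in {-1, 0, 1}

assert torch.equal(W_bin + W_ter, W_2bit)

# With binary/ternary operands, each product x @ W reduces to signed additions
# of rows of x; the dense matmuls below only stand in for such a kernel.
x = torch.randn(4, 64)
assert torch.allclose(x @ W_bin + x @ W_ter, x @ W_2bit, atol=1e-4)
```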

Fine-tuning with low-rank adapters

Methods such as BitNet train the entire network from scratch. Instead, we train low-rank adapters (LoRA/QLoRA), currently the most popular approach for fine-tuning large models.

As the rightmost term in Equation 1 suggests, the zero-point acts as a rank-1 correction between $W_q s$ and the original weights, and the low-rank adapter essentially increases the rank of this correction, which improves the quantization quality.

Let $L_A$ and $L_B$ be low-rank adapter parameters of rank $r$. The forward-pass matrix multiplication then takes the form:

$x(W_q)s + x\mathbf{1}^T u + x L_A^T L_B$

As detailed in our earlier work on low-rank pruning in Llama, the rank of a sum of two matrices is at most the sum of their ranks. Therefore, $x\mathbf{1}^T u + x L_A^T L_B$ can be merged into a single term of rank $r+1$ to obtain:

$x(W_q)s + x\bar{L}_A^T \bar{L}_B,$

where $\bar{L}_A$ and $\bar{L}_B$ are obtained from a low-rank decomposition of the matrix $\mathbf{1}^T u + L_A^T L_B$.
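
The merge itself can be done offline with a truncated SVD. The sketch below (PyTorch) is a hedged illustration of the idea; the dimensions, the per-column layout of $u$, and the adapter shapes are assumptions rather than the exact implementation.

```python
import torch

n, m, r = 512, 512, 8              # input dim, output dim, adapter rank (illustrative)
u   = torch.randn(m)               # u = -z * s, one value per output column (assumed layout)
L_A = torch.randn(r, n) * 0.01     # adapter factors, applied as x @ L_A.T @ L_B
L_B = torch.randn(r, m) * 0.01

# Rank-(r+1) correction: outer product of a ones vector with u, plus the adapter term
C = torch.ones(n, 1) * u + L_A.T @ L_B

# Truncated SVD gives merged factors of rank r + 1
U, S, Vh = torch.linalg.svd(C, full_matrices=False)
k = r + 1
L_A_bar = (U[:, :k] * S[:k].sqrt()).T      # shape (r + 1, n)
L_B_bar = S[:k].sqrt()[:, None] * Vh[:k]   # shape (r + 1, m)

# The merged low-rank term reproduces the combined correction
assert torch.allclose(L_A_bar.T @ L_B_bar, C, atol=1e-3)
```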

Datasets

The low-rank adapters were trained with supervised fine-tuning (SFT) on various publicly available datasets. Different datasets were used for the base model and the chat model. Details below:

Base model

wikitext-2-raw-v1 (~2.8K samples): This entire dataset was used to further train the base model. It lays the foundation for a general understanding of the language.

Chat model

1. timdettmers/openassistant-guanaco: This dataset was used in full to fine-tune the chat model.

2. microsoft/orca-math-word-problems-200k: A subset of this dataset was used to improve the model's ability to solve mathematical word problems.

3. meta-math/MetaMathQA: Another subset of this dataset was used to further improve the model's mathematical inference capabilities.

4. HuggingFaceH4/ultrafeedback_binarized (chosen answers only): A subset of the chosen responses from this dataset was used to fine-tune the model's ability to generate consistent and relevant responses.

Regarding subset sizes, we randomly sampled 10K samples for the 2-bit model and 25K samples for the 1-bit model.
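
For reference, the sketch below shows how such subsets might be assembled with the Hugging Face `datasets` library. This is only an assumed reconstruction: the exact split names, seeds, and formatting used for the released models are not documented here.

```python
from datasets import load_dataset

n_samples = 10_000  # e.g. for the 2-bit chat model; 25K was used for the 1-bit model

# Used in full
guanaco = load_dataset("timdettmers/openassistant-guanaco", split="train")

# Random subsets (split/column handling below is an assumption, not the exact recipe)
orca_math = load_dataset("microsoft/orca-math-word-problems-200k", split="train") \
    .shuffle(seed=0).select(range(n_samples))
metamath = load_dataset("meta-math/MetaMathQA", split="train") \
    .shuffle(seed=0).select(range(n_samples))
ultra = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs") \
    .shuffle(seed=0).select(range(n_samples))  # keep only the "chosen" responses downstream
```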

Checkpoints

We compared the performance of Llama2-7B in three configurations: FP16 (full precision), HQQ (no fine-tuning), and HQQ+ (with adapter layers), using a group size of 8. We chose Llama2-7B for these experiments because it is relatively small, its architecture is well understood, and it is easy to experiment with. We evaluated both the pretrained base model and the chat model.
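
As an illustration of the evaluation protocol, the sketch below measures Wikitext-2 perplexity for the FP16 baseline with the standard `transformers` stack. The model id, context length, and striding are assumptions; the HQQ/HQQ+ checkpoints would be loaded through the `hqq` library instead.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"   # FP16 baseline (assumed id)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

ctx = 1024                               # assumed context window
nlls, n_tok = [], 0
for i in range(0, ids.size(1) - 1, ctx):
    chunk = ids[:, i : i + ctx].to(model.device)
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss   # mean NLL over the chunk
    nlls.append(loss.float() * (chunk.size(1) - 1))
    n_tok += chunk.size(1) - 1

print("wiki perplexity:", torch.exp(torch.stack(nlls).sum() / n_tok).item())
```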

Base models

For the base models, we include results from Quip# (2-bit), the state-of-the-art quantization method proposed by Tseng et al. To our knowledge, there is no working 1-bit model for Llama2-7B other than ours, so for reference we also include the 2-bit Quip# results in the 1-bit comparison.

1-bit model

| Models | FP16 | HQQ (1-bit) | HQQ+ (1-bit) | Quip# (2-bit) |
|---|---|---|---|---|
| Wiki perplexity | 5.18 | 9866 | 8.53 | 8.54 |
| VRAM (GB) | 13.5 | 1.76 | 1.85 | 2.72 |
| Forward pass time (s) | 0.1 | 0.231 | 0.257 | 0.353 |

Plain 1-bit quantization leads to a dramatic drop in quality compared to the full-precision model, to the point of being almost unusable. However, adding the adapter layers brings the perplexity down to 8.53, making the model comparable to the 2-bit Quip# model (perplexity 8.54), even though it uses only binary weights.

2-bit model

| Models | FP16 | HQQ (2-bit) | HQQ+ (2-bit) | Quip# (2-bit) |
|---|---|---|---|---|
| Wiki perplexity | 5.18 | 6.06 | 5.14 | 8.54 |
| VRAM (GB) | 13.5 | 2.6 | 2.69 | 2.72 |
| Forward pass time (s) | 0.1 | 0.221 | 0.27 | 0.353 |

The 2-bit HQQ model beats Quip# even without any calibration data. Remarkably, after training the adapter layers, it achieves a lower perplexity than the full-precision model. This is a significant finding: it suggests that HQQ+ quantization not only reduces the model's memory footprint but can also help improve the quality of its language generation.


Which is better: quantized models or small language models?

On the one hand, training relatively small models from scratch requires less compute, and the model trains faster. Models such as Qwen1.5 show promising results and may be attractive for some applications. However, our findings suggest that heavily quantized larger models using methods such as HQQ+ can deliver even better performance while taking up less memory.

Let us emphasize again that these results were obtained on the relatively small Llama2-7B model. With extremely low-bit quantization and no adapter layer, as in the basic version of HQQ, we observe that larger models hold up better. For example, the Mixtral model we quantized earlier with the basic version of HQQ shows how much a model's memory footprint can be reduced while maintaining high performance and clearly outperforming much smaller models.

Conclusion

Our experimental 2-bit and 1-bit quantized versions of Llama2-7B with a low-rank adapter, produced with the proposed HQQ+ approach, clearly demonstrate the potential of extreme low-bit quantization for machine learning models. Despite the challenges of such minimalistic settings, the output quality can be improved significantly. We show that fine-tuned models can take advantage of optimized low-bit matrix multiplication, substantially reducing compute and memory requirements and making large language models more accessible. Although binary and ternary kernels for matrix multiplication do not yet exist, we hope that our work will spur interest in both software and hardware development in this direction, and that we will see the results in the near future.

Citation

@misc{badri2023hqq,
	title = {Towards 1-bit Machine Learning Models},
	url = {https://mobiusml.github.io/1bit_blog/},
	author = {Hicham Badri and Appu Shaji},
	month = {March},
	year = {2024}
}
