# Comparison of different quantization schemes for LLMs

## What is quantization?

Quantization is a model compression technique that reduces the bit width of an LLM's weights and activations, i.e., converts them from a data type that can hold more information to one that holds less. A common example is converting data from 16-bit floating point (F16) to an 8-bit or 4-bit integer (Q8 or Q4).

A good analogy for quantization is image compression. Compressing an image reduces its size by removing some of its information, i.e., data bits. While this usually lowers the image quality (to an acceptable level), it also means that more images fit on a given device and that each one takes less time and bandwidth to transfer or display. Likewise, quantizing an LLM increases its portability and the number of ways it can be deployed, at the cost of an acceptable loss of detail or precision.

## Why is quantization needed?

Quantization is an important process in machine learning because reducing the number of bits required for each weight significantly reduces the overall size of a model. Quantized LLMs therefore consume less memory, require less storage space, are more energy efficient, and are capable of faster inference. Together, these properties allow LLMs to run on a wider range of devices.

Running Llama 70B without quantization (F16) requires about 130 GB of GPU memory. With 4-bit quantization, roughly 40 GB is enough, and the accuracy loss is about 4 percent.
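The figures above follow from simple arithmetic: size is roughly the parameter count times the bits per weight. A minimal sketch, assuming exactly 70 billion parameters and ignoring activations, the KV cache, and per-block metadata (the function name `model_size_gib` is hypothetical):

```python
def model_size_gib(n_params, bits_per_weight):
    """Rough weight-only size in GiB: parameters * bits / 8 bytes."""
    return n_params * bits_per_weight / 8 / 2**30

N_PARAMS = 70e9  # ASSUMPTION: nominal parameter count of a "70B" model

f16_size = model_size_gib(N_PARAMS, 16)  # unquantized 16-bit weights
q4_size = model_size_gib(N_PARAMS, 4)    # pure 4-bit weights, no scales

print(f"F16:   {f16_size:.1f} GiB")
print(f"4-bit: {q4_size:.1f} GiB")
```

Real 4-bit files come out larger than this estimate because each block also stores a scale (and sometimes an offset), and inference needs extra memory beyond the weights.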

## What are the advantages and disadvantages of quantized LLMs?

Let's look at the pros and cons of quantization.

**Advantages**:

Smaller model size: By reducing the size of the weights, quantization results in more compact models. This allows them to be used in a wider range of situations, such as with less powerful equipment, and reduces storage costs. This makes it possible to increase scalability.

Faster execution: Using lower-bit operations on the weights, together with the correspondingly lower memory requirements, results in more efficient computation.

**Disadvantages**:

Loss of precision: By far the most significant disadvantage of quantization is the potential loss of precision in the model's output. Converting a model's weights to a lower precision will likely degrade its performance, and the more “aggressive” the quantization, i.e., the lower the target bit width (3 bits, 2 bits, etc.), the higher the risk of losing a lot of precision.

## Quantization schemes

The existing ggml quantization types are either “type-0” (Q4_0, Q5_0) or “type-1” (Q4_1, Q5_1). In “type-0”, the weights w are obtained from the quantized values q using w = d * q, where d is the block scale. In “type-1”, the weights are given by w = d * q + m, where m is the block minimum.
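The two formulas can be illustrated with a small round-trip. This is a simplified sketch of the idea, not ggml's actual kernels: the function names are hypothetical, the scale is chosen naively from the block's range, and nothing is packed into bytes.

```python
def quantize_type0(block, bits=4):
    """Symmetric scheme: w ~= d * q, with signed integer q."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4 bits
    amax = max(abs(w) for w in block)
    d = amax / qmax if amax > 0 else 1.0       # block scale
    q = [max(-qmax - 1, min(qmax, round(w / d))) for w in block]
    return d, q

def dequantize_type0(d, q):
    return [d * qi for qi in q]

def quantize_type1(block, bits=4):
    """Asymmetric scheme: w ~= d * q + m, with unsigned q and block minimum m."""
    levels = 2 ** bits - 1                     # e.g. 15 for 4 bits
    m = min(block)                             # block minimum (offset)
    span = max(block) - m
    d = span / levels if span > 0 else 1.0     # block scale
    q = [max(0, min(levels, round((w - m) / d))) for w in block]
    return d, m, q

def dequantize_type1(d, m, q):
    return [d * qi + m for qi in q]

# One block of 32 example weights (made-up values).
block = [0.1 * i - 1.5 for i in range(32)]
d0, q0 = quantize_type0(block)
d1, m1, q1 = quantize_type1(block)
err0 = max(abs(a - b) for a, b in zip(block, dequantize_type0(d0, q0)))
err1 = max(abs(a - b) for a, b in zip(block, dequantize_type1(d1, m1, q1)))
print(f"type-0 max error: {err0:.4f}, type-1 max error: {err1:.4f}")
```

The "type-1" variant spends extra bits on the offset m, but it uses the full integer range even when the block's values are not centered on zero.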

**Q2_K** – “type-1” 2-bit quantization in superblocks containing 16 blocks, each block has 16 weights. Block scales and minima are quantized into 4 bits. This results in an effective use of 2.5625 bits per weight (bpw).

**Q3_K** – “type-0” 3-bit quantization in superblocks containing 16 blocks, each block has 16 weights. The scales are quantized with 6 bits. This results in 3.4375 bpw.

**Q3_K_S** = Uses Q3_K for all tensors.

**Q3_K_M** = Uses Q4_K for the `attention.wv` tensor (the weights used to compute the value vectors in the attention layer), the `attention.wo` tensor (the weights used to compute the output of the attention layer), and the `feed_forward.w2` tensor (the weights used in the feed-forward layer after the attention layer). Uses Q3_K for everything else.

**Q3_K_L** = same as Q3_K_M, but uses Q5_K for the selected tensors.

**Q4_0** = 32 numbers per block, 4 bits per weight, 5 bits per value on average, each weight given by block scale * quantized value.

**Q4_1** = 32 numbers per block, 4 bits per weight, 6 bits per value on average, each weight given by block scale * quantized value + block offset.

**Q5_0** = 32 numbers per block, 5 bits per weight, 1 scale value in a 16-bit float, size is 5.5 bpw.

**Q5_1** = 32 numbers per block, 5 bits per weight, 1 scale value in a 16-bit float and 1 offset value in 16 bits, size is 6 bpw.

**Q6_K** – “type-0” 6-bit quantization in superblocks containing 16 blocks, each block has 16 weights. The scales are quantized with 8 bits. This results in 6.5625 bpw.

**Q8_0** = same as Q4_0, but with 8 bits per weight and 1 scale value of 32 bits, 9 bpw in total.

The remaining types that appear in the table below, such as Q4_K and Q5_K, follow the same scheme; only the number of bits per weight changes.
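The bpw figures quoted above can be reproduced by dividing the total bits in a block (quantized values plus metadata) by the number of weights. A minimal sketch; the helper name `bits_per_weight` is hypothetical, and for the superblock formats it assumes one additional 16-bit scale per superblock, which the descriptions above do not state explicitly:

```python
def bits_per_weight(n_weights, quant_bits, overhead_bits):
    """Average storage cost per weight for one block or superblock."""
    return (n_weights * quant_bits + overhead_bits) / n_weights

# Simple blocks: 32 weights plus the per-block metadata listed in the text.
q5_0 = bits_per_weight(32, 5, 16)        # one 16-bit scale
q5_1 = bits_per_weight(32, 5, 16 + 16)   # 16-bit scale + 16-bit offset
q8_0 = bits_per_weight(32, 8, 32)        # one 32-bit scale

# Superblocks: 16 blocks x 16 weights = 256 weights.
# ASSUMPTION: one extra 16-bit scale per superblock (not stated in the text).
q3_k = bits_per_weight(256, 3, 16 * 6 + 16)   # 16 block scales, 6 bits each
q6_k = bits_per_weight(256, 6, 16 * 8 + 16)   # 16 block scales, 8 bits each

for name, bpw in [("Q5_0", q5_0), ("Q5_1", q5_1), ("Q8_0", q8_0),
                  ("Q3_K", q3_k), ("Q6_K", q6_k)]:
    print(f"{name}: {bpw} bpw")
```

Under these assumptions the results match the 5.5, 6, 9, 3.4375, and 6.5625 bpw quoted in the list.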

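## Bits per weight (bpw)

Bits per weight (bpw) measures exactly what its name says: the average number of bits used to store one weight of the model, including per-block metadata such as scales and minima. It should not be confused with bits-per-character (BPC), a metric for evaluating language models that measures the average number of bits required to encode one character. BPC goes back to Shannon's explanation of the entropy of language: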

“if a language is translated into binary digits (0 or 1) in the most efficient manner, the entropy is the average number of binary digits required for each letter of the source language.”

## Perplexity

Perplexity (PPL) is one of the most common metrics for evaluating language models. It applies specifically to classical (autoregressive, or causal) language models.

PPL is defined as the exponentiated average negative log-likelihood of the tokens in the input sequence. Lower values are better.
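As an illustration of the definition, the sketch below computes perplexity from per-token log-probabilities. The function name and the input values are made up for the example; in practice the log-probabilities come from the model being evaluated.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical example: the model assigns probability 0.25 to every token,
# i.e. it is as uncertain as a uniform choice among four options.
log_probs = [math.log(0.25)] * 4
ppl = perplexity(log_probs)
print(f"perplexity = {ppl:.3f}")
```

A model that always assigns probability 0.25 to the correct token has a perplexity of 4, which is why perplexity is often read as the effective number of choices the model is hesitating between.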

## Determining the best quantization method for the 70B

| Model name | Quantization | Model size | RAM | Perplexity | Delta to FP16 |
|---|---|---|---|---|---|
| Llama 70B | Q4_0 | 36.20 GB | 41.37 GB | 3.5550 | 3.61% |
| Llama 70B | Q4_1 | 40.20 GB | 45.67 GB | 3.5125 | 2.37% |
| Llama 70B | Q5_0 | 44.20 GB | 49.96 GB | 3.4744 | 1.26% |
| Llama 70B | IQ2XS | 19.4 GB | 21.1 GB | 4.090 | 19.21% |
| Llama 70B | Q2_K | 27.27 GB | 31.09 GB | 3.7739 | 8.82% |
| Llama 70B | Q3_K_S | 28.6 GB | 31.4 GB | 3.7019 | 7.89% |
| Llama 70B | Q3_K_M | 30.83 GB | 35.54 GB | 3.5932 | 4.72% |
| Llama 70B | Q3_K_L | 33.67 GB | 38.65 GB | 3.5617 | 3.80% |
| Llama 70B | Q4_K_S | 36.39 GB | 41.37 GB | 3.4852 | 1.57% |
| Llama 70B | Q4_K_M | 39.5 GB | 42.1 GB | 3.4725 | 1.20% |
| Llama 70B | Q5_K_S | 45.3 GB | 47.7 GB | 3.4483 | 0.50% |
| Llama 70B | Q5_K_M | 46.5 GB | 48.9 GB | 3.4451 | 0.40% |
| Llama 70B | Q6_K | 54.0 GB | 56.1 GB | 3.4367 | 0.16% |
| Llama 70B | F16 | 128.5 GB | | 3.4313 | 0% |
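The last column appears to be the relative increase in perplexity over the F16 baseline. A minimal sketch of that calculation, using a hypothetical helper name and a few perplexity values copied from the table:

```python
def delta_to_fp16_percent(ppl_quant, ppl_fp16):
    """Relative perplexity increase over the FP16 baseline, in percent."""
    return (ppl_quant / ppl_fp16 - 1.0) * 100.0

FP16_PPL = 3.4313  # F16 row of the table

# Perplexities copied from the table above.
table_ppl = {"Q4_0": 3.5550, "Q5_K_S": 3.4483, "Q6_K": 3.4367}
deltas = {name: delta_to_fp16_percent(ppl, FP16_PPL)
          for name, ppl in table_ppl.items()}

for name, delta in deltas.items():
    print(f"{name}: +{delta:.2f}% vs FP16")
```

The computed values agree with the table to within rounding, which supports reading the column as a perplexity delta rather than a task-accuracy delta.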

## Conclusions

The most efficient quantization methods for a 48 GB video card are Q5_K_S and Q5_K_M. I use Q5_K_S in production myself.

The 70B model can run in about 28 GB and still does quite well; compared with 33B models at ordinary quantization levels, the larger model will most likely win.

A higher bit width does not automatically mean better inference. It is more stable and predictable, yes, but not better in quality. Apparently, the straightforward schemes handle “outliers” in the weights less well (not to mention that they lack an importance matrix) and so do not realize their full potential.

A quantization benchmark for 56B, 33B, 13B, and 7B models is under development. The final table with the benchmark results will be published on the author's Telegram channel, it_garden.