1-bit LLMs could solve AI's power consumption problem
"Imprecise" language models are smaller, faster, and almost as accurate
Large language models, the artificial intelligence systems that power chatbots like ChatGPT, are getting better and better, but they are also getting bigger and bigger, demanding more energy and computing power. For LLMs to be cheap, fast, and environmentally friendly, they need to shrink, ideally enough to run directly on devices like mobile phones. Researchers are finding ways to do just that by radically rounding off the many high-precision numbers that store a model's memories to just 1 or -1.
LLMs, like all neural networks, learn by changing the strengths of the connections between their artificial neurons; those strengths are stored as mathematical parameters. Researchers have long compressed networks by reducing the precision of these parameters, a process called quantization, so that instead of taking up 16 bits each, they might take up 8 or 4. Now researchers are pushing the precision all the way down to a single bit.
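To see what quantization does, consider the toy sketch below, which squeezes a handful of floating-point weights into signed 4-bit integers with a single scaling factor and then converts them back; the function names and weight values are invented for illustration, not taken from any particular model.

```python
import numpy as np

def quantize_4bit(weights):
    """Map float weights to signed 4-bit integers in [-8, 7] using one per-tensor scale."""
    scale = np.max(np.abs(weights)) / 7.0          # one scale shared by the whole tensor
    q = np.clip(np.round(weights / scale), -8, 7)  # these integers fit in 4 bits
    return q.astype(np.int8), scale

def dequantize(q, scale):
    """Recover approximate float weights from the 4-bit integers."""
    return q.astype(np.float32) * scale

weights = np.array([0.42, -1.30, 0.07, 0.88, -0.55], dtype=np.float32)
q, scale = quantize_4bit(weights)
print(q)                     # [ 2 -7  0  5 -3]
print(dequantize(q, scale))  # close to, but not exactly, the original weights
```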
How to make a 1-bit LLM
There are two general approaches. One, called post-training quantization (PTQ), quantizes the parameters of a network that has already been trained at full precision. The other, quantization-aware training (QAT), trains a network from scratch to have low-precision parameters. So far, PTQ has been more popular among researchers.
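A common building block in PTQ-style binarization, sketched below with invented values, is to keep only the sign of each weight plus one scaling factor that preserves the average magnitude; published methods add many refinements on top of this basic step.

```python
import numpy as np

def binarize(weights):
    """Replace each weight with -1 or +1, plus one scale that preserves the average magnitude."""
    scale = np.mean(np.abs(weights))        # per-matrix scaling factor
    signs = np.where(weights >= 0, 1, -1)   # the 1-bit representation
    return signs.astype(np.int8), scale

W = np.random.randn(4, 4).astype(np.float32)  # stand-in for a trained weight matrix
signs, scale = binarize(W)
W_approx = signs * scale                       # what a binarized layer would actually use
```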
In February, a team including Haotong Qin of ETH Zürich, Xianglong Liu of Beihang University, and Wei Huang of the University of Hong Kong introduced a PTQ method called BiLLM. It approximates most of a network's parameters using 1 bit, but represents a small number of salient weights, those with the greatest impact on performance, using 2 bits. In one test, the team binarized a version of Meta's LLaMA model that contains 13 billion parameters.
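BiLLM's actual procedure is more elaborate, but its core idea, spending extra bits only on the most influential weights, can be illustrated roughly as follows; the "top 10 percent by magnitude" rule and the 4-level code here are simplifications chosen for the example.

```python
import numpy as np

def mixed_precision_binarize(weights, salient_frac=0.1):
    """Binarize most weights to -1/+1 times a scale; give the largest ones a 4-level (2-bit) code."""
    salient = np.abs(weights) >= np.quantile(np.abs(weights), 1.0 - salient_frac)

    # 1-bit part: sign times the average magnitude of the ordinary weights
    scale = np.mean(np.abs(weights[~salient]))
    approx = np.where(weights >= 0, scale, -scale)

    # 2-bit part: four levels {-3, -1, +1, +3} times a separate scale for the salient weights
    s_scale = np.max(np.abs(weights[salient])) / 3.0
    levels = 2 * np.clip(np.floor(weights[salient] / (2 * s_scale)), -2, 1) + 1
    approx[salient] = levels * s_scale
    return approx

W = np.random.randn(8, 8).astype(np.float32)
W_q = mixed_precision_binarize(W)   # same shape as W, but only a handful of distinct values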
Single-bit LLMs open up new opportunities for the development of hardware and systems specifically optimized to work with 1-bit LLMs.
—Furu Wei, Microsoft Research Asia
To measure performance, the researchers used a metric called perplexity, which is essentially a measure of how surprised the trained model was by each successive piece of text. On one dataset, the original model had a perplexity of around 5, and the BiLLM version scored around 15, much better than its closest binarized competitor, which scored around 37 (for perplexity, lower is better). At the same time, the BiLLM model required only about a tenth of the memory of the original.
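Concretely, perplexity is the exponential of the average negative log-probability the model assigned to each token of a test text, as in this small sketch with made-up probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability assigned to each token."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Probabilities a hypothetical model assigned to the tokens it had to predict.
confident = [0.5, 0.4, 0.6, 0.3]      # the model is rarely surprised
uncertain = [0.05, 0.02, 0.1, 0.04]   # the model is surprised often

print(perplexity(confident))   # ~2.3  (lower is better)
print(perplexity(uncertain))   # ~22.4
```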
PTQ has several advantages over QAT, says Wanxiang Che, a computer scientist at Harbin Institute of Technology in China. It does not require collecting training data, it does not require training a model from scratch, and the training process is more stable. QAT, on the other hand, has the potential to make models more accurate, because quantization is built into the model from the very beginning.
1-bit LLMs hold their own against their larger cousins
Last year, a team led by Furu Wei and Shuming Ma of Microsoft Research Asia in Beijing created BitNet, the first 1-bit QAT method for LLMs. After tweaking the rate at which the network adjusts its parameters in order to stabilize training, they created LLMs that performed better than those produced by PTQ methods. They were still not as good as full-precision networks, but they were roughly 10 times as energy efficient.
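In QAT, the forward pass typically uses binarized weights while gradient updates go to a hidden full-precision copy, using a "straight-through estimator" that pretends the rounding step is transparent to gradients. The PyTorch sketch below shows that general trick; it is a simplified stand-in, not BitNet's actual implementation.

```python
import torch
import torch.nn as nn

class BinaryLinear(nn.Module):
    """Linear layer that trains full-precision weights but computes with their -1/+1 signs."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        scale = self.weight.abs().mean()
        w_bin = torch.sign(self.weight) * scale
        # Straight-through estimator: use binarized weights in the forward pass,
        # but let gradients flow to the full-precision weights as if no rounding happened.
        w = self.weight + (w_bin - self.weight).detach()
        return x @ w.t()

layer = BinaryLinear(16, 4)
out = layer(torch.randn(2, 16))
out.sum().backward()           # gradients reach layer.weight despite the sign() step
```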
In February, Wei's team announced BitNet 1.58b, in which parameters can be -1, 0, or 1, meaning each one takes up roughly 1.58 bits of memory (log2 of 3). A BitNet model with 3 billion parameters performed as well on a variety of language tasks as a full-precision LLaMA model with the same number of parameters and the same amount of training, but it was 2.71 times as fast, used 72 percent less GPU memory, and consumed 94 percent less GPU energy. Wei called it "the moment of truth." The researchers also found that the performance advantages grew as they trained larger models.
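The 1.58 figure is simply log2(3), the number of bits needed to encode three equally likely values. A ternary rounding of a weight matrix might look like the sketch below, which scales by the mean absolute value and clips to {-1, 0, 1}; treat it as an illustrative simplification rather than the paper's exact recipe.

```python
import numpy as np

print(np.log2(3))  # ~1.585 bits of information per three-valued parameter

def ternarize(weights):
    """Round each weight to -1, 0, or +1 after scaling by its mean absolute value."""
    scale = np.mean(np.abs(weights)) + 1e-8
    ternary = np.clip(np.round(weights / scale), -1, 1).astype(np.int8)
    return ternary, scale

W = np.random.randn(4, 4).astype(np.float32)
T, scale = ternarize(W)   # T contains only -1, 0, and 1; scale restores the overall magnitude
```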
A BitNet model with 3 billion parameters performed as well as a full-precision LLaMA model on various language tasks.
This year, a team led by Che at Harbin Institute of Technology released a preprint on another LLM binarization method, called OneBit. OneBit combines elements of PTQ and QAT: it uses a pre-trained full-precision LLM to generate training data for the quantized version. On one dataset, the team's 13-billion-parameter model achieved a perplexity of about 9, compared with 5 for a full-precision LLaMA model of the same size, while occupying only about a tenth of the memory. On specialized chips, it could presumably run much faster.
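The step of having the full-precision model teach its quantized counterpart is, in spirit, knowledge distillation. The sketch below shows a generic distillation loss that pushes a quantized "student" to match a full-precision "teacher's" token distribution; it is a standard technique shown here for illustration, not OneBit's specific training objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence pushing the quantized student toward the teacher's token distribution."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)

# Toy logits over a 100-token vocabulary for 8 prediction positions.
teacher = torch.randn(8, 100)                       # would come from the frozen full-precision LLM
student = torch.randn(8, 100, requires_grad=True)   # would come from the 1-bit model being trained
loss = distillation_loss(student, teacher)
loss.backward()
```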
Quantized models have many advantages, says Microsoft's Wei. They fit on smaller chips, they require less data transfer between memory and processors, and they allow faster processing. Current hardware, however, cannot take full advantage of them. LLMs typically run on GPUs, such as Nvidia's, which represent weights at high precision and spend most of their energy multiplying them. New hardware could represent each parameter simply as a -1 or 1 (or 0) and then just add and subtract values, avoiding multiplication. "Single-bit LLMs open up new possibilities for developing custom hardware and systems that are optimized to work with single-bit LLMs," says Wei.
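The hardware point is easy to see in code: when every weight is -1, 0, or 1, a matrix-vector product reduces to additions and subtractions, with no multiplications at all. A plain-Python sketch of the idea:

```python
def matvec_ternary(weights, x):
    """Matrix-vector product where every weight is -1, 0, or +1: no multiplications needed."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # add instead of multiply
            elif w == -1:
                acc -= xi      # subtract instead of multiply
            # w == 0 contributes nothing
        out.append(acc)
    return out

W = [[1, -1, 0], [0, 1, 1]]    # ternary weights
x = [0.5, 2.0, -1.5]
print(matvec_ternary(W, x))    # [-1.5, 0.5]
```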
"They have to grow together," says Huang of the University of Hong Kong, referring to 1-bit models and processors. "But it's a long road to developing new hardware."