How researchers are breaking conventional AI approaches by eliminating matrix multiplication

This chart, taken from the article, shows the relative performance of the MatMul-free LLM compared to a conventional (Transformer++) LLM on benchmark tasks.

Hello, this is Elena Kuznetsova, automation specialist at Sherpa Robotics. Today I have translated for you an article about running AI models without unnecessary mathematics. We all know that neural networks are an energy-hungry business, and the research described in this article could help reduce the energy they consume.

Researchers from the University of California at Santa Cruz, UC Davis, LuxiTech and Soochow University have announced the development of a new approach to optimizing the performance of AI language models that eliminates matrix multiplication from computational processes. This fundamentally changes the operations of neural networks, which are currently accelerated by graphics processing units (GPUs). The findings, outlined in a recent preprint, could have a significant impact on the environmental sustainability and operating costs of AI systems.

Matrix multiplication, or MatMul, is a key element of most neural network computational tasks. GPUs are particularly efficient at performing these operations because they can execute many multiplications in parallel. That capability temporarily made Nvidia the world's most valuable company last week; it now holds about 98% of the market for data center GPUs, which are widely used to run AI systems such as ChatGPT and Google Gemini.
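As a minimal illustration (mine, not taken from the article), the sketch below shows why a forward pass through a large model is dominated by multiply-accumulate operations: every dense layer is, at its core, one matrix multiplication. The sizes are arbitrary.

```python
import numpy as np

# Minimal sketch (not from the article): one dense layer of a transformer
# is a single matrix multiplication, so a forward pass through a large model
# adds up to an enormous number of multiply-accumulate operations.
x = np.random.randn(1, 4096)      # activations for one token (size arbitrary)
W = np.random.randn(4096, 4096)   # one layer's weight matrix
y = x @ W                         # ~4096 * 4096 ≈ 16.8 million multiply-adds
print(y.shape)                    # (1, 4096)
```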

In the new paper, entitled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom model with 2.7 billion parameters that shows results comparable to traditional large language models without using MatMul. They also demonstrate running a 1.3-billion-parameter model at 23.8 tokens per second on a GPU accelerated by a custom-programmed FPGA chip that consumes only 13 watts (not counting the GPU's own power consumption). This paves the way for the development of more efficient and adaptive architectures, the study authors say.

Although the method has not yet been peer-reviewed, the researchers – Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Shives, Yijiao Wang, Dustin Richmond, Peng Zhou and Jason Eshraghian – say their work challenges the conventional wisdom that matrix multiplication operations are essential for building highly efficient language models. They argue that their approach can make large language models more accessible, efficient and robust, especially for use on resource-constrained hardware such as smartphones.

Getting rid of matrix math

In their paper, the researchers cite BitNet, a “1-bit” transformer technique that went viral as a preprint in October, as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights in language models, successfully scaling to 3 billion parameters while maintaining competitive performance.

However, the authors note that BitNet still relied on matrix multiplications in its self-attention mechanism. This limitation motivated the present study, prompting the authors to develop a completely “MatMul-free” architecture that maintains performance while eliminating matrix operations even from the attention mechanism.

The researchers proposed two major innovations. First, they created a custom language model restricted to using only ternary values (-1, 0, 1) instead of traditional floating-point numbers, which allows for much simpler computations. Second, they replaced the resource-intensive self-attention mechanism of traditional language models with a simpler, more efficient unit they call the MatMul-free Linear Gated Recurrent Unit (MLGRU), which processes words sequentially using basic arithmetic operations instead of matrix multiplications.
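To make the first idea concrete, here is a minimal, hedged sketch (mine, not the authors' code) of a ternary linear layer: because every weight is -1, 0 or +1, the usual multiply-accumulate collapses into additions, subtractions and skips, so no floating-point multiplications are needed.

```python
import numpy as np

def ternary_linear(x, w_ternary):
    """Illustrative ternary layer: weights are only -1, 0 or +1, so the usual
    multiply-accumulate reduces to additions, subtractions and skips.
    A sketch of the idea, not the paper's implementation."""
    out = np.zeros(w_ternary.shape[1])
    for j in range(w_ternary.shape[1]):          # each output feature
        for i, w in enumerate(w_ternary[:, j]):
            if w == 1:
                out[j] += x[i]    # +1 weight: add the input
            elif w == -1:
                out[j] -= x[i]    # -1 weight: subtract the input
            # 0 weight: contributes nothing, skip entirely
    return out

# Toy usage: 4 input features, 2 output features
x = np.array([0.5, -1.2, 3.0, 0.1])
W = np.array([[ 1,  0],
              [-1,  1],
              [ 0, -1],
              [ 1,  1]])
print(ternary_linear(x, W))       # same result as x @ W, but with no multiplies
```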

The third innovation is the adaptation of a gated linear unit (GLU), a mechanism for controlling the flow of information in neural networks, to use ternary weights for channel mixing. Channel mixing refers to the process of combining and transforming different aspects or features of the data the AI is working with, similar to how a DJ mixes different audio channels into a cohesive composition.
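The following is a hedged sketch of a GLU-style channel-mixing block with ternary weight matrices; the shapes, the SiLU gate and the down-projection here are my illustrative assumptions rather than the authors' exact formulation.

```python
import numpy as np

def ternary_glu(x, W_gate, W_up, W_down):
    """Sketch of a gated linear unit (GLU) for channel mixing in which all
    weight matrices hold only values from {-1, 0, +1}. Illustrative only."""
    gate = x @ W_gate                    # ternary weights: effectively adds/subtracts
    up = x @ W_up
    silu = gate / (1.0 + np.exp(-gate))  # SiLU activation on the gating path (assumption)
    return (silu * up) @ W_down          # gate the features, then project back down

# Toy usage with random ternary weights
rng = np.random.default_rng(0)
d, h = 8, 16                                       # model and hidden widths (arbitrary)
x = rng.standard_normal(d)
W_gate = rng.integers(-1, 2, size=(d, h))          # entries drawn from {-1, 0, 1}
W_up = rng.integers(-1, 2, size=(d, h))
W_down = rng.integers(-1, 2, size=(h, d))
print(ternary_glu(x, W_gate, W_up, W_down).shape)  # (8,)
```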

These changes, combined with a custom hardware implementation for accelerating ternary operations using an FPGA chip, allowed the researchers to achieve what they claim is performance comparable to state-of-the-art models while significantly reducing power consumption. Although they performed comparisons on GPUs to evaluate their models against traditional ones, the MatMul-free models are optimized to run on hardware suitable for simpler arithmetic operations, such as FPGAs. This suggests that these models can be run efficiently on a variety of hardware types, including those that have more limited computational resources compared to GPUs.

To evaluate their approach, the researchers compared their MatMul-free model with a replicated Llama-2 model (which was called “Transformer++”) across three model sizes: 370M, 1.3B, and 2.7B parameters. All models were pre-trained on the SlimPajama dataset, with larger models trained on 100 billion tokens each. The researchers claim that the MatMul-free model performed competitively against the base Llama 2 model on several benchmarks, including question answering, general reasoning, and physical understanding.

In addition to reducing power consumption, the MatMul-free model significantly reduced memory usage. Their optimized GPU implementation reduced memory consumption by up to 61% during training compared to the unoptimized baseline model.

It is worth noting that the 2.7-billion-parameter Llama-2-style model is a far cry from the current best LLMs on the market, such as GPT-4, which is estimated to contain over 1 trillion parameters. GPT-3 came out in 2020 with 175 billion parameters. A larger number of parameters generally means greater complexity (and, roughly speaking, capability), but at the same time researchers are finding ways to achieve higher levels of LLM performance with fewer parameters.

Thus, we have not yet reached a level of processing comparable to ChatGPT, but the UC Santa Cruz researchers do not rule out the possibility of achieving that level of performance with additional resources.

Extrapolation into the future

The researchers claim that the scaling laws observed in their experiments suggest that a MatMul-free language model can outperform traditional LLMs at very large scales. The authors predict that their approach could theoretically match and even surpass the performance of standard LLMs at scales of around 10²³ FLOPs, which is roughly the same computational budget required to train models such as Meta's Llama-3 with 8 billion parameters or Llama-2 with 70 billion parameters.
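For a rough sense of what a 10²³-FLOP budget means, here is a hedged back-of-the-envelope estimate using the common "training FLOPs ≈ 6 × parameters × tokens" heuristic; the 2-trillion-token figure is an illustrative assumption of mine, not a number from the paper.

```python
# Back-of-the-envelope check of the ~10^23 FLOPs figure using the common
# "training FLOPs ≈ 6 * parameters * tokens" heuristic. The token count is
# an illustrative assumption, not a value taken from the paper.
def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

# e.g. an 8-billion-parameter model trained on ~2 trillion tokens:
print(f"{training_flops(8e9, 2e12):.1e}")   # ~9.6e+22 FLOPs, i.e. around 10^23
```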

However, the authors note that their work has limitations. The MatMul-free model has not yet been tested on extra-large models (e.g., with more than 100 billion parameters) due to computational limitations. They encourage organizations with large resources to invest in scaling and further developing this lightweight approach to language modeling.
