Intel Gaudi — AI Accelerator Race

This is ServerFlow, and today we want to talk about a pressing topic – AI and neural networks, or more precisely, the hardware on which neural networks are trained and on which they later run. In recent years this industry has resembled a fight-club arena where tech giants compete fiercely to offer the most productive and efficient solutions for machine learning. And although it does not look like anyone in this arena will manage to unseat the market leader, NVIDIA, the attempts keep coming.
Intel keeps trying, introducing its series of AI accelerators under the Gaudi brand and, not long ago, the updated Gaudi 3 model. Intel had previously attempted to develop its own AI accelerators, but this time the work was carried out by Habana Labs, which Intel acquired in 2019 for an impressive $2 billion.

The Path to Gaudi

The roots of the Gaudi architecture go back to the work of the Israeli startup Habana Labs, which was founded in 2016 by a group of experienced engineers and entrepreneurs.

Habana Labs' first major achievement was the release of Goya, a processor optimized for neural network inference. Goya demonstrated impressive results in machine learning tasks. High performance coupled with competitive energy efficiency attracted the attention of tech giants, including Intel.

Seeing the potential of Habana Labs' future developments and their possible impact on the AI accelerator market, Intel made the strategic decision to acquire the company outright for a hefty $2 billion. This decision was driven not only by the success of Goya, but also by the prospects of the Gaudi processor then in development for training neural networks.

Intel Gaudi HL-205 accelerators installed in a dedicated Habana Labs Gaudi HLS-1 OAM server

It is worth noting that Intel had previously attempted to develop its own solutions for working with AI, such as Intel Loihi, Nervana, and the consumer Neural Compute Stick. However, these projects failed to achieve mass success due to a lack of competitiveness. This is what prompted Intel to acquire Habana Labs, whose developments had already proven effective, in order to quickly strengthen its position in the AI accelerator market.

This decision, although costly, turned out to be strategically justified for Intel. The acquisition of Habana Labs not only gave the company access to cutting-edge AI technologies, but also allowed it to quickly strengthen its position in this promising market, compensating for the lag behind competitors in the field of AI accelerators.

Intel Loihi is a neuromorphic chip designed to mimic the behavior of biological neural networks, enabling efficient execution of AI and machine learning tasks using spiking neural networks.

Breakthrough Gaudi 3

Today, the Gaudi line is actively developing and already has three generations, demonstrating the continuous improvement of Intel technologies in the field of AI accelerators. The latest achievement in this evolution is Gaudi 3, a device for hardware acceleration of tasks in the field of machine learning.

Gaudi 3 belongs to the NPU (Neural Processing Unit) class and is a specialized processor optimized for working with neural networks. Unlike general-purpose GPUs, NPUs are designed to efficiently process tensors – multidimensional data arrays that are the basis of deep learning calculations.

To better understand the difference between an NPU and a GPU, consider the following comparison: where a single GPU execution unit processes one data vector at a time, a comparable NPU unit can operate on an entire tensor at once, which significantly speeds up calculations in AI tasks. It is this abundance of tensor engines that makes NPUs so effective in AI training workloads.
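To make the contrast concrete, here is a tiny, purely illustrative NumPy sketch (not Gaudi-specific code): the "vector at a time" loop feeds the weight matrix one row of the batch per step, while the "whole tensor at once" call hands the entire batch to a single matrix multiplication, which is exactly the kind of operation a dedicated matrix engine is built to accelerate.

```python
import numpy as np

# Toy example: multiply a batch of activations (256 x 1024)
# by a weight matrix (1024 x 512).
activations = np.random.rand(256, 1024).astype(np.float32)
weights = np.random.rand(1024, 512).astype(np.float32)

# "One vector at a time": process a single row of the batch per step.
out_rowwise = np.empty((256, 512), dtype=np.float32)
for i in range(activations.shape[0]):
    out_rowwise[i] = activations[i] @ weights

# "Whole tensor at once": a single matmul over the entire batch.
out_tensor = activations @ weights

assert np.allclose(out_rowwise, out_tensor, atol=1e-3)
```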

This architectural feature enables Gaudi 3 to achieve impressive performance in machine learning and artificial intelligence tasks, providing a significant advantage over traditional computing architectures in specific AI-centric use cases.

Key benefits and possible implementation options of Intel Gaudi 3

Gaudi 3 embodies this concept: a chip built on a 5 nm process with 64 tensor processor cores and 128 GB of high-speed HBM2e memory. Its architecture is optimized for working with large language models and includes specialized engines for matrix calculations. It is also worth noting separately that the integration of the network adapter directly into the NPU die is a key feature of the Gaudi 3 architecture. This significantly improves system scalability, especially when working with LLMs (large language models) and other machine-learning workloads. The presence of 24 links at 200 Gbps provides high throughput for data transfer, which is critical for distributed computing and for processing large amounts of data.
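To give a sense of what programming such an accelerator looks like in practice, below is a minimal sketch of a PyTorch training step targeting a Gaudi device through Intel's Habana software stack. The module name, the "hpu" device type, and the lazy-mode mark_step() call reflect Habana's published PyTorch bridge, but exact APIs vary between software releases, so treat this as an approximation rather than verified Gaudi 3 code.

```python
import torch
import torch.nn as nn

# Assumes the Intel Gaudi (Habana) software stack is installed;
# it registers Gaudi accelerators as the "hpu" device type.
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")

model = nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(32, 1024, device=device)
target = torch.randn(32, 1024, device=device)

loss = nn.functional.mse_loss(model(x), target)
loss.backward()
htcore.mark_step()   # in lazy mode, triggers graph compilation/execution
optimizer.step()
htcore.mark_step()
```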

Comparison with competitors

In Intel's own tests, Gaudi 3 demonstrates impressive results compared to its direct competitors from NVIDIA. The tests covered LLM (large language model) training on LLaMA 2 and GPT-3 workloads with 7, 13, and 175 billion parameters, where Gaudi shows up to 1.7x* higher performance than the NVIDIA H100. This performance gain is especially important for training large-scale language models and other complex neural networks, where training time is a critical factor.

Intel slide showing the superiority of Gaudi's NPU over its NVIDIA H100 counterpart

An equally important aspect is the energy efficiency of Gaudi 3, especially in inference tasks, where it demonstrates up to 40% better efficiency than competitors. This advantage matters greatly for large data centers and cloud providers, where energy optimization directly affects operating costs and the environmental footprint of the infrastructure.

Such impressive results are achieved thanks to the synergy of several key factors:

  1. High computing power provided by an increased number of tensor processors and specialized matrix engines.

  2. Improved memory architecture with higher capacity and bandwidth compared to the NVIDIA H100, which is critical for working with large models and datasets.

  3. An efficient network infrastructure that enables the creation of scalable systems with high throughput between nodes due to the already integrated network adapter.

The combination of these factors makes Gaudi 3 a powerful and versatile tool for solving a wide range of complex AI problems, from training large-scale language models to high-performance real-time inference.

NVIDIA Showdown: Is There a Chance?

“Universal baseboard” based on the latest Gaudi HLB-325, designed to compete with NVIDIA's DGX systems by effectively pooling accelerator resources

Intel Gaudi 3 demonstrates impressive results against the competition, challenging even the most powerful solutions on the market. The configuration with eight Gaudi 3 accelerators achieves a phenomenal performance of 14.68 petaflops in FP16 (BFloat16) calculations. This significantly exceeds the 8 petaflops of a similar configuration on NVIDIA H100, which indicates a significant technological breakthrough for Intel.
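The headline figures are easy to sanity-check with simple arithmetic, assuming performance scales linearly across the eight accelerators on a board (the per-chip numbers below are derived from the board-level figures above, not separately confirmed specifications):

```python
# Back-of-the-envelope check of the quoted board-level numbers.
gaudi3_board_pflops = 14.68   # 8x Gaudi 3, BF16, per Intel's comparison
h100_board_pflops = 8.0       # 8x H100, BF16, per the same comparison

print(gaudi3_board_pflops / 8)                   # ~1.84 PFLOPS per Gaudi 3
print(h100_board_pflops / 8)                     # ~1.0 PFLOPS per H100
print(gaudi3_board_pflops / h100_board_pflops)   # ~1.8x advantage on paper
```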

Moreover, the Gaudi 3's cost-effectiveness takes it to a new level of competitiveness: the cost per petaflop of performance comes out to about $18.7, versus $46.8 for the H100. That is nearly a 2.5-fold advantage in price/performance, which makes Gaudi 3 not just a serious competitor but a potential market leader. But there is a nuance here.

The nuance is that with the H100, all those petaflops are guaranteed to be maximally compatible with a wide range of libraries, frameworks, and ready-made AI models, because NVIDIA outpaces its competitors not so much in hardware as in the mature software ecosystem built around the four cherished letters – CUDA.

Slide with detailed comparison of Gaudi 3 with H100. Intel's accelerator is on average 50% faster in similar scenarios

Whether the same will hold for Intel's product, how accurate its benchmarks are, and what other pitfalls may surface once the accelerators are actually deployed and debugged remains an open question.

However, it is important to note that Gaudi 3's advantages are most pronounced in specific use cases. In particular, its superiority is especially noticeable in tasks that require keeping large amounts of data in memory. Gaudi 3 comes with a massive 128 GB of HBM2e memory, significantly more than the 80 GB of HBM3 in the H100. This gives Gaudi 3 a significant advantage when working with large-scale machine learning models and natural language processing tasks, where the amount of data being processed is critical to achieving highly accurate results.
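A rough, illustrative estimate shows why that extra capacity matters (the model sizes below are hypothetical examples, and real training or inference also needs memory for activations, optimizer state, and KV caches on top of the weights):

```python
# Rough estimate of on-card memory needed just for model weights.
def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Weight memory in GB, assuming BF16/FP16 storage (2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

for params_b in (13, 60, 70):   # hypothetical model sizes, in billions
    need = weights_gb(params_b)
    print(f"{params_b}B params: ~{need:.0f} GB of weights | "
          f"fits in 128 GB: {need < 128} | fits in 80 GB: {need < 80}")
```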

Why Gaudi 3 when there is Intel GPU Max?

Intel GPU MAX Product Line

While Gaudi 3 may seem redundant alongside the existing GPU Max lineup, it reflects Intel's understanding of the diverse needs of the AI computing market. Gaudi 3 is not a duplication of effort but a strategic move to address different segments of the HPC market.

Based on the Xe architecture, the GPU Max line targets a wide range of workloads, including both traditional graphics computing and general parallel computing for AI. This versatility makes GPU Max an ideal choice for organizations that need flexible solutions that can adapt to different types of workloads.

In contrast, Gaudi 3 is a dedicated solution optimized exclusively for deep learning and AI inference. Its tensor computing-based architecture enables unprecedented efficiency in specific AI tasks, especially those that require processing large amounts of data and complex matrix operations.

With Gaudi 3, Intel aims to provide the optimal solution for organizations focused solely on the development and application of AI technologies. This allows the company to meet the needs of both customers who require maximum flexibility (with GPU Max) and those who are looking for unmatched performance in highly specialized AI tasks (with Gaudi 3).
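In practice the two product lines are even addressed differently from the software side: GPU Max cards are exposed to PyTorch as XPU devices via Intel Extension for PyTorch, while Gaudi accelerators appear as HPU devices via the Habana bridge. The sketch below simply picks whichever stack is available; package and device names follow Intel's public documentation, but APIs change between releases, so verify against current docs.

```python
import torch

def pick_intel_ai_device() -> torch.device:
    """Prefer a Gaudi (HPU) device, fall back to GPU Max (XPU), then CPU."""
    try:
        # Habana/Intel Gaudi PyTorch bridge registers the "hpu" device type.
        import habana_frameworks.torch.core  # noqa: F401
        return torch.device("hpu")
    except ImportError:
        pass
    try:
        # Intel Extension for PyTorch exposes GPU Max (Xe) cards as "xpu".
        import intel_extension_for_pytorch  # noqa: F401
        if torch.xpu.is_available():
            return torch.device("xpu")
    except (ImportError, AttributeError):
        pass
    return torch.device("cpu")

print(pick_intel_ai_device())
```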

Gaudi's Success on Amazon Web Services

AWS is currently one of the key and largest customers using Gaudi for machine learning. It even seems plausible that Amazon helped initiate further work on Gaudi, perhaps treating Intel almost as an outsourced developer – however, this is only speculation.

In practical terms, this resulted in the creation of Amazon EC2 DL1 instances, tailored for machine learning tasks. AWS conducted serious testing of these instances, building a cluster of 16 machines, each with eight Gaudi accelerators. On this hardware, they tested training large language models, in particular BERT with 1.5 billion parameters.

The results were quite good. On 128 accelerators, AWS achieved 82.7% scaling efficiency when training BERT – a very decent result. Using Gaudi's native BF16 support, AWS engineers managed to reduce the memory footprint and speed up training. As a result, with the help of Habana's software and the DeepSpeed library, they were able to pre-train the huge BERT model in 16 hours on a cluster of 128 accelerators.
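For reference, switching DeepSpeed to BF16 is mostly a configuration-level change. The snippet below is a generic, minimal sketch of such a config, not AWS's actual setup (their exact ZeRO stage, batch sizes, and Habana-specific settings are not described here):

```python
# Minimal, illustrative DeepSpeed config enabling BF16 plus ZeRO stage 1.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},          # store/compute in BF16 instead of FP32
    "zero_optimization": {"stage": 1},  # shard optimizer state across workers
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Typical launch with the deepspeed CLI:
#   deepspeed train_bert.py --deepspeed --deepspeed_config ds_config.json
```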

Conclusion

It is worth noting that Intel has achieved genuinely impressive, even breakthrough results at the level of Gaudi 3's architecture design. But good hardware is only part of the equation; the real test is stable, reliable software that is compatible with popular frameworks. NVIDIA has spent years building CUDA – the API, the documentation, the drivers – while cooperating closely with machine learning framework developers. Take any of the ten most popular neural network libraries, and the hardware acceleration there will most likely be tuned first and foremost for NVIDIA technologies. Not for Intel.
More broadly, even outside machine learning, Intel has never been known for stable, reliable software that plays well with third-party solutions. On the one hand, this vicious cycle is being broken by its discrete graphics cards for the PC market, which went from a disastrous launch to regular driver updates that eventually made them a financially attractive option in the budget segment. On the other hand, we have the AI modules in the latest generations of Intel processors, which at best simply do not work due to missing support or driver bugs, and at worst cause system-wide problems.

However, if Intel focuses its resources not only on designing new Gaudi models but also on a software ecosystem that meets the needs of its customers, it is safe to assume that this line of AI accelerators will not be forgotten the way its predecessors were.
What do you think about this? It will be interesting to read your opinion in the comments, and thanks for reading to the end!
