Notes from the Computing, Memory, and Storage Summit

Hi, this is Sergey Bashirov, a lead developer on the R&D team at Cloud.ru. I recently attended another Compute, Memory, and Storage Summit, which featured quite a few talks on Compute Express Link (CXL). In this article I give a brief summary of the presentations and share my observations and conclusions: how CXL is useful and how the technology works, which scenarios it enables in cloud infrastructure, and which talks on the topic are worth a look.

Why CXL if there is PCIe?

Many people hearing about CXL for the first time, or who have not dug into the technical details, may ask a perfectly fair question: why another bus when there is already PCI Express (PCIe)? Two key shortcomings of PCIe come to mind right away:

  • low throughput,

  • non-coherent access to device resources.

What is coherence

Cache coherence in a shared-memory multiprocessor system is the ability of all processors to see the same values in the same memory location, even when it is modified.

One way to implement such a system is to monitor transactions on the bus — snooping. This lets the participants of the system (agents) find out promptly that their cached data has changed. Buses built this way are called coherent buses.
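To make this concrete, below is a minimal sketch in C (my own illustration, not tied to any particular bus): one thread publishes a value, another spins until it sees the flag. It is cache coherence at the hardware level that makes the writer's store show up in the reader's cache without any explicit flushing in software.

// Build: cc -std=c11 -pthread coherence_demo.c   (hypothetical file name)
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int data  = 0;
static _Atomic int ready = 0;

static void *writer(void *arg) {
    (void)arg;
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&ready, 1, memory_order_release);  /* publish */
    return NULL;
}

static void *reader(void *arg) {
    (void)arg;
    /* Spin until the writer's store reaches this core. Snooping invalidates
       or updates the stale cached copy of "ready" for us. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    printf("reader sees data = %d\n",
           atomic_load_explicit(&data, memory_order_relaxed));
    return NULL;
}

int main(void) {
    pthread_t r, w;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}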

As an example, consider the diagram below: the MMIO and PCIe physical memory ranges are configured by the BIOS and the operating system as uncached regions, so every processor access to these addresses becomes a slow I/O operation rather than a cache hit.

Source: Intel® Xeon® Processor 7500 Series Datasheet, Volume 2
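On Linux you can inspect these physical ranges yourself. The sketch below simply filters /proc/iomem for PCI-related entries (MMIO windows, MMCONFIG space); run it as root to see real addresses instead of zeros.

/* Linux-only illustration of where the PCIe MMIO ranges from the diagram live. */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/iomem", "r");
    if (!f) { perror("/proc/iomem"); return 1; }
    char line[256];
    while (fgets(line, sizeof line, f))
        if (strstr(line, "PCI"))   /* e.g. "PCI Bus 0000:00", "PCI MMCONFIG" */
            fputs(line, stdout);
    fclose(f);
    return 0;
}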

You can read more about how cache coherence is implemented in a processor on the wiki page or on the Real World Tech website.

For supercomputers, PCIe bandwidth is not enough, especially when it comes to complex scientific simulations and training large AI models. Why? Such workloads typically run on many GPUs at once, distributed across a cluster with several GPUs per host. Intermediate data has to be exchanged between GPUs both within a host and over the network, and modern GPUs are so fast that the PCIe bus becomes the bottleneck.

This limitation has prompted vendors to develop proprietary interconnects that link GPUs directly to each other. A single NVIDIA H100 Tensor Core GPU supports up to 18 NVLink connections with a total throughput of 900 GB/s, more than seven times what the fifth-generation PCIe interface offers with its roughly 128 GB/s.
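A quick back-of-the-envelope check of these figures, under my own assumptions: a x16 PCIe 5.0 link at 32 GT/s per lane counted in both directions, and NVLink on the H100 counted as 18 links at 50 GB/s each.

#include <stdio.h>

int main(void) {
    double pcie_lane_gbps = 32.0;                          /* GT/s ~ Gbit/s per lane, raw */
    double pcie_x16_bidir = pcie_lane_gbps * 16 / 8 * 2;   /* GB/s, both directions */
    double nvlink_total   = 18 * 50.0;                     /* GB/s, both directions */

    printf("PCIe 5.0 x16: ~%.0f GB/s\n", pcie_x16_bidir);  /* ~128 GB/s */
    printf("NVLink total: %.0f GB/s (~%.1fx PCIe)\n",
           nvlink_total, nvlink_total / pcie_x16_bidir);   /* ~7x */
    return 0;
}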

Coherence is a more interesting story. Since PCIe is a non-coherent bus, compute devices cannot efficiently share resources with each other and with the host. For example, a device cannot cache data on its side unless it owns that data exclusively. As a result, parallel processing of the same data by several accelerators, or by an accelerator and the CPU, becomes difficult and slow.

Another problem concerns host memory expanders: such devices need to integrate with the processor cache at the hardware level. Otherwise the performance of the solution will be low, and programmers will get extra work synchronizing data in code, along with a good chance of making mistakes.

So the interest in developing the CXL bus is justified – the industry is looking to support new types of devices and find new application scenarios.

CXL Bus Organization

Another important thing to say about PCIe is that it is organized as a network. Unlike legacy PCI, which has, for example, dedicated interrupt pins, PCIe communicates using messages only and merely emulates those legacy interrupt lines.

CXL reuses the PCIe physical and electrical interface — the same wires, the same network — but its logical messages are different. So to support the protocol you only need to change the state machines and firmware of devices, switches, and root complexes, as well as the drivers.

For example, when a CXL device is plugged into a universal slot, the host first talks to it over the PCIe 5.0 protocol to negotiate the link parameters and check that the processor supports CXL. If everything goes well, the CXL protocol is used from then on; if not, the device falls back to communicating with the host over plain PCIe.
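On a Linux host with the CXL subsystem enabled, devices that completed this negotiation are exposed in sysfs. The sketch below just lists them; the path /sys/bus/cxl/devices and the naming (memX, rootX, portX) depend on the kernel version, so treat them as assumptions to verify on your system.

#include <dirent.h>
#include <stdio.h>

int main(void) {
    const char *path = "/sys/bus/cxl/devices";  /* kernel-dependent location */
    DIR *dir = opendir(path);
    if (!dir) { perror(path); return 1; }
    struct dirent *e;
    while ((e = readdir(dir)) != NULL)
        if (e->d_name[0] != '.')
            printf("%s\n", e->d_name);          /* e.g. mem0, root0, port1 */
    closedir(dir);
    return 0;
}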

CXL Protocols

CXL has three communication protocols:

  • CXL.io — responsible for the non-coherent interface of the device. In functionality and purpose it mirrors the PCIe protocol; it is mainly used for device discovery, power management, register access, and error reporting;

  • CXL.cache — responsible for accessing host memory and caching its contents; it also accepts snoop requests from the host and returns modified data to maintain coherence;

  • CXL.memory — provides the CPU with a transactional interface to the device's memory.

Together, the three protocols make it possible to set up shared memory between the host and the device. Hardware-level data synchronization keeps latency to a minimum, and since data is cached not only by the CPU but also by the accelerator, performance can improve.

CXL Device Types

There are three types of devices:

Type 1 Devices — use the CXL.io and CXL.cache protocols. These are accelerators without their own RAM, for example coprocessors or SmartNICs. Such devices can get coherent access to host RAM, enabling a zero-copy approach to data processing.

Type 2 Devices — use the CXL.io, CXL.cache, and CXL.memory protocols. This class includes accelerators with their own RAM, for example GPUs, ASICs, or FPGAs. They can both get cached access to host RAM and give the CPU cached access to their own memory, forming a heterogeneous computing system.

Type 3 Devices — use the CXL.io and CXL.memory protocols. These are memory expanders. Such devices have no compute resources of their own, but can provide the CPU with cached access to additional volatile or persistent memory.

CXL Device Types. Source: CXL 3.1 Specification

Thus, the familiar set of PCI devices is supplemented by new types of CXL accelerators and memory expanders.

CXL Cloud Use Cases

How can CXL devices be useful in cloud infrastructure? Below are the main options.

Offloading. Specialized accelerators, including ones with their own RAM or persistent memory on board, let you move the cloud infrastructure stack onto them and give more CPU cores to virtual machines. It even becomes possible to provide the isolation needed to connect dedicated bare-metal hosts to the cloud.

It is also possible to increase the performance of storage systems and databases, and in some cases reduce the amount of data transferred over a network or bus, by performing computations on the storage device side. For example, such a device can analyze, encrypt, compress, calculate, filter, or aggregate data on the fly.

Memory Expander. Expanding the host's RAM to tens or even hundreds of terabytes. Particularly relevant for AI tasks and high-performance computing (HPC).
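If the expander shows up in the OS as a CPU-less NUMA node, which is how Linux commonly presents Type 3 devices, applications can place data on it through the regular NUMA API. A minimal sketch under that assumption follows; the node id is hypothetical, check numactl --hardware on your machine. Build: cc cxl_alloc.c -lnuma

#include <numa.h>
#include <stdio.h>
#include <string.h>

#define CXL_NODE 2          /* hypothetical node id of the CXL expander */
#define SIZE (64UL << 20)   /* 64 MiB */

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    /* Allocate pages whose backing memory comes from the CXL node. */
    void *buf = numa_alloc_onnode(SIZE, CXL_NODE);
    if (!buf) { fprintf(stderr, "numa_alloc_onnode failed\n"); return 1; }
    memset(buf, 0, SIZE);   /* touch the pages so they are actually faulted in */
    printf("allocated %lu MiB on node %d\n", SIZE >> 20, CXL_NODE);
    numa_free(buf, SIZE);
    return 0;
}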

Memory Pooling and Sharing. Multiple compute nodes are connected to CXL switches with a large amount of memory attached to them. Instead of reserving a lot of memory locally on each host, memory can then be allocated from the pool on demand for large tasks, which reduces the cost of the servers.

You can also set up live migration of virtual machines by evicting their RAM entirely into a shared pool and moving the computation to another host. This requires support for keeping cold pages of VM memory in the shared CXL pool; during migration the hot pages are evicted there as well. This helps avoid unnecessary load on the network.
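The building block behind such page eviction is ordinary page migration. Below is a hedged sketch of the idea: demote one anonymous page to a CXL-backed NUMA node with move_pages(2). The node id is again hypothetical, and in practice demotion is driven by the kernel or a tiering daemon rather than by hand-written code like this. Build: cc demote.c -lnuma

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE);
    char *buf = aligned_alloc(page, page);
    memset(buf, 0xAB, page);            /* fault the page in on a local node */

    void *pages[1]  = { buf };
    int   nodes[1]  = { 2 };            /* hypothetical CXL node id */
    int   status[1];

    /* Ask the kernel to move this page to the target node. */
    if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) != 0)
        perror("move_pages");
    else
        printf("page now on node %d\n", status[0]);

    free(buf);
    return 0;
}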

Alternative technologies

Is it possible to get by without CXL? Generally speaking, yes.

For example, you can use RDMA, where data copying is offloaded from the CPU to the network card. Pros: since it does not attempt coherence, RDMA is a widely available technology with a simpler implementation, and you can build memory expander and memory pooling scenarios with it, such as the same eviction of cold pages from local memory to a remote host. Cons: this requires complex software, whereas CXL can do it in hardware, and shared-memory scenarios are hard to set up over RDMA.

You can also use PCIe NTB (non-transparent bridging), a technology that links several PCIe domains together. Devices on different hosts can interact with each other and, most importantly, you can configure access to remote memory. As with RDMA, the CPU does not copy data to the remote host; the PCIe root complex does it. Memory expander and memory pooling scenarios can be implemented in software. But, as you remember, PCIe is a non-coherent bus, so shared-memory scenarios are just as hard to make efficient as with RDMA. You can also run RDMA over NTB.

Both technologies make a disaggregated memory approach possible. However, they cannot produce an efficient heterogeneous system: when several compute devices work on the same data in parallel, it cannot be cached as easily as with CXL.

Conclusion

What conclusions did I draw about CXL technology following the summit?

First, the new CXL bus does not improve throughput compared to PCIe. However, it brings new functionality that opens up interesting application scenarios, including in cloud infrastructure. And compared to alternative approaches to remote memory, a fully hardware implementation promises very low latency. Servers based on AMD EPYC Gen4 processors already support RAM expansion over the CXL bus, and servers based on Intel Xeon Scalable Gen5 processors support all three types of CXL devices.

Second, it is curious that the CXL specification has already reached version 3.1, while the hardware you can buy today implements versions only up to 2.0. Most likely the main application scenarios are quite narrowly specialized, so adoption of the technology is slow. Hundreds of terabytes of RAM is a very exotic setup, and hardware accelerators tailored to a specific task are an expensive and complex design that only large vendors can afford. It turns out that without the prospect of mass adoption such hardware evolves slowly. In any case, our team is really looking forward to trying it out in action.

I marked these talks about CXL as the most interesting – you can read them or watch the recordings:

And overall, it was an interesting summit for those who like to talk about “spaceships”. The talk I remember most was Breakthrough in Cyber Security Detection Using Computational Storage. The authors have learned to detect malicious activity (anomaly detection) on the drive itself, without using host resources, to protect their customers from ransomware. I recommend watching it.

And if you want to hear about our research projects (and not only those), come to the GoCloud Tech conference on October 24 in Moscow or join the online broadcast.

