Artificial Intelligence in the Data Center Network: Huawei Experience

Following up on my talk at the AI Journey conference on December 4, I want to show how the correct application of AI systems in network management allows you to build modern data centers on Huawei solutions without bottlenecks and without packet loss. The benefits of such solutions are especially evident when a data center runs All-Flash storage, trains neural networks, or performs high-performance GPU computing.


Data center transformation

Data centers are changing conceptually, and changing dramatically. The trend became relatively widespread about ten years ago, but, say, in the banking sector, it began much earlier. Regardless of the chosen path, the goals of the transformations are more or less similar – the unification and consolidation of resources.

This is the first step, followed by further improvement of data center efficiency through automation, orchestration and transition to hybrid cloud mode. And the farthest limit of transformation attainable today is the introduction of artificial intelligence systems.

Huawei solutions for every stage of transformation

At each stage, depending on the “IT maturity” of the customer, Huawei offers its own solutions designed to provide the best modernization result without unnecessary expenses. Today I would like to talk in more detail about the “icing on the cake” – AI systems in modern data centers.

To draw an analogy with the human body, data center network switches act as a circulatory system, providing connectivity between various components: computing nodes, data storage systems, etc.

SSD storage became widely available only a few years ago, and CPU performance continues to grow. As a result, storage and compute nodes are no longer the main sources of latency. The data center network, however, has long remained the “little brother” in the data center structure.

Manufacturers have tried to solve the problem in different ways. Some chose the licensed InfiniBand (IB) technology to build the network; the result was a specialized network capable of solving only narrow-profile tasks. Others preferred to build network fabrics on the Fibre Channel (FC) protocol. Both approaches had their limitations: either the network bandwidth turned out to be relatively modest, or the total cost of the solution stung, further aggravated by dependence on a single vendor.

Our company took the path of open technologies. Huawei solutions are built on the second version of RoCE (RoCEv2), whose capabilities we have extended with additional proprietary algorithms in our switches. This has allowed us to seriously optimize what these networks can do.

Why don’t we see a future for classic FC solutions? The point is that they work on the principle of static credit allocation, which requires configuring the network fabric to the needs of your applications for a limited slice of time.

Recently FC has taken a step toward standalone storage networks, but it still carries performance limitations. The current mainstream is the sixth generation of the technology, delivering 32 Gbit/s of throughput; 64 Gbit/s solutions are only beginning to be deployed. Ethernet, meanwhile, using priority tables, can today deliver 100, 200, and even 400 Gbit/s to the server.
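To put these raw link speeds in perspective, here is a back-of-the-envelope calculation (my own illustration, not a benchmark) of how long an ideal link at each rate takes to move 1 TB:

```python
# Rough comparison of the raw link speeds mentioned above. Real-world
# throughput also depends on protocol overhead, congestion, storage speed, etc.

GBIT = 1e9  # bits per Gbit

def transfer_seconds(payload_tb: float, link_gbps: float) -> float:
    """Time to push `payload_tb` terabytes over an ideal link of `link_gbps` Gbit/s."""
    bits = payload_tb * 8e12  # 1 TB (decimal) = 8e12 bits
    return bits / (link_gbps * GBIT)

for name, gbps in [("FC Gen 6", 32), ("FC Gen 7", 64),
                   ("Ethernet", 100), ("Ethernet", 400)]:
    print(f"{name:9s} {gbps:4d} Gbit/s -> 1 TB in {transfer_seconds(1, gbps):6.1f} s")
```

Even before any intelligent optimization, the gap between a 32 Gbit/s FC link and a 400 Gbit/s Ethernet link is more than an order of magnitude.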

The added value of the data center network is of particular importance in a world where solid-state drives with high-speed interfaces are gaining market share, displacing classic spindle drives. Huawei is committed to enabling SSD storage to reach its full potential.

Next Generation Data Center Network

A small example of how we do it. The diagram shows one of our storage systems, recognized as the fastest in the world, together with our x86- and ARM-based servers, which deliver performance that meets the expectations of extremely demanding clients. In data centers built on these solutions, we manage to achieve end-to-end latency of no more than 0.1 ms. The use of new application technologies helps us reach this result.

The classic technologies used in storage were limited, in particular, by the rather high latencies caused by the SAS bus. Moving to new protocols such as NVMe significantly improved this parameter, and the network itself then became the limiting factor in performance.

Consider, within the same example, the use of networks with additional proprietary algorithms. They optimize end-to-end latency, dramatically increase network throughput, and raise the number of I/O operations per unit of time. This approach helps avoid the “double purchase” sometimes needed to reach the required performance parameters, and the total savings (in terms of TCO) when introducing a new network reach 18–40%, depending on the equipment used.

What are these wow algorithms?

Conventional technologies brought with them the usual problems, since they worked with static queue thresholds. Such a threshold imposed one fixed relationship between speed and latency on all applications, and manual control did not allow network parameters to be adjusted dynamically.
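A toy calculation (my own illustration, not Huawei’s algorithm) shows why one static threshold cannot suit every port and every application: the worst-case queuing delay it imposes is simply the threshold divided by the link rate, so the same byte threshold means very different latencies at different speeds.

```python
# A single static queue threshold ties latency to link speed: once the
# threshold is chosen, the worst-case queuing delay is fixed regardless of
# which application's traffic fills the queue.

def max_queuing_delay_us(threshold_kb: float, link_gbps: float) -> float:
    """Worst-case queuing delay (µs) for a queue capped at `threshold_kb` KB."""
    bits = threshold_kb * 1024 * 8
    return bits / (link_gbps * 1e9) * 1e6

# The same 1 MB static threshold yields very different delays per port speed:
for gbps in (25, 100, 400):
    print(f"{gbps:3d} Gbit/s: up to {max_queuing_delay_us(1024, gbps):.1f} µs of queuing delay")
```

A latency-sensitive application and a throughput-hungry one therefore cannot both be served well by one static setting, which is exactly the gap dynamic tuning is meant to close.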

Using additional machine learning chipsets in the switches, we taught the network to operate in a mode that allows building intelligent data center networks without packet loss (we called it iDCN).

How is smart optimization achieved? Those who work with neural networks will easily recognize familiar elements and training/inference mechanisms in the diagram. Our solutions combine embedded models with the ability to learn on a specific network.

The AI system accumulates a certain amount of knowledge about the network, which is then approximated and used for dynamic network configuration. Devices based on our own hardware use a dedicated AI chip; models built on licensed chipsets from American manufacturers use an add-on module and a software bus.

About the models used. We rely on a reinforcement learning approach. The system analyzes 100% of the data passing through the network device and selects a baseline. If, for example, you know the bandwidth and the latencies critical for a particular application, determining the baseline is not difficult. With a large number of applications, “median” calculations can be performed and adjustments made automatically, significantly increasing performance.
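As a rough sketch of the “median” idea (the actual Huawei model is not public, so the application profiles and numbers below are invented for illustration):

```python
# Hedged sketch: given per-application bandwidth/latency requirements,
# derive a network-wide median baseline around which queue parameters
# could be tuned. Profiles below are made up, not measured.
from dataclasses import dataclass
from statistics import median

@dataclass
class AppProfile:
    name: str
    bandwidth_gbps: float   # throughput the application needs
    max_latency_us: float   # latency it can tolerate

apps = [
    AppProfile("all-flash storage", 100, 50),
    AppProfile("AI training",       200, 100),
    AppProfile("backup",             40, 5000),
]

baseline_bw = median(a.bandwidth_gbps for a in apps)
baseline_lat = median(a.max_latency_us for a in apps)
print(f"baseline: {baseline_bw} Gbit/s, {baseline_lat} µs")
```

The median is robust to outliers such as the backup job, so one slow bulk application does not drag the baseline away from what the latency-critical workloads need.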

The diagram shows the process in more detail. At the start of network optimization, we calculate the threshold values, both minimum and maximum. Then a convolutional neural network (CNN) comes into play. This makes it possible to equalize the bandwidth and latency figures for each application, as well as to determine its total “weight” within the network services. Using this stratified approach, we get some really interesting insights.

When the application is unknown, a heuristic search algorithm is applied in conjunction with a “state machine”. With its help, we move counterclockwise along the block diagram shown above, identifying threshold values and building a model. This is an automatic process that can be adjusted manually as needed; if that is not necessary, it is easier to rely on the switch and its services.
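The threshold-probing loop can be sketched roughly like this (a hypothetical simplification: `observe_loss` stands in for real switch telemetry, and the “state machine” here degenerates into a binary search over the threshold window):

```python
# Hypothetical sketch of heuristic threshold search for an unknown
# application: probe a queue threshold, observe whether loss occurs, and
# narrow the [lo, hi] window until it converges.

def tune_threshold(observe_loss, lo_kb=64.0, hi_kb=4096.0, tol_kb=16.0):
    """Find (approximately) the smallest queue threshold in KB with no loss.

    Assumes `observe_loss(kb)` reports True when the threshold is too small,
    and that `hi_kb` is large enough to be lossless.
    """
    state = "PROBE"
    while state != "DONE":
        mid = (lo_kb + hi_kb) / 2
        if observe_loss(mid):   # loss seen -> threshold too small, raise floor
            lo_kb = mid
        else:                   # lossless -> try shrinking for lower latency
            hi_kb = mid
        if hi_kb - lo_kb <= tol_kb:
            state = "DONE"
    return hi_kb

# Pretend the (unknown) application needs at least ~900 KB of buffer:
best = tune_threshold(lambda kb: kb < 900)
print(f"selected threshold: {best:.0f} KB")
```

The result stays on the lossless side of the boundary while cutting the buffering, and thus the queuing latency, well below the initial upper bound.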

From theory to practice

By applying such algorithms and working at the level of the entire network, and not its individual slices, we solve all the main performance problems. There are already interesting cases of implementation and use of such technologies in the banking sector. These mechanisms are also in demand in other industries, for example, among telecom operators.

Let’s look at the results of open tests. The independent laboratory The Tolly Group tested our solution and compared it with Ethernet and IB solutions from other manufacturers. The tests showed that the performance of Huawei’s product is equivalent to IB and 27% better than that of other major Ethernet products.

The lossless data center network demonstrates maximum efficiency in several scenarios, such as:

  • AI training;
  • centralized storage;
  • distributed storage;
  • high performance GPU computing.

In conclusion, let’s consider one of the scenarios for using an intelligent data center network. Many customers use software-defined distributed storage (SDS). By integrating software storage systems from different manufacturers with our solution, you can achieve 40% higher performance than without it. In other words, when you know the required performance level of your SDS, you can reach it with roughly 30% fewer servers (1/1.4 ≈ 71% of the original count).
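A quick sanity check of the server-count arithmetic (the target and per-server numbers here are illustrative, not from any benchmark):

```python
# A 40% per-server performance gain scales the required server count by
# 1 / 1.4, i.e. to roughly 71% of the original fleet.
import math

def servers_needed(target_perf: float, per_server_perf: float) -> int:
    """Smallest whole number of servers whose combined performance meets the target."""
    return math.ceil(target_perf / per_server_perf)

before = servers_needed(1000, 10)        # 100 servers at baseline speed
after = servers_needed(1000, 10 * 1.4)   # same target, 40% faster servers
print(before, "->", after, f"({1 - after / before:.0%} fewer servers)")
```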

***

By the way, don’t forget about our numerous webinars, held not only in the Russian-speaking segment but also globally. The list of webinars for December is available at the link.
