Reinforcement Learning and Heuristic Analysis on Data Center Switches: Prerequisites and Benefits

Ahead of the AI Journey conference, which Huawei supports as a title partner and where several of our speakers will present, we decided to share a preview of our developments, in particular how we use artificial intelligence in smart data center networks. At the same time, we will explain why established technologies are not enough to build modern data center networks and why we need a "friendly hand" from AI.

What is happening in the field of so-called lossless networks

Over the years of rapid development in data transmission media, engineers have run into many phenomena that hinder building storage networks and high-performance computing clusters on Ethernet: packet loss, non-guaranteed delivery, deadlocks, microbursts, and other unpleasant things.

As a result, it was considered correct to build a reference dedicated network for a specific scenario:

  • InfiniBand (IB) for high-performance computing clusters;
  • Fibre Channel (FC) for classic storage networks;
  • Ethernet for general service tasks.

Attempts to achieve versatility looked something like the illustration below.

For some tasks, the vectors could align (as with the swan and the crayfish in the fable), and situational versatility was achieved, albeit with lower efficiency than a highly specialized design.

Today Huawei sees the future in multi-purpose converged fabrics and offers its customers the AI Fabric solution, designed, on the one hand, for scaling lossless network performance (up to 200 Gbps per server port in 2020) and, on the other, for boosting the performance of the applications themselves (the transition to RoCEv2).

By the way, we had a separate detailed post about the technical component of AI Fabric.

What needs optimization

Before talking about algorithms, it makes sense to clarify what exactly they are designed to improve.

With static ECN, as the number of servers sending to a single receiver grows, a suboptimal traffic pattern emerges: the so-called many-to-one incast model.

In traditional Ethernet, the balance between the probability of packet loss and the throughput of the network itself has to be struck manually.

We see the same problem when using the PFC/ECN combination without continuous tuning (see the figure below).
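The trade-off behind a static ECN threshold can be illustrated with a toy simulation (this is a simplified model for illustration, not the behavior of a real switch): senders increase their rate additively and halve it when the queue crosses the marking threshold, while packets beyond the buffer are tail-dropped.

```python
def run(n_senders, ecn_threshold, capacity=100, drain=10, steps=500):
    """Toy model of an egress queue under many-to-one incast.

    Each step: senders offer traffic, excess over the buffer is
    dropped, crossing the ECN threshold makes senders halve their
    rates (backoff), otherwise rates grow additively, and the link
    drains a fixed amount. Returns (delivered, dropped).
    """
    rates = [1.0] * n_senders          # packets per step per sender
    queue = delivered = dropped = 0.0
    for _ in range(steps):
        queue += sum(rates)
        if queue > capacity:           # buffer overflow -> tail drop
            dropped += queue - capacity
            queue = capacity
        if queue > ecn_threshold:      # ECN mark -> multiplicative backoff
            rates = [r / 2 for r in rates]
        else:
            rates = [r + 0.1 for r in rates]   # additive increase
        sent = min(queue, drain)
        delivered += sent
        queue -= sent
    return delivered, dropped
```

With a low threshold, senders back off long before the link is saturated, sacrificing throughput; with a threshold close to the buffer size, an incast burst from many senders overflows the buffer and causes drops. A static value can only pick one side of this trade-off.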

To solve the described problems, we use the AI ECN algorithm, whose essence is adjusting ECN thresholds in real time. The diagram below shows how this looks.
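The core idea of adjusting thresholds on the fly can be sketched as a simple feedback rule (an illustrative stand-in, not Huawei's actual AI ECN logic; the bounds and step sizes are made-up values):

```python
def adjust_threshold(threshold, queue_len, drops, link_util, lo=10, hi=90):
    """Push the ECN threshold down under latency/loss pressure and
    up when the link is underutilized (throughput pressure)."""
    if drops > 0 or queue_len > threshold * 1.5:
        # Congestion signal: back off quickly to a lower threshold.
        threshold = max(lo, int(threshold * 0.8))
    elif link_util < 0.9:
        # Link has headroom: probe upward for more throughput.
        threshold = min(hi, threshold + 2)
    return threshold
```

A real implementation reacts to telemetry sampled at line rate; the point here is only that the threshold becomes a moving operating point rather than a fixed compromise.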

Previously, with the Broadcom chipset + Ascend 310 AI processor combination, we had a limited set of options for tuning these parameters.

We can loosely call this option Software AI ECN, since the logic runs on a separate chip and its results are then pushed down into the commercial chipset.

Models equipped with the Huawei P5 chipset have much broader "AI capabilities" (especially in the latest release), since a significant part of the required functionality is implemented in the chipset itself.

How we use algorithms

Using the Ascend 310 (or the module built into the P-card), we analyze the traffic and compare it against a benchmark set of known applications.

For known applications, traffic parameters are optimized on the fly; for unknown applications, we proceed to the next step.
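Matching traffic against a benchmark of known applications can be sketched as nearest-neighbor matching on flow features (the fingerprint values and feature set below are invented for illustration; the real classifier is a CNN, as described next):

```python
import math

# Hypothetical fingerprints: (avg packet size in bytes, flows per
# second, burstiness). All numbers are made up for this sketch.
KNOWN_APPS = {
    "distributed_storage": (4096.0, 50.0, 0.2),
    "hpc_mpi":             (512.0, 400.0, 0.8),
    "ai_training":         (8192.0, 30.0, 0.6),
}

def classify(features, max_distance=0.5):
    """Return the closest known application, or None for unknown
    traffic (which then falls through to the learning path)."""
    def norm(v):  # scale each dimension to roughly [0, 1]
        return (v[0] / 10000.0, v[1] / 500.0, v[2])
    f = norm(features)
    best, best_d = None, float("inf")
    for app, fingerprint in KNOWN_APPS.items():
        d = math.dist(f, norm(fingerprint))
        if d < best_d:
            best, best_d = app, d
    return best if best_d <= max_distance else None
```

Traffic close to a stored fingerprint gets the precomputed optimization; anything too far from every fingerprint is treated as unknown.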

Key points:

  1. DDQN reinforcement learning explores, accumulates a large set of baseline configurations, and searches for the best ECN matching strategy.
  2. A CNN classifier identifies the scenario and determines whether the DDQN-recommended threshold is reliable.
  3. If the recommended threshold is not reliable, a heuristic method corrects it, so that the solution generalizes.

This approach lets the mechanisms adapt to unknown applications, and if you really want to, you can upload a model for your own application through the Northbound API of the switch management system.
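The interplay of the learning and the heuristic fallback can be sketched with tabular double Q-learning (a minimal stand-in for the DDQN used on the switch; the action space, hyperparameters, and clamping band are assumptions for illustration):

```python
import random

# Discretized ECN threshold candidates (KB) -- a toy action space.
ACTIONS = [10, 20, 40, 60, 80]

class DoubleQAgent:
    """Two value tables: one selects the best next action, the other
    evaluates it, reducing the overestimation bias of plain Q-learning
    -- the core idea behind DDQN."""

    def __init__(self, n_states, alpha=0.1, gamma=0.9, eps=0.1):
        self.qa = [[0.0] * len(ACTIONS) for _ in range(n_states)]
        self.qb = [[0.0] * len(ACTIONS) for _ in range(n_states)]
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def act(self, state):
        if random.random() < self.eps:          # exploration
            return random.randrange(len(ACTIONS))
        combined = [a + b for a, b in zip(self.qa[state], self.qb[state])]
        return combined.index(max(combined))

    def update(self, s, a, reward, s2):
        # Randomly pick which table to update; the other one evaluates.
        if random.random() < 0.5:
            best = self.qa[s2].index(max(self.qa[s2]))
            target = reward + self.gamma * self.qb[s2][best]
            self.qa[s][a] += self.alpha * (target - self.qa[s][a])
        else:
            best = self.qb[s2].index(max(self.qb[s2]))
            target = reward + self.gamma * self.qa[s2][best]
            self.qb[s][a] += self.alpha * (target - self.qb[s][a])

def safe_threshold(recommended, confident, lo=20, hi=80):
    """Heuristic fallback: when the classifier is not confident in the
    scenario, clamp the RL recommendation into a conservative band."""
    return recommended if confident else min(hi, max(lo, recommended))
```

The reward would come from observed throughput, latency, and loss after each threshold change; the clamp keeps an unvetted recommendation from pushing the network into a risky configuration.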


What do we get

After such a cycle of adaptation, with network thresholds and additional settings adjusted, several classes of problems can be eliminated at once.

  • Performance issues: low bandwidth, high latency, packet loss, jitter.
  • PFC issues: PFC deadlocks, head-of-line blocking, PFC storms, and other system-level problems that PFC can cause.
  • RDMA application challenges: AI, high-performance computing, distributed storage, and combinations thereof; RDMA applications are sensitive to network performance.


Ultimately, additional machine learning algorithms help us solve the classic problems of an "unresponsive" Ethernet network environment. This brings us one step closer to an ecosystem of transparent, convenient end-to-end network services instead of a set of disparate technologies and products.


Huawei solutions keep appearing in our online library, including on the topics covered in this post (for example, building full-scale AI solutions for various smart data center scenarios). You can find the list of our webinars for the coming weeks at the link.
