SW26010 Pro with 13.8 teraflops

Earlier this year, the National Supercomputing Center in Changsha (Hunan Province, China) launched a new supercomputer based on the 384-core Sunway SW26010 Pro processor. It is worth noting that this chip was developed in China. Read on to find out what kind of processor this is and how powerful it is.


What about performance?

This is not the first Chinese-designed processor. But the SW26010 Pro, details of which first appeared in 2021, is a significantly more powerful chip than its predecessor, the SW26010. The architecture, by the way, remains the same: the SW26010 Pro uses a 64-bit RISC platform.

Recently at SC23, the company showed off its finished processors and revealed more details about their architecture and design. The new processor is expected to let China build powerful supercomputers based entirely on processors of its own design. The peak FP64 performance of the Sunway SW26010 Pro is 13.8 teraflops. For comparison, the 96-core AMD EPYC 9654 processor has a peak FP64 performance of about 5.4 teraflops.

The chip combines six core groups (CGs) and a protocol processing unit (PPU). Each CG has 64 computing processing elements (CPEs) with a 512-bit vector engine. In addition, there is 256 KB of fast data cache and 16 KB of instruction cache. Accordingly, the Pro version has 384 cores, while the previous generation of the chip had 256.

The chip also includes one Management Processing Element (MPE) per CG: a superscalar core with out-of-order execution and a vector engine, 32 KB of instruction cache, 32 KB of L1 data cache, 256 KB of L2 cache and a 128-bit DDR4-3200 memory interface.

The MPE and CPEs use a directory-based protocol to keep data exchange consistent. This reduces the amount of information exchanged between cores while keeping their interaction precise, which is important for applications with infrequent access to shared data.

As for clock speeds, the CPEs run at 2.25 GHz and the MPEs at 2.10 GHz, versus 1.45 GHz for both in the predecessor. Peak performance, as mentioned above, reaches 13.8 teraflops in FP64 and 27.6 teraflops in FP32. The previous model delivers 2.9 teraflops in FP64, and the AMD EPYC 9654 (Genoa) processor about 5.4 teraflops.
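The headline numbers can be roughly reproduced from the core count and clock speed. The sketch below is only a back-of-the-envelope check, not a vendor formula: it assumes each CPE retires one 512-bit fused multiply-add per cycle (counted as two operations per lane, which gives the 16 FP64 flops per cycle quoted in this article) and it ignores the six MPEs. The function and constant names are made up for illustration.

```python
# Back-of-the-envelope peak throughput estimate for the SW26010 Pro.
# Figures from the article: 6 core groups x 64 CPEs, 2.25 GHz, 512-bit vectors.
CORE_GROUPS = 6
CPES_PER_GROUP = 64
CPE_CLOCK_GHZ = 2.25
VECTOR_BITS = 512
OPS_PER_LANE = 2  # assumption: one fused multiply-add counts as 2 flops


def peak_tflops(element_bits: int) -> float:
    """Theoretical peak in teraflops for a given element width (64 or 32 bits)."""
    lanes = VECTOR_BITS // element_bits      # 8 lanes for FP64, 16 for FP32
    flops_per_cycle = lanes * OPS_PER_LANE   # 16 for FP64, 32 for FP32
    cores = CORE_GROUPS * CPES_PER_GROUP     # 384 CPEs, MPEs ignored
    return cores * CPE_CLOCK_GHZ * flops_per_cycle / 1000.0


if __name__ == "__main__":
    print(f"FP64 peak: {peak_tflops(64):.1f} TFLOPS")
    print(f"FP32 peak: {peak_tflops(32):.1f} TFLOPS")
```

Under these assumptions the estimate lands on the quoted 13.8 and 27.6 teraflops, which suggests the published peak counts the CPEs only.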

It is also worth mentioning that each CG supports twice as much RAM as in the predecessor: 16 GB of DDR4 instead of 8 GB of DDR3 in the SW26010. The maximum amount of RAM is 96 GB. That said, the SW26010 Pro still has limitations in cache and memory performance: 256 KB of cache per CPE is not enough in the absence of a proper L2 cache, and the dual-channel DDR4-3200 memory subsystem (51.2 GB/s) is barely enough for 64 cores, each of which has a 512-bit vector FPU and delivers up to 16 flops per cycle (FP64).
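To put the bandwidth concern in numbers, the sketch below compares the peak compute of one core group with its memory bandwidth. It uses only figures quoted in this article (64 CPEs, 2.25 GHz, 16 FP64 flops per cycle, a 128-bit DDR4-3200 interface) and treats them as theoretical peaks; the helper names are hypothetical.

```python
# Rough compute-to-bandwidth balance for a single SW26010 Pro core group (CG).
CPES_PER_GROUP = 64
CPE_CLOCK_GHZ = 2.25
FP64_FLOPS_PER_CYCLE = 16
FP64_BYTES = 8


def ddr_bandwidth_gbs(bus_bits: int, mega_transfers: int) -> float:
    """Peak DRAM bandwidth in GB/s: bus width in bytes times transfer rate."""
    return bus_bits / 8 * mega_transfers / 1000.0


def group_peak_gflops() -> float:
    """Peak FP64 GFLOPS of the 64 CPEs in one core group."""
    return CPES_PER_GROUP * CPE_CLOCK_GHZ * FP64_FLOPS_PER_CYCLE


def flops_per_double_loaded(bandwidth_gbs: float) -> float:
    """Flops needed per FP64 value fetched from DRAM to stay at peak."""
    doubles_per_ns = bandwidth_gbs / FP64_BYTES
    return group_peak_gflops() / doubles_per_ns


if __name__ == "__main__":
    bw = ddr_bandwidth_gbs(128, 3200)
    print(f"CG bandwidth: {bw:.1f} GB/s ({bw / CPES_PER_GROUP:.2f} GB/s per CPE)")
    print(f"CG peak: {group_peak_gflops():.0f} GFLOPS")
    print(f"Bytes per flop: {bw / group_peak_gflops():.4f}")
    print(f"Flops per double loaded: {flops_per_double_loaded(bw):.0f}")
```

By this estimate each CPE gets about 0.8 GB/s of DRAM bandwidth, and a kernel would need to perform several hundred floating-point operations per value fetched from memory to stay near peak, so data reuse in the local cache becomes essential.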

This means that the new processor has two main drawbacks: a weak caching subsystem (which can be mitigated with software optimization, but at a considerable cost in time and money) and limited memory bandwidth. As a result, it remains to be seen whether it can be used to build systems for solving complex real-world problems, including supercomputers with performance of several exaflops.

But in any case, the SW26010 Pro is a very noticeable improvement over the SW26010, especially in terms of memory capacity, compute density and overall performance. These improvements may indicate the continued development of the supercomputing industry in China.

And also an analog processor that is 3000 times faster than the Nvidia A100 GPU


We have already written about this chip. A team of scientists from Tsinghua University has created an analog photoelectronic chip that, according to the developers themselves, is capable of taking the machine vision industry to a new level. The chip is called ACCEL (All-analog Chip Combining Electronic and Light Computing).

The new development uses technologies and advances from the photonic computing industry, where light is used to process data. In particular, the chip applies both diffractive optical analog computing (OAC) and electronic analog computing (EAC), which can significantly increase energy efficiency and performance.

System energy efficiency reaches 74.8 peta-operations per second per watt. Compute speed is 4.6 peta-operations per second (more than 99% of them performed optically), more than three times faster than today’s high-end GPUs. Thanks to a combination of optoelectronic computing and adaptive learning, ACCEL is very good at distinguishing objects in images.

The chip developers compared ACCEL with different neural networks running on a modern NVIDIA A100 GPU on the same task, and the results are noteworthy. With sequential image processing, ACCEL achieved a latency of 72 ns/frame and an energy consumption of 4.38 nJ/frame. This is far less than the NVIDIA A100, which has a latency of about 0.26 ms/frame and an energy consumption of about 18.5 mJ/frame.

In terms of computational speed and power consumption, ACCEL achieved 4.6 peta-operations per second in laboratory tests, which is 3000 times faster than the widely used commercial AI chip Nvidia A100, while consuming 4 million times less energy. The scientists also measured accuracy: 85.5% on Fashion-MNIST, 82% on 3-class ImageNet classification, and 92.6% on time-lapse video recognition.
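The "3000 times" and "4 million times" figures can be sanity-checked against the per-frame latency and energy numbers quoted above. The short sketch below simply divides the A100 values by the ACCEL values; the constant and function names are illustrative, and the inputs are the rounded figures from this article rather than raw measurements.

```python
# Sanity check of the ACCEL vs. NVIDIA A100 ratios from the per-frame figures.
ACCEL_LATENCY_S = 72e-9    # 72 ns per frame
ACCEL_ENERGY_J = 4.38e-9   # 4.38 nJ per frame
A100_LATENCY_S = 0.26e-3   # about 0.26 ms per frame
A100_ENERGY_J = 18.5e-3    # about 18.5 mJ per frame


def improvement(baseline: float, candidate: float) -> float:
    """How many times smaller the candidate value is than the baseline."""
    return baseline / candidate


if __name__ == "__main__":
    speedup = improvement(A100_LATENCY_S, ACCEL_LATENCY_S)
    energy_saving = improvement(A100_ENERGY_J, ACCEL_ENERGY_J)
    print(f"Latency ratio: about {speedup:,.0f}x")
    print(f"Energy ratio:  about {energy_saving:,.0f}x")
```

The ratios come out at roughly 3,600 for latency and 4.2 million for energy per frame, in line with the rounded claims. Note that these apply to this specific image-processing task, not to general-purpose workloads.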

Overall, China’s electronics development and manufacturing industry is clearly growing. Of course, it faces a huge number of problems: misappropriation of funds, a shortage of specialists, and sanctions from the United States. Even so, Chinese scientists and engineers are making very noticeable progress. What happens next, time will tell.
