Move over, CUDA: Intel announces 7nm GPU for data centers

According to analysts, the data center market will grow by 38% per year over the coming years, reaching $35 billion in five years. Its most compute-intensive niche is deep learning, neural networks and AI workloads.

Of course, Intel is not going to watch indifferently as Nvidia (and, to a lesser extent, AMD) captures this market with its GPUs, including its fastest-growing sector. Last week, the microelectronics giant made several high-profile announcements at once:

  • Nervana NNP-T1000 and NNP-I1000 neural network processors (NNP), as well as the Movidius VPU chip;
  • 10nm Xeon Scalable processors (codenamed Sapphire Rapids);
  • the unified oneAPI programming interface (for CPU, GPU, FPGA), a competitor to Nvidia CUDA;
  • a 7nm GPU for data centers, code-named Ponte Vecchio, on the new Xe architecture.

Aurora Computing Modules

These CPUs, GPUs and oneAPI will form the basis of the Aurora compute modules for the supercomputer of the same name, which is to deliver 1 exaflops (10^18 operations per second). The machine is expected to be installed at the US Department of Energy's Argonne National Laboratory.

Each compute module has two Sapphire Rapids processors and six GPUs connected via the CXL bus.

According to AnandTech's estimates, the stated 200-rack system, after subtracting the space reserved for networking and storage, will fit approximately 2,400 two-unit (2U) Aurora nodes. That is a total of about 5,000 Sapphire Rapids processors and 15,000 Ponte Vecchio GPUs. Dividing the declared performance of 1 exaflops by the number of GPUs gives about 66.6 teraflops per GPU. Even assuming that each CPU contributes 14 teraflops, we still get about 50 teraflops per GPU, which is a roughly fivefold increase in data center GPU performance by 2021.
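The arithmetic behind this estimate can be reproduced in a few lines. The sketch below uses only the round figures quoted above; the function name and structure are illustrative and are not taken from AnandTech's analysis. With these inputs, subtracting the CPU share alone leaves roughly 62 teraflops per GPU, so the figure of about 50 presumably builds in further margin that the text does not spell out.

```python
# Back-of-the-envelope check of the per-GPU estimate.
# All inputs are the article's round figures; per_gpu_teraflops is a hypothetical helper.

EXAFLOPS = 1e18   # declared system performance, operations per second
TERA = 1e12

def per_gpu_teraflops(total_flops, n_gpus, n_cpus=0, cpu_teraflops=0.0):
    """Spread the total performance over the GPUs after removing the CPU share."""
    cpu_flops = n_cpus * cpu_teraflops * TERA
    return (total_flops - cpu_flops) / n_gpus / TERA

nodes = 2400
cpus_exact = nodes * 2   # 4,800, rounded to about 5,000 in the text
gpus_exact = nodes * 6   # 14,400, rounded to about 15,000 in the text

gpu_only = per_gpu_teraflops(EXAFLOPS, 15_000)
with_cpus = per_gpu_teraflops(EXAFLOPS, 15_000, n_cpus=5_000, cpu_teraflops=14)

print(f"CPUs: {cpus_exact}, GPUs: {gpus_exact}")
print(f"GPUs only:       {gpu_only:.1f} TFLOPS per GPU")
print(f"minus CPU share: {with_cpus:.1f} TFLOPS per GPU")
```

Swapping in the unrounded counts (14,400 GPUs and 4,800 CPUs) shifts the result by a few teraflops but does not change the order of magnitude.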

Of course, the plans are not limited to a supercomputer for the Department of Energy. Intel announced that Lenovo and Atos are already preparing to release server platforms based on Xeon CPUs, Xe GPUs and oneAPI. Thus, Aurora compute modules will find their way, in some form, into other data centers.

The supercomputer is due to launch in 2021. The 7nm Xe GPUs should reach the market at the same time.

According to Intel, traditional high-performance computing (HPC) is now converging with AI and moving to workloads that use deep learning. HPC, AI and analytics are the three main workloads driving demand for computing resources. “Such a variety of computing needs encourages heterogeneous computing,” said Rajeeb Hazra, vice president and general manager of Intel Enterprise and Government. “Universal solutions are no longer suitable here. In this era of convergence, you should look at architectures that are tuned to the different needs of different types of workloads.”

GPU for data centers

Ponte Vecchio is the first GPU on the new Xe architecture. The architecture itself will become the basis for GPUs in various segments:

  • high performance computing;
  • deep learning;
  • cloud computing;
  • graphics;
  • media transcoding;
  • workstations;
  • gaming computers;
  • regular desktop PCs;
  • mobile and ultramobile devices.

Ari Rauch, Intel's vice president of architecture, graphics, and software, says a single GPU architecture will give developers a “common structure,” but within this architecture the company is developing “a lot of microarchitectures that provide maximum effective performance for each of these workloads.”

The Ponte Vecchio GPU is based on an Xe microarchitecture designed specifically for HPC and AI. Its features include a flexible parallel vector-matrix engine, high-throughput double-precision floating-point (FP64) computing, and ultra-high-bandwidth cache and memory. For the INT8, Bfloat16 and FP32 formats, there will be a separate Matrix Engine for parallel matrix processing (possibly an analog of Nvidia's Tensor Cores), and for FP64, the speedup will be up to 40 times per compute unit.
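Bfloat16, one of the formats named above, is generally described as the upper half of an IEEE 754 single-precision value: 1 sign bit, 8 exponent bits and 7 mantissa bits. The following is a minimal Python sketch of that layout, assuming plain truncation of the low 16 bits (hardware typically rounds instead, and special values such as NaN are not handled carefully here). The function names are illustrative, and nothing in the sketch reflects how Intel's Matrix Engine is implemented.

```python
import struct

def to_bfloat16_bits(x: float) -> int:
    """Keep the top 16 bits of the float32 encoding of x (truncation, no rounding)."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16

def from_bfloat16_bits(bits16: int) -> float:
    """Expand a bfloat16 bit pattern to a float by zero-filling the low 16 bits."""
    (value,) = struct.unpack(">f", struct.pack(">I", (bits16 & 0xFFFF) << 16))
    return value

def describe(bits16: int) -> tuple:
    """Split a bfloat16 pattern into (sign, exponent, mantissa): 1, 8 and 7 bits."""
    return (bits16 >> 15) & 0x1, (bits16 >> 7) & 0xFF, bits16 & 0x7F

x = 3.14159
b = to_bfloat16_bits(x)
print(f"{x} -> 0x{b:04X} -> {from_bfloat16_bits(b)}")
print("sign, exponent, mantissa:", describe(b))
```

Because the exponent field is as wide as in FP32, the format keeps the same dynamic range and gives up only precision, which is why it is a common choice for deep learning workloads.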

“This workload requires high computing performance, so we focused on adding a large number of vector and matrix modules and parallel computing that are adapted and optimized for this workload,” Rauch said.

Ponte Vecchio will be the first GPU of the new generation. It implements several new technologies that Intel has been developing in recent years:

  • 7nm manufacturing process;
  • Foveros 3D multi-layer chip packaging;
  • Embedded Multi-Die Interconnect Bridge (EMIB) for joining multiple dies on a single substrate;
  • Xe Link, built on the new CXL interconnect standard (based on PCI Express 5.0), which gives access to the GPUs through a single memory space.


Layered Foveros 3D Integrated Circuits from Intel's December 2018 Presentation

The chip's technical specifications have not yet been announced. Reportedly, these GPUs will have thousands of execution units connected to memory and cache via XEMF (Xe Memory Fabric).

The XEMF fabric works with a special ultra-fast cache, called Rambo Cache, to eliminate the bottleneck in memory access. This cache connects to the compute units via Foveros, while EMIB will be used to connect HBM memory.

The combination of the SIMT and SIMD approaches, characteristic of GPUs and CPUs respectively, together with variable-length vector instructions, should provide a significant performance boost for some classes of problems.

Many expect Intel to compete with Nvidia and AMD in the data center and AI markets. This means not only price competition but also the emergence of alternative technology platforms, which will spur overall technological progress.

oneAPI: the top of the abstraction for heterogeneous hardware

In addition to announcing new hardware, Intel has released a beta version of the unified oneAPI programming interface. It is designed to make life easier for developers who, to get the most out of their programs, have traditionally had to switch between different programming languages and libraries using middleware and frameworks.

By default, the industry accepts that, at a low level, different code must be written for each architecture. For example, TensorFlow was initially optimized at release entirely for a single vendor's GPUs (Nvidia, via CUDA).

“oneAPI is trying to solve these problems by offering a common low-level interface for heterogeneous hardware with uncompromising performance,” said Bill Savage, vice president of Intel's architecture, graphics and software division. “So that developers can write programs directly to the hardware through languages and libraries common to different architectures and vendors, as well as make sure that middleware and frameworks work on oneAPI and are fully optimized for developers who are at the top of this abstraction.”

Intel touts oneAPI as an “open standard for community and industry support,” which will make it possible to “reuse code across architectures and hardware from different manufacturers.”

The oneAPI specification will include the standard cross-architecture DPC++ programming language, based on C++ and SYCL, as well as “powerful APIs to accelerate key domain-specific functions.”

In addition to the DPC++ compiler and the API libraries, special tools will be released, including VTune, Inspector, Advisor, a debugger, and a “compatibility tool” for porting CUDA (Nvidia) code to DPC++.

To encourage the transition to oneAPI, Intel has launched a sandbox in DevCloud for developing and testing programs on a range of CPUs, GPUs and FPGAs. Working with the sandbox does not require installing any hardware or software.

Meanwhile, Nvidia's quarterly revenue rose to $3 billion, and its data center revenue grew 11% over the three months, to $726 million. Sales of the V100 and T4 processors are breaking all records. Intel is watching from the sidelines for now, but we already know what its answer will be. The most interesting part is just beginning.
