202 trillion digits
The StorageReview Lab team has once again broken the world record for calculating Pi, this time reaching 202,112,290,000,000 digits. The previous record of 105 trillion digits also belongs to the team.
An unprecedented computational feat
To achieve this goal, the StorageReview Lab team built a high-end system around dual Intel Xeon 8592+ processors and Solidigm P5336 61.44TB NVMe SSDs. The team ran the computation nearly continuously for 85 days, consuming close to 1.5 petabytes of storage across 28 Solidigm SSDs.
When the StorageReview Lab team set its previous record in March 2024, it used a dual-socket AMD EPYC system with 256 cores and nearly a petabyte of Solidigm QLC SSDs. The team overcame significant technical challenges, including memory and storage limitations. That achievement demonstrated the capabilities of modern hardware and provided valuable insights into optimizing high-performance computing systems.
Computer Science and Mathematics Lesson
When we first started looking for interesting ways to benchmark high-capacity SSDs, an obvious answer came out of our CPU and system reviews: y-cruncher. When using swap space for extended computation, the ratio of storage space to digits is about 4.7:1, so 100 trillion digits require roughly 470 TB of space. Without going too deep into the math and computer science, the Chudnovsky algorithm used by y-cruncher is based on a rapidly converging series derived from the theory of modular functions and elliptic curves. The algorithm is built on the following infinite series:
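\[
\frac{1}{\pi} = 12 \sum_{k=0}^{\infty} \frac{(-1)^k \,(6k)!\,(13591409 + 545140134k)}{(3k)!\,(k!)^3\,640320^{3k+3/2}}
\]

Each additional term of this series yields roughly 14 more correct decimal digits, which is what makes the formula practical at this scale.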
The main question that came up about our 100 and 105 trillion digit calculations was: “Okay, no big deal. Why does it take so long and use so much memory?” That question usually arrived alongside other concerns about open source and Alex Yee's programming ability. Let's take a step back and look at it from a systems perspective.
Calculating pi to a large number of digits, such as 100 trillion, requires a significant amount of space because of the sheer number of arithmetic operations involved. The core problem is multiplying very large numbers, which inherently requires a lot of memory. The best algorithms for multiplying N-digit numbers need roughly 4N bytes of memory, most of it used for data storage. That memory must be accessed many times during the computation, which turns the process into a disk-I/O-bound task rather than a CPU-bound one.
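As a rough illustration of those two rules of thumb (the ~4N-byte multiplication working set and the ~4.7 bytes-per-digit swap ratio mentioned above; actual requirements vary with the implementation and configuration), a quick back-of-the-envelope calculator might look like this:

```python
def pi_run_estimate(digits: int) -> dict:
    """Back-of-the-envelope sizing for a large pi computation.

    Uses two rules of thumb from this article (real requirements vary):
      * multiplying N-digit numbers needs roughly 4*N bytes of working memory
      * y-cruncher swap mode needs roughly 4.7 bytes of storage per digit
    """
    TB = 10**12  # terabyte (decimal)
    return {
        "multiply_working_set_TB": 4 * digits / TB,
        "swap_storage_TB": 4.7 * digits / TB,
    }

if __name__ == "__main__":
    for digits in (100_000_000_000_000, 105_000_000_000_000, 202_112_290_000_000):
        est = pi_run_estimate(digits)
        print(f"{digits / 1e12:,.1f} trillion digits -> "
              f"~{est['swap_storage_TB']:,.0f} TB swap, "
              f"~{est['multiply_working_set_TB']:,.0f} TB multiply working set")
```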
The Chudnovsky formula, widely used to compute large numbers of digits of pi, involves an enormous number of arithmetic operations. Its multiplications, divisions, and squarings all ultimately reduce to very large multiplications. Historically, supercomputers used AGM (arithmetic-geometric mean) algorithms, which, although slower, were simpler to implement and easier to spread across multiple machines. Recent advances, however, have shifted the bottleneck from computational power to memory access speed.
The processor’s arithmetic logic units (ALUs) and floating-point units (FPUs) handle these large-number multiplications much like long multiplication on paper: by breaking them down into smaller, more manageable operations. Pi calculations used to be limited by raw processing power, but today compute has outpaced memory access speed, making storage performance and reliability the critical factors in setting pi records. For example, the performance difference between our 128-core Intel machine and the 256-core AMD Bergamo was negligible; what mattered was disk I/O efficiency.
Solidigm SSDs are critical to these calculations not because of their raw speed, but because of their exceptional storage density. Consumer NVMe drives pack up to 4TB into a small footprint, while enterprise SSDs combine those same NAND packages for far greater capacity. QLC NAND may be slower than other types of flash, but the parallelism across these dense SSDs delivers high aggregate throughput, making them ideal for large-scale pi calculations.
Solidigm QLC NVMe SSDs Make Madness Possible
If you're still awake, let's move on. All you really need to know is that when the numbers being calculated are too large to fit in memory, computers must fall back on software algorithms for multiple-precision arithmetic. These algorithms break the huge numbers into manageable chunks and perform the arithmetic on those pieces using specialized techniques. This is where the 61.44TB Solidigm P5336 NVMe SSDs come in: y-cruncher takes those manageable chunks, accumulates them in system memory first, and then spills them out to swap space on the working disks.
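To make the idea concrete, here is a purely illustrative sketch (not how y-cruncher is implemented) of splitting a huge integer into fixed-size chunks, or limbs, and multiplying them the way you would on paper. In a real run the limb arrays are far too large for RAM and are streamed to and from the swap SSDs, and FFT-based multiplication replaces the schoolbook method:

```python
BASE_BITS = 64                # each "limb" is a 64-bit chunk of the big number
BASE = 1 << BASE_BITS

def to_limbs(n: int) -> list[int]:
    """Split an arbitrarily large integer into 64-bit limbs (least significant first)."""
    limbs = []
    while n:
        limbs.append(n & (BASE - 1))
        n >>= BASE_BITS
    return limbs or [0]

def from_limbs(limbs: list[int]) -> int:
    """Reassemble the integer from its limbs."""
    n = 0
    for limb in reversed(limbs):
        n = (n << BASE_BITS) | limb
    return n

def schoolbook_multiply(a_limbs, b_limbs):
    """Multiply two limb arrays one chunk at a time, carrying as you go.

    Real pi computations use FFT-based multiplication instead, and the limbs
    live on disk rather than in a Python list.
    """
    result = [0] * (len(a_limbs) + len(b_limbs))
    for i, a in enumerate(a_limbs):
        carry = 0
        for j, b in enumerate(b_limbs):
            total = result[i + j] + a * b + carry
            result[i + j] = total & (BASE - 1)
            carry = total >> BASE_BITS
        result[i + len(b_limbs)] += carry
    return result

# Tiny sanity check
x, y = 123456789123456789123456789, 987654321987654321
assert from_limbs(schoolbook_multiply(to_limbs(x), to_limbs(y))) == x * y
```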
Remember, we need that roughly 4.7:1 ratio of swap space to digits, since every part of that intimidating formula has to be represented by many, many bits.
While you could throw in piles of hard drives or object storage, raw capacity is only one part of a very complex equation, as we discovered in our first run. Getting storage that is both large enough and fast enough close to the compute is a recurring theme we’ve been discussing at StorageReview lately as AI advances. Swap performance is the biggest bottleneck in this computation. Direct-attached NVMe delivers the best performance, and while some options offer higher per-device throughput, our large and very dense QLC array collectively did a great job.
y-cruncher has a built-in benchmark that lets you pull all the levers and turn all the knobs to find the optimal settings for your disk array, which is very important. The screenshot above shows the benchmark's feedback for this consumer system, reporting both CPU speed and SSD performance.
We have extensive documentation on this, but to summarize: after several weeks of testing, we found the best option was simply to let y-cruncher interact with the disks directly.
We tested network targets, disks behind a SAS RAID card, NVMe RAID cards, and iSCSI targets. Handing control of the hardware directly to y-cruncher made a night-and-day difference in performance. iSCSI also proved acceptable, but we only tested it for the output file, which can use “Direct IO” for that interaction. The swap RAID code is clearly well thought out; from our testing and conversations with the developer, we can conclude that it works with the disks at a low level.
Solidigm’s 61.44TB drives are the best answer to many of the problems in this space. Running the test on our system, we saw the drives perform within spec for both reads and writes. We specifically chose Intel processors to get as close as possible to the optimal 2:1 drive-to-compute ratio, which keeps the CPU from wasting time waiting on the drives. As drive technology gets faster, we can attempt larger, faster runs by choosing processors with more cores.
Dell PowerEdge R760 Custom Server
As they say, third time's a charm. This isn’t our first pi record, and we learned from our first two iterations how to build a better platform for the calculation. Our first build used a 2U server with 16 NVMe bays and three internal SSDs. The 30.72TB Solidigm P5316 SSDs held the y-cruncher swap storage, but we had to use an HDD-based storage server for the output file. This was less than optimal, especially at the end of the write phase.
Our second platform used the same server with an external NVMe JBOF attached, giving us extra NVMe bays but at the cost of fragile cabling and unbalanced performance. The downside of both platforms was the need to rely on external hardware for the entire duration of the y-cruncher run, which consumed additional power and introduced more points of failure.
For this third run, we wanted a single server with NVMe Direct Drives and enough room for both the y-cruncher swap storage and the output storage under one lid. Enter the Dell PowerEdge R760 with a 24-bay NVMe Direct Drives backplane. This platform uses an internal PCIe switch that lets all the NVMe drives talk to the server simultaneously, eliminating the need for additional hardware or RAID devices.
We then assembled a PCIe riser configuration from several R760s in our lab, giving us four PCIe slots in the rear for additional U.2 NVMe SSDs. As a bonus, we pulled the larger heatsinks from another R760, giving us the maximum possible turbo headroom. Direct liquid cooling arrived in our lab about a month too late to be implemented for this run.
While it's technically possible to order this exact configuration from Dell, we didn't have one on hand, so we pieced it together ourselves.
Power supply sizing was also critical for this run. While most would assume the CPUs draw most of the power, putting 28 NVMe SSDs under one roof has a significant impact on consumption. Our build used 2400W PSUs, which turned out to be barely adequate. We had several near-critical moments where we would have run out of power had the system lost one PSU. This happened very early in the run, with power consumption spiking as CPU load peaked and the system ramped up I/O across all the SSDs. If we were to do this again, we would go with the 2800W models.
Performance characteristics
Technical specifications
Total digits calculated: 202,112,290,000,000
Hardware: Dell PowerEdge R760 with 2x Intel Xeon 8592+, 1TB DDR5 DRAM, 28x Solidigm 61.44TB P5336
Software and algorithm: y-cruncher v0.8.3.9532-d2, Chudnovsky formula
Data written: 3.76 PB per drive, 82.7 PB across the 22-drive swap array
Calculation duration: 100.673 days
y-cruncher telemetry
Largest logical checkpoint: 305,175,690,291,376 bytes (278 TiB)
Peak logical disk usage: 1,053,227,481,637,440 bytes (958 TiB)
Logical disk bytes read: 102,614,191,450,271,272 (91.1 PiB)
Logical disk bytes written: 88,784,496,475,376,328 (78.9 PiB)
Start date: Tuesday, February 6, 2024, 16:09:07
End date: Monday, May 20, 2024, 05:43:16
Pi computation: 7,272,017.696 seconds (84.167 days)
Total computation time: 8,698,188.428 seconds (100.673 days)
Start-to-finish wall time: 8,944,449.095 seconds (103.524 days)
The last known digit of pi, at position 202,112,290,000,000 (two hundred two trillion, one hundred twelve billion, two hundred ninety million), is 2.
Long-term implications
While calculating pi to such a large number of digits may seem like an abstract task, the practical applications and methods developed during this project have far-reaching implications. These advances could improve a variety of computing tasks, from cryptography to complex simulations in physics and engineering.
This calculation of pi to 202 trillion digits highlights significant advances in storage density and total cost of ownership (TCO). Our system packed an astounding 1.72 petabytes of NVMe SSD storage (28 x 61.44TB drives) into a single 2U chassis. This density represents a leap forward in storage capability, especially considering that total power consumption peaked at just 2.4 kW under full CPU and storage load.
This energy efficiency stands in contrast to traditional record-breaking HPC systems, which consume significantly more power and generate excess heat. Power consumption climbs quickly once you factor in the additional nodes a scale-out storage system needs to expand shared, lower-capacity storage versus local high-density storage.
Heat management is critical, especially for smaller data centers and server rooms. Cooling a traditional record-setting HPC system is no easy task, often requiring data center chillers that consume more power than the computing equipment itself. By minimizing power consumption and heat output, our setup offers a more sustainable and manageable solution for smaller operations. As a bonus, we ran most of the computation on fresh-air cooling.
To put this in perspective, consider the challenges facing anyone running shared network storage and unoptimized platforms. Those setups require one or more data center chillers to hold the desired temperature, and in those environments every watt saved means less cooling required and lower operating costs. Another major benefit of running a lean, efficient platform for a record attempt is being able to protect the entire setup with battery backup equipment.
Overall, this record-breaking achievement demonstrates the potential of modern HPC technologies and highlights the importance of energy efficiency and thermal management in today's computing environments.
Ensuring Accuracy: The Bailey-Borwein-Plouffe Formula
Calculating pi to 202 trillion digits is a monumental task, but ensuring those digits are accurate is equally important. That’s where the Bailey-Borwein-Plouffe (BBP) formula comes into play.
The BBP formula lets us compute hexadecimal digits of pi at an arbitrary position without calculating any of the digits that come before them. This is especially useful for cross-checking sections of our massive calculation.
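For reference, the standard form of the formula is:

\[
\pi = \sum_{k=0}^{\infty} \frac{1}{16^k}\left(\frac{4}{8k+1} - \frac{2}{8k+4} - \frac{1}{8k+5} - \frac{1}{8k+6}\right)
\]

Because every term carries a factor of 1/16^k, multiplying the series by 16^n and keeping only the fractional part isolates the hexadecimal digits beginning at position n+1, which is what makes single-digit extraction possible.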
Here's a simplified explanation:
Hexadecimal output. We first generate the digits of pi in hexadecimal during the main calculation. The BBP formula can then compute any arbitrary single hexadecimal digit of pi directly. You can do this with other programs such as GPUPI, but y-cruncher also has a built-in function for it. If you prefer an open-source approach, the formulas are well known (a small sketch follows below).
Cross-check. We then independently compute specific positions of the hexadecimal digits of pi using the BBP formula and compare them against our main calculation. If they match, it is a strong indication that our entire sequence is correct. We performed this cross-check more than six times; two of them are shown here.
For example, if our main calculation produces the same hexadecimal digits as the BBP formula at various positions, we can be confident the output is accurate. This method is not just theoretical; it has been applied in practice to every significant pi calculation, guaranteeing the reliability and validity of the results.
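For readers who want to see the open-source approach in action, here is a minimal, unoptimized sketch of the classic BBP digit-extraction algorithm (an illustration of the technique, not the code used for this record). It computes a few hexadecimal digits of pi starting at an arbitrary position using modular exponentiation, so none of the preceding digits are needed:

```python
def bbp_series(j: int, n: int) -> float:
    """Fractional part of sum_k 16^(n-k) / (8k + j)."""
    # Terms with k <= n: modular exponentiation keeps only the fractional part.
    s = 0.0
    for k in range(n + 1):
        s = (s + pow(16, n - k, 8 * k + j) / (8 * k + j)) % 1.0
    # Terms with k > n shrink geometrically; a handful suffice for double precision.
    t, k = 0.0, n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        t += term
        k += 1
    return (s + t) % 1.0


def pi_hex_digits(position: int, count: int = 6) -> str:
    """Return `count` hex digits of pi starting at `position` (1 = first digit after the point)."""
    n = position - 1
    x = (4 * bbp_series(1, n) - 2 * bbp_series(4, n)
         - bbp_series(5, n) - bbp_series(6, n)) % 1.0
    digits = ""
    for _ in range(count):
        x *= 16
        d = int(x)
        digits += "0123456789ABCDEF"[d]
        x -= d
    return digits


# pi = 3.243F6A8885A308D3... in hexadecimal
print(pi_hex_digits(1))   # expected: 243F6A
print(pi_hex_digits(7))   # expected: 8885A3
```

Double-precision round-off limits how far out this simple version remains trustworthy; verifying a 202-trillion-digit run relies on far more careful implementations, but the principle is the same: the hexadecimal digits at the very end of the run can be reproduced independently and compared.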
R = official result of the record run, V = verification result
Astute readers will notice that the checks in the screenshots and in the comparison above are slightly offset. To make sure the results matched, we also checked a few other positions (such as the 100 trillion and 105 trillion digit marks), although this was not strictly necessary, since any error would have shown up in the hexadecimal digits at the end. Although it is theoretically possible to calculate any decimal digit of pi using a similar method, it is unclear whether it would remain accurate beyond 100 million digits, or whether it would be computationally practical at all.
By integrating this mathematical cross-check into the process, we can guarantee the integrity of our record-breaking calculation of pi to 202 trillion digits, demonstrating both computational precision and a rigorous scientific approach.
Thank you for your attention!