5.5 millimeters in 1.25 nanoseconds

It was very interesting and my work was appreciated, so a few months before the console's release I received a gift from the project's management – a whole silicon wafer of Xbox 360 processors! This 30 cm wafer has been hanging in my home office ever since. At its widest it holds 23 full CPUs horizontally and 20 vertically, but because the wafer is round there are far fewer than 460 CPUs in total. Many of the dies near the edge are partially or completely cut off – I counted 356 full CPUs on the wafer.

The wafer looks cool, it creates beautiful rainbow diffraction patterns, and the individual chips are large enough to show the key features of a CPU that I spent five years working with.

The Xbox 360 processor had three PowerPC cores and a 1 MB L2 cache, and these features are clearly visible on the wafer.

In the image of the die shown above (which is approximately 14 x 12 mm), a repeating pattern of small black rectangles is visible in the lower right corner – this is the L2 cache. To the left of it is one of the CPU cores. Above that, flipped vertically, is another CPU core, and to the right of that one, flipped horizontally, is the third core. I once knew what all the visible features of the cores were (L1 caches? register files? arithmetic units?), but I have long since forgotten. They are still quite distinct, though.

In addition to the three CPU cores and the L2 cache, there is a horizontal stripe running across the middle of the chip, and a bit of extra area to the right of the top-right core. I think the horizontal stripe is the bus that connects the cores to each other and to the L2 cache, and the area in the upper right corner is presumably off-chip I/O.

The image above is a phone photo of the wafer hanging on my wall. I may be biased (or traumatized by working with this ornery CPU), but I think being able to see all the features of a chip is absolutely wonderful.

All this is of course very cool, and you would probably like to hang something similar on your wall, but the story is not over yet. In those days, I wrote a lot of benchmarks to find places where the CPU's performance did not match what we expected. I once wrote a benchmark for our custom memcpy routine and found that prefetch instructions that missed in the TLB (Translation Lookaside Buffer, the page table cache) were discarded. Since we normally prefetched about 1 KiB ahead (eight cache lines of 128 bytes), this meant that when copying large amounts of memory (where each 4 KiB page results in a TLB miss), approximately 25% of prefetches would be discarded. Those cache lines were then read serially, instead of eight at a time in parallel, which made large memory copies take nearly three times as long. These discarded prefetches were especially bad because the Xbox 360 CPU was an in-order processor, so it couldn't do any other work while waiting for the non-prefetched reads. This led me to rewrite the heap to use 64 KiB pages, which avoided most TLB misses and restored the expected performance.
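The 25% figure follows from simple arithmetic on the numbers above. Here is a minimal sketch in Python, assuming each prefetch targets the address a fixed 1 KiB ahead of the current cache line, and that a prefetch is dropped exactly when its target lies on the next page, which the copy has not touched yet. The function name and this simplified model are my own illustration, not the original memcpy routine.

```python
CACHE_LINE = 128                   # bytes per cache line, from the text
PREFETCH_AHEAD = 8 * CACHE_LINE    # 1 KiB lookahead, from the text

def dropped_prefetch_fraction(page_size):
    """Fraction of prefetches whose target falls on the next page.

    Simplified model (an assumption): the next page is never in the TLB
    until the copy itself reaches it, so every such prefetch is discarded.
    """
    lines_per_page = page_size // CACHE_LINE
    dropped = sum(
        1
        for line in range(lines_per_page)
        if line * CACHE_LINE + PREFETCH_AHEAD >= page_size
    )
    return dropped / lines_per_page

print(dropped_prefetch_fraction(4 * 1024))    # 4 KiB pages
print(dropped_prefetch_fraction(64 * 1024))   # 64 KiB pages
```

Under this model the last eight cache lines of every page issue prefetches that land on the next page. That is 8 of 32 lines for 4 KiB pages, but only 8 of 512 for 64 KiB pages, which is consistent with the larger pages avoiding most of the problem.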

Self-portrait taken in the reflection of the wafer

So, I wrote a lot of benchmarks.

One benchmark I wrote measured L2 cache latency. It worked like this: build a linked list of pointers, have the code chase through the pointers, and count how many cycles it takes to traverse the whole list. With suitable construction, it is easy to create a list that stays resident in L1, or that always requires a trip to L2, or that always requires a trip to main memory. Standard stuff.
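The idea behind such a pointer-chasing benchmark can be sketched as follows. This is an illustration in Python under my own assumptions, not the original benchmark (whose details are not given here): the function names are hypothetical, the "pointers" are list indices, and the interpreter's overhead swamps real cache latency, so an actual measurement would need compiled code and a cycle counter. What the sketch does show is the construction: a single random cycle through every entry, traversed with each load depending on the previous one, with the number of entries controlling which level of the memory hierarchy the working set fits in.

```python
import random
import time

def build_chain(num_entries, seed=0):
    """Return a list where chain[i] is the index of the next entry.

    Sattolo's algorithm produces a single cycle that visits every entry,
    in a random order that simple hardware prefetchers cannot predict.
    """
    rng = random.Random(seed)
    chain = list(range(num_entries))
    for i in range(num_entries - 1, 0, -1):
        j = rng.randrange(i)          # j < i guarantees one big cycle
        chain[i], chain[j] = chain[j], chain[i]
    return chain

def time_per_hop(chain, hops):
    """Follow the chain for `hops` dependent steps; return seconds per hop."""
    index = 0
    start = time.perf_counter()
    for _ in range(hops):
        index = chain[index]          # each load depends on the previous one
    elapsed = time.perf_counter() - start
    return elapsed / hops

if __name__ == "__main__":
    # Larger chains spill out of the smaller caches.
    for entries in (1 << 10, 1 << 14, 1 << 18):
        chain = build_chain(entries)
        print(entries, time_per_hop(chain, 200_000))
```

The single-cycle property matters: an ordinary shuffle can produce several short cycles, in which case the traversal would touch only part of the list and the working set would be smaller than intended.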

I don’t remember what the L2 cache latency was, but I do remember that it varied depending on which CPU core I ran the test on. The L2 latency from CPU core 0 was four cycles lower than the L2 latency from CPU core 1 or 2, and the measurement was quite precise.

CPU die with L2-cache to core arrows shown


As mentioned above, all communication between the CPU cores and the L2 cache went through the horizontal stripe in the middle of the chip. This means that all L2 traffic came out of the top of the L2 cache and then traveled to the three different CPU cores. I zoomed in on the previous die photo and added arrows showing the flow of information from the L2 cache to the three cores. The critical thing here is that the path to the upper right core is rather short, while the path to the two cores on the left is noticeably longer – the signals have to travel horizontally along the stripe.

The horizontal black lines on the left edge are the millimeter marks on a ruler that I held against the wafer when taking the photo. I rotated this image of the ruler and used it to measure the length of the horizontal red line, which came out to 5.5 mm. Since the Xbox 360 CPU runs at 3.2 GHz, and CPU cores 1 and 2 have an extra four clock cycles of L2 latency, you can calculate that it takes 1.25 nanoseconds for signals to propagate over that 5.5 mm distance!
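The arithmetic behind the 1.25 nanosecond figure is short enough to spell out. The constant names below are just labels I chose for the numbers given in the text.

```python
CLOCK_HZ = 3.2e9      # Xbox 360 CPU clock frequency, from the text
EXTRA_CYCLES = 4      # extra L2 latency seen from cores 1 and 2

# One cycle lasts 1 / CLOCK_HZ seconds; convert the total to nanoseconds.
extra_delay_ns = EXTRA_CYCLES / CLOCK_HZ * 1e9
print(f"{extra_delay_ns:.2f} ns")
```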

Light can travel about 30 cm in one nanosecond, so you might wonder why this signal moves so slowly. I can try to answer this question, but I must warn you that I am a software engineer with no formal background in hardware design, so I could be wrong.

One reason is simply that electrical signals in wires, especially very thin wires, do not travel at the speed of light. Another reason is that the signal does not move continuously. It travels a short distance, is then latched by some sort of gate, and travels a little farther on the next clock cycle, so the poor signal never gets a chance to reach full speed. Third, the Xbox 360 CPU was designed to run at a higher clock speed, but shipped at 3.2 GHz because otherwise it would have melted. Taking four cycles to cover 5.5 mm is probably an artifact of a design in which the cycles were meant to be much shorter.

Still, 5.5 mm in 1.25 nanoseconds is about 4400 kilometers per second, which is not exactly slow. Most of all, though, I just love the geeky joy of being able to see this 1.25 nanosecond delay with my own eyes, hanging on my wall. Thanks to the Xbox team leadership!
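For comparison with the speed of light, the same numbers can be run through a quick calculation. The variable names are mine; the distance and delay come from the text, and the speed of light is the standard defined value.

```python
DISTANCE_M = 5.5e-3                 # 5.5 mm, measured on the die photo
DELAY_S = 1.25e-9                   # four cycles at 3.2 GHz
SPEED_OF_LIGHT_M_S = 299_792_458    # defined value, in meters per second

speed_m_s = DISTANCE_M / DELAY_S
speed_km_s = speed_m_s / 1000
fraction_of_c = speed_m_s / SPEED_OF_LIGHT_M_S

print(f"{speed_km_s:.0f} km/s")
print(f"{fraction_of_c:.1%} of the speed of light")
```

So the effective speed along the stripe works out to roughly one and a half percent of the speed of light, counting the cycle boundaries the signal has to wait for along the way.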

Digression

The above story of how I became an Xbox 360 CPU specialist is reasonably accurate. I was never assigned to become a CPU specialist; I was just given access to the materials, and I studied them with rather fanatical tenacity. I remember lying on my living room floor during a power outage caused by snow, reading by lantern light until I understood all the ins and outs of the CPU pipelines. The same learning pattern repeated with every area of expertise I worked on: crash analysis, floating-point numbers, compiler bugs, CPU bugs. It works for me.
