10 billion integers included in the array

How the 64-bit Pharo Smalltalk experiment surprised me.

Storing 10 Billion SmallInteger Instances in an Array in Pharo Smalltalk 13.0

Storage of 10 billion copies SmallInteger in an array in Pharo Smalltalk 13.0

A 32-bit party that doesn't want to end

I haven't programmed in Smalltalk professionally since Y2K, and in those days I only encountered 32-bit versions of Smalltalk. From time to time I try out different things in the Pharo dialect of Smalltalk, which has had a 64-bit version for several years. I've been waiting for a long time for an open source version of Smalltalk with support for a 64-bit address space to appear. I've been using 64-bit versions of Java for the past 18 years.

I’ve been wondering for a long time when Java will make built-in support for large arrays, namely arrays with a length greater than 2³¹-1. The Java Collections Framework and all frameworks that implement built-in collections interfaces such as Collection, List, Set, Map don't support size larger int. The maximum number of values ​​that can be stored in a Java array is 2³¹-1. For all uses and purposes, this creates a limitation that allows us to store just over 2 billion objects or built-in types in an array in Java. There are third party libraries that provide BigArrayssuch as fastutiland also we can simulate large arrays ourselves by using multi-dimensional arrays in Java, but it becomes difficult to manage. It is better to use a time-tested third-party library than to stumble upon difficult-to-test and diagnose bugs.

I was wondering if 64-bit Pharo Smalltalk could store more than 2³¹-1 elements in an array. I know one way to check this. I will need a lot of RAM for this. As luck would have it, I bought a new Mac Book Pro M2 Max last year with 96GB of RAM so I could run experiments and tests with high memory usage.

The new MacBook Pro M3 Max supports up to 128GB of memory. This is a significant jump from last year when I bought a MacBook Pro M2 Max with 96GB RAM. 128GB is 2x the size of my ten year old Mac Pro desktop bucket and 8x the size of a ten year old MacBook Pro. The 2019 Mac Pro supports up to 1.5TB of RAM. The current configuration of the Apple Silicon Mac Pro (2023) offers a configuration of up to 192GB, which is 3 times the size of my 10 year old Mac Pro. I predict that in 10 years we will see 256+ GB of memory in high-end consumer laptops and more than a terabyte on desktops.

Note: Server machines may already have terabytes of memory.

Who needs more than 2 billion elements in an array?

I've never needed to store more than a hundred million items in List over the past 20 years. This is still quite a large number of elements and was necessary for my tasks in 2006. I'm sure there are people who solve problems that require storing huge amounts of data in memory.

There are currently approximately 8.2 billion people on earth. To store references to each person in memory would require 4 Java arrays. Storing 8.2 billion simple objects Person in memory would be very expensive. Under a simple object Person I mean class Person with last name, first name and patronymic in the form String. The array itself would be over 65GB in size (~8.2 billion x 8 bytes per link). Instance costs Person would require significantly more memory than I have available on the MacBook Pro with 96GB of memory. Let's assume approximately 8.2 billion by 32 bytes for instances Personwhich would be ~262GB. In total, we would need 327GB of memory to fit all the people, including their last names, first names and patronymics in memory. We could create a pool of names like Stringwhich would have roughly several hundred million occurrences, so we'd need about 32 GB, maybe more, to store the instances String. At the moment this is not entirely available on regular consumer hardware. This would be possible on high-end servers with terabytes of memory.

This got me thinking. What if we started with an object smaller than Person and instead used e.g. SmallInteger in Pharo Smalltalk. I started experimenting with creating arrays larger than 2³¹-1 in Pharo. As I experiment, I learn that Pharo Smalltalk has significant optimizations for SmallInteger. Instead of storing references to objects SmallIntegerthe values ​​themselves are inlined SmallInteger. It felt like what was promised to us in Project Valhalla value types from the Java world. I figured this out after doing a little digging and experimenting with a simple method sizeInMemory. I didn't immediately understand what was happening when the reported size of the instances SmallInteger was equal to zero. This made me understand that SmallInteger processed in the language in some special way. I was also surprised that SmallInteger exceeded the maximum value int in Java.

Transcript show: SmallInteger maxVal.

// Prints - 1152921504606846975
// SmallInteger  -> 1,152,921,504,606,846,975
// Max Java long -> 9,223,372,036,854,775,807

This meaning is more like long in Java. Meaning Long.MAX_VALUE in Java equals 9,223,372,036,854,775,807.

In this article talks about the maximum value SmallInteger in Smalltalk, and also about how the values ​​themselves are stored instead of references. SmallInteger uses 61 bit instead 64.

The biggest difference between SmallInteger in Smalltalk and long in Java this is what happens when one is added to the maximum value.

In Smalltalk we get LargePositiveInteger. Dynamic typing comes to the rescue.

Transcript cr; show: SmallInteger maxVal.
Transcript cr; show: SmallInteger maxVal class.
Transcript cr; show: SmallInteger maxVal + 1. 
Transcript cr; show: (SmallInteger maxVal + 1) class.

// Prints
// 1152921504606846975
// SmallInteger
// 1152921504606846976
// LargePositiveInteger

When we add 1 to maximum value SmallIntegerSmalltalk dynamically creates instances LargePositiveInteger. This is the advantage of a dynamic language in which everything is an object.

In Java we get a silent but potentially fatal overflow.

System.out.println(BigInteger.valueOf(Long.MAX_VALUE));
System.out.println(BigInteger.valueOf(Long.MAX_VALUE + 1));

// Prints
// 9223372036854775807
// -9223372036854775808

Adding 1 to the maximum value of long causes an overflow and the result becomes negative. We can't dynamically change the type from long to some other in Java. What happened long remains longwhat happened short – also remains short. This is one of those times where static typing draws the short straw. We've learned how to find workarounds for this in Java.

Let's wrap this up and move on to our experiments.

Experiments

I couldn't settle on an implementation in Smalltalk without trying to implement it in Java. Pharo Smalltalk gave me all the tools I needed. I used a combination of libraries fastutil And Eclipse Collections to repeat the experiment in Java. One of the good things about Java is that it has a rich ecosystem where members have created many solutions for most of the problems you might encounter.

Pharo Smalltalk version

I started with 5 billion copies SmallInteger V Array in Pharo. After this worked, I increased the total to 8.2 billion. Everything still worked. Then I tried 10 billion. Everything still worked. I was very surprised. I didn't think I had enough memory to store 10 billion instances. This is because I didn't understand at the time how Smalltalk handles “instances” SmallInteger.

Below is the source code for the 10 billion experiment. You will need 96 GB of memory, of which about 85 must be free to run this code. You can reduce the value to 5 billion if you have 64 GB of memory.

|array sum|

array := (1 to: 10000000000) collect: [ :each | each * 2 ].
sum := array sum.

Transcript cr; show: array size class.
Transcript cr; show: array size.
Transcript cr; show: sum.
Transcript cr; show: sum class.

(1 to: 10000000000 by: 1000000000) do: [ :index | Transcript cr; 
 show: 'index: '; 
 show: index; 
 show: ' value: '; 
 show: (array at: index);
 show: ' ';
 show: (array at: index) class ].

Transcript cr; show: 'Array memory (bytes) '; 
  show: array sizeInMemory. 
Transcript cr; show: 'Elements memory (bytes) '; 
  show: (array sumNumbers: #sizeInMemory). 

The code result is below:

SmallInteger
10000000000
100000000010000000000
LargePositiveInteger
index: 1 value: 2 SmallInteger
index: 1000000001 value: 2000000002 SmallInteger
index: 2000000001 value: 4000000002 SmallInteger
index: 3000000001 value: 6000000002 SmallInteger
index: 4000000001 value: 8000000002 SmallInteger
index: 5000000001 value: 10000000002 SmallInteger
index: 6000000001 value: 12000000002 SmallInteger
index: 7000000001 value: 14000000002 SmallInteger
index: 8000000001 value: 16000000002 SmallInteger
index: 9000000001 value: 18000000002 SmallInteger
Array memory (bytes) 80000000016
Elements memory (bytes) 0

These results show that SmallInteger there are no additional costs for the copies themselves. Their values ​​are inlined in Array. So we only need memory for the Array itself, which is about 80 GB.

Java version with fastutil

Below is the source code for the 10 billion experiment in Java. You will also need 96 GB of memory, of which 85 are free. I added the -Xmx85g argument on the command line. You can also lower the number to 5 billion if you have 64GB of RAM. For a large list long fastutil was used. For summation BigInteger – Eclipse Collections. Why did you need to use BigInteger you will see further.

First of all I added the fastutil library as a Maven dependency.

<dependency>
    <groupId>it.unimi.dsi</groupId>
    <artifactId>fastutil</artifactId>
    <version>8.5.14</version>
</dependency>

I then wrote a test using LongBigArrayBigList to store 10 billion longs. This is roughly equivalent to an array of 10 billion elements SmallInteger in Smalltalk.

@Test
public void fastutilTenBillion()
{
    LongBigArrayBigList longs = new LongBigArrayBigList(10_000_000_000L);
    LongStream.rangeClosed(1, 10_000_000_000L)
            .forEach(l -> longs.add(l * 2));

    long sum = longs.longStream().sum();
    BigInteger bigSum = longs.longStream()
            .boxed()
            .collect(Collectors2.summingBigInteger(BigInteger::valueOf));

    System.out.println("size: " + longs.size64());
    System.out.println("long sum: " + sum);
    System.out.println("BigInteger sum: " + bigSum);

    for (long l = 0; l < 10_000_000_000L; l += 1_000_000_000L)
    {
        System.out.println("index: " + l + " value: " + longs.getLong(l));
    }
}

The results are as follows:

size: 10000000000
long sum: 7766279641452241920
BigInteger sum: 100000000010000000000
index: 0 value: 2
index: 1000000000 value: 2000000002
index: 2000000000 value: 4000000002
index: 3000000000 value: 6000000002
index: 4000000000 value: 8000000002
index: 5000000000 value: 10000000002
index: 6000000000 value: 12000000002
index: 7000000000 value: 14000000002
index: 8000000000 value: 16000000002
index: 9000000000 value: 18000000002

Now, the first thing you'll probably notice if you haven't seen it before is that in Java indexes start at 0, and in Smalltalk they start at 1. These values ​​are correct. We just need to remember this when comparing results.

Another point is that the amounts long And BigInteger are different. But why?

The following test will show us the reasons:

@Test
public void longMaxValue()
{
    BigInteger maxLong = BigInteger.valueOf(Long.MAX_VALUE);
    BigInteger sum = new BigInteger("100000000010000000000");
    System.out.println(maxLong);
    System.out.println(sum);
}

Results:

9223372036854775807    // Max Long Value in Java
100000000010000000000  // Sum of 10 billion long values

The maximum value of long is two digits shorter than the sum. The sum of ten billion numbers multiplied by 2 is a value greater than the maximum value long in Java. So I needed to collect BigInteger for the sum of numbers from String – the value itself is too large for long. I didn't initially intend that this test would result in a value overflow longthis is more typical for values int. Maximum value long HUGE, but, as practice has shown, not huge enough.

When I realized that the amount was incorrect, I decided to try using Collectors2.summingBigInteger() Collector from Eclipse Collections. This did a good job with one drawback – I needed to pack values long from LongStream V Long and then again at BigInteger. Maybe I could figure out a way to just use the method collect on LongStreamif I had continued to deal with the code a little longer, but this would have required mutating the result, so I decided not to bother.

Reflections on Limitations

Most developers currently do not need arrays longer than 2³¹-1 elements. This will most likely be true next year as well. But over the course of 5, 10, 20 years, this figure will gradually affect more and more developers as work with large volumes of data becomes more common. Collections of 2³¹-1 elements can become more difficult to use. We already have ready-made solutions for Java, if we need a larger amount int or long. Sometimes it can be difficult to write fast and correct code when you are limited by primitives or arrays.

In the late 1980s and early 1990s, no one needed more than 640K RAM. We can now purchase 128GB on custom laptops. Our great-grandchildren will probably laugh when they hear how we suffered using 64-bit computing. Hardware trends tell us that progress does not stop.

When we encounter limitations in hardware or software, we have to find solutions. I've worked in memory-constrained environments since I started professional development in the late 1980s. Hardware and software have always found a way to keep up with the times. And when that doesn't happen, we have to get creative.

Java has a lot of important issues that need to be addressed and I'm not sure when the maximum array size issue will become a priority. I'm guessing within the next decade. Until that happens, libraries like fastutil can help solve problems now.

The thing that continues to fascinate me about Smalltalk is its ability to adapt. Add a little more and it would seem that you should be flying off a cliff, but no – Smalltalk will pleasantly surprise you and return a different type that can handle your request. And I often miss this feature of his.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *