Use Arc instead of Vec

In this article I would like to talk about why you might choose to use Arc<[T]> instead of Vec<T> as the default option in your Rust code.

Use Arc<[T]> instead of Vec

Arc<[T]> could be a very good replacement Vec<T> for immutable data. So, if you are generating a large sequence of data that you will never change later, you might want to think aside Arc<[T]>. It's also really good if you plan to store a lot of these objects; or you plan to simply move or copy them en masse while the program is running.

I'm not talking about now Vec<T>, which you formed as a local variable inside a function or which will be used as something temporary. This is more about cases where you plan to store data for a long period of time, especially if the data implements Clone.

This shouldn't be a big revelation since Clone – something like a super power Arc, which we will talk about briefly now. If you're going to be constantly copying huge sequences of immutable data, Arc can significantly speed up this operation compared to Vec<T>. You can even go further and use Box<[T]>but we'll talk about this at the very end.

Translator's note

Further in the text the term “copying” will often be used, but in this article it means exclusively the operation Clone::clone() – do not confuse with trait Copy!

Why Arc<[T]>?

So why do I recommend Arc<[T]>? There are three good reasons for this. We have already touched on the first one a little. U Arc cheap copying performed in constant time. That is, it doesn't matter how big the data you point to is Arccopying them will always take the same time – the time of incrementing an integer variable and copying the Arc-pointer. And it's very, very fast and doesn't involve any memory allocation that usually occurs when copying Vec. This is a significant optimization that can be obtained simply by switching to Arc<[T]>.

Another reason – Arc<[T]> takes only 16 bytes; it only stores the pointer and size of the data, unlike Vec<T>, which needs to store a pointer, size and capacity, which is a total of 24 bytes. A difference of just eight bytes is not much – but if you store a huge number of such containers, especially in a structure or array, then in the case of Vec<T> this extra space can add up and degrade your cache locality, making traversing your data a little harder and slower.

Finally, the third reason – Arc<[T]> implements the trait Deref<[T]>just like it does Vec<T>. That is, you can perform all the same read-only operations c Arc<[T]>what did you do with Vec<T>. You can take its length, iterate, or access by index. This is important because the first two reasons are good, but I would probably not recommend switching to Arc<[T]>if it were much more difficult to use than Vec<T>but that's not the case thanks Deref<[T]>.

Arc also implements a large number of other traits that you might need, and does it all Arc<[T]> interchangeable with Vec<T> in most circumstances.

Thus, the first two advantages are more likely about some increase in productivity; the third is about what to use Arc<[T]> no more difficult than Vec<T>so you can just do a quick refactoring and switch to Arc<[T]>.

Arc vs String

So we talked about Arc<[T]> compared to Vec<T>. For the rest of the article I will talk about Arc<str> and its advantage compared to String, because this couple shares a similar dichotomy. But using her example it’s a little easier to show the essence and use more real-life examples. But in general, all I can say about Arc<str> vs String will relate directly to Arc<[T]> vs Vec<T>.

It's also worth mentioning that when you plan to use Arcfirst try using Rc. For example, if you don't need the thread safety that gives you Arc. The narrative gives all the laurels Arc just because this is a more general version of the approach, but if it suits you too Rcyou should definitely use it as it has less overhead compared to Arc.

There's one more thing worth clarifying. I want to clarify what I'm talking about Arc<str> – not about Arc<String>. Arc<String> actually has some advantages over Arc<str>, but on the other hand it also has a very significant drawback: you need indirection of as many as two pointers to get to the data you need – the character data of the string. I'll show a visualization of this situation at the end, but Arc<String> overall it's just cumbersome and memory inefficient, and the aforementioned double indirection is a serious performance hit, so I don't recommend using Arc<String>.

Therefore, we will talk specifically about Arc<str>. The power of wide pointers in Rust is that Arc has the ability to point directly to a dynamically allocated text buffer on the heap.

MonsterId

Let's imagine that I'm working on a game and I have a type struct MonsterId(String). Under the hood it is presented as plain text and as we can see I am using Stringto store it.

Translator's note

I have the strong impression that the author of the original video rather meant MonsterTyperather than MonsterIdsince most further examples and uses MonsterId hint that it is not a specific monster created, but rather a type of monster that is meant.

However, we will continue to use MonsterId in the subsequent narration, to pay tribute to the original, while keeping in mind that if your picture doesn’t add up a little, then you can mentally replace MonsterId on MonsterTypeand perhaps you will feel relief.

The first thing I might want to implement for my MonsterIdthis is a pack of typical traits:

#[derive(Clone, Debug, PartialEq, Eq, Hash, Ord, PartialOrd, Serialize, Deserialize, /* ... */)]
struct MonsterId(String);

As you can see, I want to be able to copy our id, display its debugging representation, compare it, hash it. I also want to use serde for serialization and deserialization, possibly from the config or from some save data. Basically, I want to inherit all these traits for my MonsterIdand for it all to work with String.

Next, I want a method for MonsterIdwhich simply outputs the internal text representation as &str. Maybe I want to log it to the console, maybe I want to use it for analytics somewhere else:

fn as_str(&self) -> &str {
	&self.0
}

That is, I just want to reach the internal representation of the data in the form &strand this is easy enough to do if you have String.

Next, I will need a config with data in the form HashMap With MonsterId as a key and some stats as data – well, you know: the monster's damage, its health and stuff like that:

enemy_data: HashMap<MonsterId, EnemyStats>

Key – MonsterIdthat's why I needed the implementation Eq And Hashand it all works great with String as an internal representation MondterId.

The next thing we might want to do is store the IDs of all the enemies that have ever spawned in a single game session in a list:

enemies_spawned: Vec<MonsterId>

Note that I use here Vec<T>since monsters are supposed to be added here as the game progresses and I'll probably have to copy here MonsterId. The list could be significantly larger.

Next, we need some functionality to create a real monster by MonsterId:

fn create(id: MonsterId) -> EnemyInstance {
	...
}

I'm probably going to copy MonsterId it is possible to copy into this function MonsterId inside the monster instance so that each monster knows who it is.

And finally, let's say I will store some statistics inside BTreeMap – for example, how many times a certain MonsterId was killed during a gaming session.

total_destroyed: BTreeMap<MonsterId, u64>

I'm using here BTreeMap just to make it different from what we already use HashMapand we have the option to use MonsterId as a key in BTreeMap because he implements Ord.

In general, there is a sufficient number of different cases; and of course there can be much more of them. This is just an example of how a basic type like MonsterId may begin to permeate your entire system, and you will have to copy it everywhere while the program is running. At some point you may need to use all of these structures above at once, and even more, and at some point all these copy operations MonsterId will begin to affect productivity, and all this will begin to grow like a snowball.

String

So let's look at the cost of copying MonsterId and its total memory consumption when used String as an internal representation. Here's how String represented in memory:

Real text data in String represented as a sequence of characters on the heap, and String allocates enough memory for your text, which is obvious, but it also allocates a little more space so that the line can grow to some limit without re-allocating memory.

We have the string “Goblin”

with an additional four bytes of memory unallocated in case String will be longer, although of course in this particular case we know that won't happen since Goblin is the monster's definitive name.

Next, the object itself String consists of three eight-byte fields:

  • pointer to data ptr

  • data lengths len

  • capacity cap

ptr points directly to the memory allocated for our string, len stores the text size “goblin”, that is, equal to six, and cap includes the size of all memory allocated for the object and in our case is equal to ten.

Now let's see what copying looks like String. First we need to duplicate the entire character array. This will allocate a new chunk of memory on the heap and then copy all the characters into it, which will take linear execution time—in other words, it will take longer the longer your string is. Then a new object will be created String on the stack and it will reference a new char array on the heap.

Note that this time the row reserve space has been discarded because the duplicate row will likely not grow, leaving us with a completely full data buffer. Therefore, in this case we get capequal to six – and it is redundant because it is now identical to len.

If we want to make another copy, we will have to repeat the same thing: allocate memory for a new buffer, copy all the characters of the line into it, and then create an object on the stack String, which will reference it. And we will again have a redundant field that we don’t need cap.

I assume you understand that the process of allocating memory is quite an expensive operation, and objects String taking up more space than we would like for our simple goblin storage task.

Arc

Now, let's look at the same case, but using Arc<str> instead of String. WITH Arc<str> the data on the heap looks a little different than in the case of String:

So, we have two eight-byte fields: one for counting strong links, the other for counting weak links, then comes the actual text data. It looks long and weird at this point, but don't worry, it will get better.

Our object Arc on the stack consists of just a pointer and data length:

It's only 16 bytes because it doesn't have the extra capacity field that it had. String. Let's see what happens when copying:

All we had to do was copy the structure on the stack and increment the reference count, which is now two. Notice that we haven't made any new memory allocation; we didn't do deep copying of text data like in String. We're simply referencing the same text data in two different places.

That is, we can now make copies very, very cheaply:

And the fact that text data is divided between different Arc-objects also increases the chances that they will be in the cache when we request them, because they are loaded into the cache every time I use any of the four Arc-objects from the image above. Unlike Stringwhere the memory we're pointing to would only be loaded into cache if we've recently read from that particular object String.

It can be seen that this approach is generally lighter. Arc-pointers themselves are smaller, meaning more of them can fit into the same amount of memory, and frankly, I think this is just a smarter use of resources for immutable strings that are copied throughout our code.

Also, these two additional fields are − strong And weak – for which we pay on a heap, isolated from each individual we have Arc-object, therefore their presence is, as it were, amortized, whereas in the case of String we are forced to pay for each such field, because it is on the stack. These are two additional fields on the heap – and two, of course, are more than one – but they are still divided between each instance Arc-an object referencing our string. So in the end it's a lot less memory consumption compared to String.

So let's take a look at what the transition looks and feels like. String on Arc<str> For MonsterId:

#[derive(Clone, Debug, PartialEq, Eq, Hash, Ord, PartialOrd, Serialize, Deserialize, /* ... */)]
struct MonsterId(Arc<str>);

Firstly, all of our implemented traits work the same as they worked before. You can copy Arc<str>display debug print Arc<str>, compare it, hash it. You can also serialize and deserialize it. All this works exactly the same as it worked before.

What about accessing text data directly? Method as_str() won't have to change because Arc<str> implements Deref<str> as well as String. We can just take a link to it and it will automatically turn into &str. That is, nothing in this function needs to be changed either – it works perfectly.

Next, what about our dictionary? HashMap<MonsterId, EnemyStats>? Well, our object MonsterId still implementing Eq And Hashbecause Arc<str> also implements them, so nothing needs to be changed here.

In fact, I would argue that because of all this, in this case, the use Arc<str> more justified than using Stringbecause the String gives you an extensive API for modifying the internal text representation, but you cannot modify the key HashMapbecause you might break data invariance. Representing your key as an immutable type like Arc<str>is more appropriate in this situation than String.

Let's move on. My Vec<MonsterId> will now be more cache-friendly, because I removed a whole extra field for each MonsterIdand this is two thirds of the old size MonsterId. My function to create an enemy instance by MonsterIdis probably more efficient now since I guess it involves some degree of copying, which is now done more efficiently with Arc.

And finally our BTreeMap – with her, essentially everything is the same as with HashMap; use of immutable data such as ours Arc<str>just more suitable for our key, since keys BTreeMap cannot be modified in any way. BTreeMap will now be more efficient because it stores less data and does not have to pay for the risks of being mutable when mutability does not make sense. Thus, Arc<str> wins in all cases. It's more efficient and easy to switch to.

Vec and String

Why not use Vec<T> or String? It’s worth making a reservation here – I’ve already hinted, but now I’ll say clearly that Vec<T> And String – for modification. To add and delete, expand and shorten data. And if you don't need it, don't use them because they have an extra cost for what they provide:

fn push(&mut self, ch: char)
fn extend<I>(&mut self, iter: I)
fn pop(&mut self) -> Option<char>
fn retain<F>(&mut self, f: F)
fn truncate(&mut self, new_len: usize)
fn reserve(&mut self, additional: usize)
fn resize(&mut self, new_len: usize, value: T)
fn drain<R>(&mut self, range: R) -> Drain<'_, T, A>
fn clear(&mut self)
fn shrink_to_fit(&mut self)

All these methods Vec<T> And String interesting because they all accept &mut self, that is, they modify their data. If you don't need these features, don't use them. Vec<T> And String.

If you just need to look at some data, calculate its size, see if the data buffer is empty, take the data at an index, iterate through it, partition it, search through it – all these things are provided directly str And [T]both of which you can easily take through Arc:

fn len(&self) -> usize
fn is_empty(&self) -> bool
fn get(&self, index: usize) -> Option<&T>
fn iter(&self) -> Iter<'_, T>
fn split_at(&self, mid: usize) -> (&[T], &[T])
fn strip_prefix<P>(&self, prefix: P) -> Option<&[T]>
fn contains(&self, x: &T) -> bool
fn binary_search(&self, x: &T) -> Result<usize, usize>
fn to_vec(&self) -> Vec<T>
fn repeat(&self, n: usize) -> Vec<T>
impl<T, I> Index<I> for [T]
impl<T> ToOwned for [T]
impl<T> Eq for [T]

So you don't need all the power String And Vec<T>if you only need read-only operations and you can benefit in performance by simply not paying for String And Vec<T>. This is why you might consider using Arc<[T]> instead of Vec<T> or Arc<str> instead of String.

Arc

Now, I want to show you Arc<String>since I mentioned that he is bad, and here's why. Arc<String> starts exactly the same as Stringwith a buffer of text and some extra spare space:

but then we also have to put the String inside Arc:

So you see that at the beginning we have two fields for reference counting, then comes the object data String. An object Arc will point to this data, and copy will copy the link to String. But if we want to get to the “Goblin” text data, we'll have to jump to Stringand then jump from String on “Goblin”, and all these manipulations are simply cumbersome and inconvenient, so Arc<String> is a bad idea when we can just use Arc<str>.

Box

Lastly, I mentioned that if you don't need copying, you can go even further and use Box<str> instead of Arc<str>and this is essentially about as efficient as it can be, assuming you don't need to copy the object.

There is no extra reserved memory here, Box it simply allocates the necessary data on the heap, references it, and knows what size it is. You probably can't do any better than this, but when you copy an object like this, you'll be doing a deep clone of all the data on the heap.

But if your data doesn't support Clone, then this is the best option in terms of memory efficiency. So take a look Box<str>if this is your case.

Afterword from the translator

The moment I decided to translate this video from Logan SmithYouTube algorithms gave me reaction by ThePrimeagen for this very video. This reaction turned out to be noteworthy in that ThePrimeagen asked a fair question that may have surfaced among some readers: “How do you even create or initialize Arc<str>?”.

And indeed, we are accustomed to the fact that we are dealing with the type &str – Not str. What is it in essence strand how to create it, how to work with it? IN documentation about the type str the following is written:

The str type, also called a 'string slice', is the most primitive string type. It is usually seen in its borrowed form, &str. It is also the type of string literals, &'static str.

OK, str is a “string slice”, and in its pure form it is quite difficult to catch, since when declaring a bare non-owned string it will have the type &strNot str:

let hello_world = "Hello, World!"; // тип &str

This confused ThePrimeagen as well, and he mistakenly assumed that Arc<str> can only be created from strings known at compile time, for example:

let s: &'static str = "Hello, World!";
let arc_str: Arc<str> = Arc::from(s);

But to his surprise, the generating code worked in exactly the same way. Arc<str> from String:

let s = String::from("Hello, World!");
let arc_str: Arc<str> = Arc::from(s);

All because for Arc a number of traits have been implemented that, under the hood, do all the magic for us. In particular, these are the traits:

Their implementation allows you to conveniently create Arc with all data types mentioned in the article:

// from String
let unique: String = "eggplant".to_owned();
let shared: Arc<str> = Arc::from(unique);
assert_eq!("eggplant", &shared[..]);
// from &str
let shared: Arc<str> = Arc::from("eggplant");
assert_eq!("eggplant", &shared[..]);
// from Vec<i32>
let unique: Vec<i32> = vec![1, 2, 3];
let shared: Arc<[i32]> = Arc::from(unique);
assert_eq!(&[1, 2, 3], &shared[..]);

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *