How Unity abandoned its strings

In 2014, the Unity engine went through so many critical changes and new features that version 5 was effectively a different engine. Many people hardly noticed, because the façade stayed the same, but the changes affected every component of the engine, from the file system to rendering. The St. Petersburg office of EA had its own branch of the main repository, lagging behind master by a month at most. I have already written about various implementations and types of strings in game engines, but Unity had its own implementation, used in almost all subsystems, with both strengths and weaknesses. People were used to it and knew its weak spots, its bad “use cases” and its good “best practices”. So when this system began to be cut out of the engine, a lot of things broke. Ordinary users moved straight to the new version and saw only echoes of the storm, while those with access to the engine's internals caught a lot of cool bugs.

The engine implemented COW (copy-on-write) strings, which were fashionable and convenient at the time. Fashionable, because both Qt and GCC had their own implementations and pushed them toward the standard; that did not happen, which was for the best. Convenient, because creating and copying such strings cost almost no allocations.

The main difference from the usual implementation of this mechanism in Qt/GCC was partial data sharing. That is, given two strings “abcde” and “abc”, the second one referred to the first one's buffer but stored its own length. When profiling a level in `Sims Mobile`, there were about 3k string allocations at startup, and then about one new string allocation every 40-50 frames, roughly once a second. This system absorbed all creation and copying of new strings. To understand how cool that was, compare it with a similar level on a PC in an internal tech demo on a fresh UE4, which produced about 200 allocations per frame on strings alone. Every frame! A not-so-recent iPhone 5 simply died trying to digest all that on Unreal.
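The partial sharing described above can be sketched roughly as follows. This is a minimal illustration and not the engine's code: `SharedStr` and all of its members are invented names, and the sketch assumes the shorter string is a prefix of the longer one.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical sketch: several strings share one immutable buffer,
// and each instance only remembers how many characters it may see.
class SharedStr
{
    std::shared_ptr<const std::string> m_buffer;  // shared, never modified
    std::size_t m_length;                         // visible part of the buffer

    SharedStr( std::shared_ptr<const std::string> buffer, std::size_t length )
        : m_buffer( std::move( buffer ) ), m_length( length )
    {}

public:
    explicit SharedStr( std::string text )
        : m_buffer( std::make_shared<std::string>( std::move( text ) ) ),
          m_length( m_buffer->size() )
    {}

    // Creates a shorter string over the same buffer: no new buffer, no copy.
    SharedStr prefix( std::size_t length ) const
    {
        assert( length <= m_length );
        return SharedStr( m_buffer, length );
    }

    std::size_t length() const { return m_length; }

    // Copies the visible characters out; used here for inspection only.
    std::string toStdString() const { return m_buffer->substr( 0, m_length ); }

    bool sharesBufferWith( const SharedStr& other ) const
    {
        return m_buffer == other.m_buffer;
    }
};
```

With this layout, `SharedStr( "abcde" ).prefix( 3 )` sees “abc” while pointing at the same buffer as the full string, so the only per-string cost is the stored length and a reference count.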

Why COW

The basic idea of COW is to share one data buffer between different string instances and to make a copy only when the data in a particular instance changes. The main cost of such an implementation is an extra level of indirection when accessing the string's contents. Unity had a COW implementation since the very first version, judging by the commit history. There were stories that Joachim Ante himself (the company's CTO) personally designed and wrote this class, along with the engine's entire localization system. The first implementation commits did date back to 2006-2007, but they carried no authorship, so I am passing the story on as I heard it.

Why was it removed?

The reason was that the engine code had started to be rewritten in C++11, new code was moving to std::string in places, and a serious mismatch had emerged between the design of std::string and the in-house COW implementation. The standard library was used more and more in the engine, and in some places this led to COW strings being handled as `const char*` and passed around as raw data. In effect, you took the raw pointer out of a shared_ptr and kept working with it while the smart pointer went on living its own life. A crash was then only a matter of a few frames.

A COW string has two possible states: exclusive ownership of the buffer, or sharing the buffer with other COW strings. Assignment and copy operations can move it into the shared state and back. But before a “write” operation is performed, the string must be in the owning state, and that transition creates a new, exclusively owned buffer and copies the contents of the shared buffer into it.

In a string designed for COW, every operation is either non-modifying (a “read”) or directly modifying (a “write”). This makes it easy to decide whether the string must be brought into the owning state before the operation. In std::string, however, references, pointers, and iterators to mutable contents are handed out much more freely, because each string has exclusive ownership of its buffer, in COW terms. Even simple indexing of a non-const string (s[i]) returns a reference that can be used to modify the string.
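To make the last point concrete, here is a small sketch that uses only std::string. The helper `replaceFirst` is an invented name for the illustration. It shows that the reference returned by indexing a non-const string can be written through later, so at the moment of the call the string cannot tell a read from a write.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Indexing a non-const std::string yields a mutable reference,
// indexing a const one yields a read-only reference.
static_assert( std::is_same<decltype( std::declval<std::string&>()[0] ), char&>::value,
               "non-const operator[] returns char&" );
static_assert( std::is_same<decltype( std::declval<const std::string&>()[0] ), const char&>::value,
               "const operator[] returns const char&" );

// Hypothetical helper: the call to operator[] looks like a read,
// but the returned reference is used for a write afterwards.
inline std::string replaceFirst( std::string text, char replacement )
{
    assert( !text.empty() );
    char& first = text[0];   // a "read" as far as the string can tell
    first = replacement;     // the actual write happens later
    return text;
}
```

For example, `replaceFirst( "Hurry", 'B' )` returns “Burry”, and nothing in the `text[0]` call itself signals that a modification is coming.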

Therefore, for a non-const std::string, each such operation can effectively be considered a “write” operation and should be treated as such in a COW implementation. For example, below is the base code of the class that was used in the engine; I will not touch on the problems of initialization from literals. This code shows how assignment and copying have been reduced to almost nothing:

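#include <cstddef>
#include <iostream>
#include <memory>
#include <vector>

using namespace std;

// Assumed definitions: the article does not show the engine's originals.
using USize = size_t;
template< USize n, class Item > using Raw_array_of_ = Item[n];

using C_str = const char*;
using C_ref = const char&;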

namespace uengine
{
    class UString
    {
        using Buffer = vector<char>;

        shared_ptr<Buffer> m_buffer;
        USize m_length;

        void ensureIsOwning()
        {
            if( m_buffer.use_count() > 1 )
            {
                m_buffer = make_shared<Buffer>( *m_buffer );
            }
        }

    public:
        C_str c_str() const
        { 
          return m_buffer->data();
        }

        USize length() const
        { 
          return m_length;
        }

        C_ref operator[]( const USize i ) const
        { 
          return (*m_buffer)[i]; 
        }

        char& operator[]( const USize i )
        {
            ensureIsOwning();
            return (*m_buffer)[i];
        }
        
        template< USize n >
        UString( Raw_array_of_<n, const char>& literal ):
            m_buffer( make_shared<Buffer>( literal, literal + n ) ),
            m_length( n - 1 )
        {}
    };
}

This uses the default assignment operator, which simply copies m_buffer and m_length. Copy construction works in exactly the same way. Now let's look at an example of correct use of such strings:

int main()
{
    UString str = "Unreal the best engine ever!";
    C_str cstr = str.c_str();
    
    // contents of `str` are not modified.
    {
        const char first_char = str[0];
        auto ignore = first_char;
    }
    
    cout << cstr << endl;
}

The COW string is in the owning state, and initializing first_char simply copies the character's value, so everything is fine. But if a developer accidentally adds a logical copy of the string without changing its value, as happened all the time when working with std::string, the problems begin:

int main()
{
    UString str = "Unreal the best engine ever!";
    C_str cstr = str.c_str();
    
    // contents of `str` are not modified.
    {
        UString other = str;
        // .... some works

        const char first_char = str[0];
        auto ignore = first_char;
        // .... some works
    }
    
    cout << cstr << endl;      //! Undefined behavior, cstr is dangling.
}

Because str is in the shared state, the COW principle forces the operation str[0] to create a copy of the shared buffer in order to enter the owning state. Then, at the end of the block, the only remaining owner of the original buffer, the string other, is destroyed and takes the buffer with it. This leaves the pointer cstr dangling. The example is close to the real cases, of which we caught dozens during the transition period. The strangest ones mixed std::string and UString: some of the data stayed on the stack, remained accessible for a while, and at some point turned into garbage. As a result, the editor thought for a moment, produced something like the screenshot below, and crashed without dumps.

Godbolt (error example)

This was treated as programmer error and ignorance of the engine's basics, but in fact the type was simply poorly designed, which made it very easy to misuse. To fix such bugs, if they were to be fixed at all, the string would have to enter the owning state on any access to its elements. That means copying the string data every time a reference, pointer, or iterator is handed out, regardless of the string's constness. Attempts to do this in the engine wiped out all the benefits of the mechanism and left only the drawbacks: a class and a set of algorithms that are not the easiest to implement and had to be maintained, and the need to handle that class with care.

Somewhere after 4.3 and closer to 4.6, the tech leads admitted that the cost of maintenance had become too high, and the remaining benefits too small, to keep supporting their own implementation of COW strings in the engine. By then, string_view and a cheap implementation of short strings had also arrived in the main compilers.

About threads

You may recall a fairly common misconception that COW strings did not work well with threads, or that they were inefficient: with this approach, copying a string did not produce an actual copy, so another thread could freely access the data and change it independently of the main one.

To let string instances be used from different threads while still sharing a buffer, almost every access function, including simple indexing with [], would have to use a mutex.

The engine used a simple solution: the assignment operator checked the index of the current thread, and if it did not match, a new copy of the string was created. This caused some inconvenience, of course, but such cases were quite rare, and I can’t remember any errors associated with it.
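The thread check described above might look roughly like this. It is a sketch under assumptions and not the engine's code: `ThreadAwareStr` and its members are invented names, std::thread::id stands in for the engine's thread index, and the sketch assumes the source string is not being modified while it is copied.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <thread>

// Hypothetical sketch: the buffer is shared only between strings created
// on the same thread; copying from another thread makes a real copy.
class ThreadAwareStr
{
    std::shared_ptr<std::string> m_buffer;
    std::thread::id m_owner;   // thread that created this instance

    static std::shared_ptr<std::string> shareOrCopy( const ThreadAwareStr& other )
    {
        if( other.m_owner == std::this_thread::get_id() )
            return other.m_buffer;                                // same thread: share
        return std::make_shared<std::string>( *other.m_buffer );  // other thread: deep copy
    }

public:
    explicit ThreadAwareStr( const char* text )
        : m_owner( std::this_thread::get_id() )
    {
        assert( text != nullptr );
        m_buffer = std::make_shared<std::string>( text );
    }

    ThreadAwareStr( const ThreadAwareStr& other )
        : m_buffer( shareOrCopy( other ) ),
          m_owner( std::this_thread::get_id() )
    {}

    ThreadAwareStr& operator=( const ThreadAwareStr& other )
    {
        if( this != &other )
        {
            m_buffer = shareOrCopy( other );
            m_owner = std::this_thread::get_id();
        }
        return *this;
    }

    const std::string& text() const { return *m_buffer; }

    bool sharesBufferWith( const ThreadAwareStr& other ) const
    {
        return m_buffer == other.m_buffer;
    }
};
```

Copies made on the creating thread stay cheap, while a copy made from any other thread pays for one allocation and in return never touches a buffer that the first thread might still detach or free.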

Immutable Strings

This data type performed best on immutable strings, such as string hashes, identifiers, and keys, which made up the vast majority in the engine code. These are strings that are not expected to undergo operations that change their data. Such strings can still be assigned, but you cannot change the string data directly, for example by replacing the “H” with a “B” in the word “Hurry”. The engine's COW strings supported amortized constant-time initialization from string literals, with a hash key for comparison operations, and various constant-time substring operations, which suited them for use as a key in a map, for example. This was probably the biggest advantage of such COW strings: no string comparisons when searching in an array or a map. In Unity 5, development began to move away from reinvented wheels and custom solutions, even when this reduced performance and increased memory consumption, as is the case with standard library containers. Now the engine is completely based on the standard library.
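The idea of comparing by hash key instead of by characters can be sketched as follows. Everything here is an assumption made for illustration: `HashedName`, `HashedNameHasher` and `NameTable` are invented names, and 64-bit FNV-1a is simply a convenient choice, since the article does not say which hash the engine used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch: the hash is computed once at construction,
// and later comparisons and lookups only touch the 64-bit key.
class HashedName
{
    std::string m_text;
    std::uint64_t m_hash;

    // 64-bit FNV-1a (an assumed choice of hash function).
    static std::uint64_t fnv1a( const std::string& text )
    {
        std::uint64_t hash = 14695981039346656037ull;
        for( unsigned char c : text )
        {
            hash ^= c;
            hash *= 1099511628211ull;
        }
        return hash;
    }

public:
    explicit HashedName( std::string text )
        : m_text( std::move( text ) ), m_hash( fnv1a( m_text ) )
    {}

    std::uint64_t hash() const { return m_hash; }
    const std::string& text() const { return m_text; }

    // Compares keys only. Debug builds check that equal keys mean equal text;
    // release builds assume collisions have been ruled out elsewhere.
    friend bool operator==( const HashedName& a, const HashedName& b )
    {
        assert( a.m_hash != b.m_hash || a.m_text == b.m_text );
        return a.m_hash == b.m_hash;
    }
};

// Lets HashedName serve as an unordered_map key without rehashing the text.
struct HashedNameHasher
{
    std::size_t operator()( const HashedName& name ) const
    {
        return static_cast<std::size_t>( name.hash() );
    }
};

using NameTable = std::unordered_map<HashedName, int, HashedNameHasher>;
```

A lookup in `NameTable` then costs one integer hash and integer comparisons, whatever the length of the identifier. The trade-off is that two different strings with the same hash would be treated as equal in release builds, which is why the sketch keeps the debug-only check.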

P.S. I have not been involved in the development of the engine since 2017, but it is unlikely that the course towards unifying software solutions has changed much.

Thank you for reading!