System.String is not what it seems. .NET memory representation of strings

The System.String type is one of the most widely used types in .NET development. In this article I would like to talk about the nuances of its implementation. Let’s start with the basics:

  • System.String is a reference type;

  • System.String is an immutable type. The contents of a string cannot be changed (at least not with safe code). Methods like .Trim, .Insert, etc. do not change the contents of the string they are called on; they return a reference to a new string.
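The immutability point is easy to check from user code. Below is a minimal sketch; the class and method names (ImmutabilityDemo, TrimAndCompare) are made up for illustration and are not part of the article's sources.

```csharp
using System;

public static class ImmutabilityDemo
{
    // Trims the input and reports whether the result is a different object.
    // The input string itself is never modified.
    public static (string Trimmed, bool IsNewObject) TrimAndCompare(string original)
    {
        string trimmed = original.Trim();
        return (trimmed, !ReferenceEquals(original, trimmed));
    }
}
```

For an input such as "  hello  " the result is a separate "hello" object while the input keeps its spaces. When there is nothing to trim, the runtime may simply hand back the same instance, which is safe precisely because strings are immutable.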

Now let’s look a little deeper and see how this works in memory. Strings (like arrays) are not fixed in size. It is worth remembering, though, that by default a single object cannot occupy more than 2 GB of memory, and this restriction applies to both x86 and x64 systems. Normally the GC knows how much space an object needs when it is created, because the size follows from the type and its fields, which don’t change. That is not the case here. Let’s figure it out. Under the hood, a string does not reference a separate array of chars but contains the characters itself. If we turn to the source, we find these two wonderful fields.

// The String class represents a static string of characters.  Many of
// the string methods perform some type of transformation on the current
// instance and return the result as a new string.  As with arrays, character
// positions (indices) are zero-based.
[Serializable]
[NonVersionable] // This only applies to field layout
[System.Runtime.CompilerServices.TypeForwardedFrom("mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089")]
public sealed partial class String
{
    // Rest of the code

    //
    // These fields map directly onto the fields in an EE StringObject.  See object.h for the layout.
    //
    [NonSerialized]
    private readonly int _stringLength;

    // For empty strings, _firstChar will be '\0', since strings are both null-terminated and length-prefixed.
    // The field is also read-only, however String uses .ctors that C# doesn't recognise as .ctors,
    // so trying to mark the field as 'readonly' causes the compiler to complain.
    [NonSerialized]
    private char _firstChar;

    // Rest of the code
}

Link to code.

String.cs is a managed C# source file, but part of the type’s functionality is implemented in native code (C++ or assembly) inside the runtime. The String.cs file contains 9 methods that are marked as extern and annotated with MethodImplAttribute with the MethodImplOptions.InternalCall value. This means that their implementations are provided by the runtime elsewhere.

To understand this better, let’s look at the layout in object.h:

/*
 * StringObject
 *
 * Special String implementation for performance.
 *
 *   m_StringLength - Length of string in number of WCHARs
 *   m_FirstChar    - The string buffer
 *
 */
class StringObject : public Object
{
    // Rest of the code

  private:
    DWORD   m_StringLength;
    WCHAR   m_FirstChar;

    // Rest of the code
}

Link to code.

It turns out that this is how the GC sees a string in memory: the object header and type pointer, then m_StringLength, then m_FirstChar followed immediately by the remaining characters.

That is exactly the layout, because the m_StringLength characters (two bytes each) starting at m_FirstChar hold the string’s data. The actual string data is not stored in a separate array located elsewhere in memory, so no extra pointer dereference is needed to reach it.
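The inline layout can be observed, with some care, from managed code. The sketch below is an illustration only: StringLayoutPeek and ReadTerminator are hypothetical names, and the code deliberately reads one element past the span returned by AsSpan(), relying on the null terminator mentioned in the String.cs comment above. That is an implementation detail of the runtime, not a documented guarantee.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class StringLayoutPeek
{
    // Returns the UTF-16 code unit stored right after the last character of s.
    // Assumption: the span over a string points straight at the inline buffer
    // (the _firstChar field), so the terminator is one element past the end.
    public static char ReadTerminator(string s)
    {
        ref char first = ref MemoryMarshal.GetReference(s.AsSpan());
        return Unsafe.Add(ref first, s.Length);
    }
}
```

On current runtimes this is expected to yield '\0' for any string, including the empty one, while s.Length never counts that extra character.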

Let’s look at the whole picture: how much memory does a string take up in the end?

8 (sync) + 8 (type) + 4 (length) + 2 (null terminator) + 2 * length
It turns out: 22 + 2 * length on x64, rounded up to a multiple of 8

  • 8 bytes – SyncBlock, 8 bytes – method table pointer; everything is clear here.

  • 4 bytes – the m_StringLength member, the actual number of characters in the string.

  • Before .NET 4.0 there were 4 more bytes here, an m_arrayLength field: in those implementations the length of the string could differ from the length of the character buffer it contained. The field was removed in .NET 4.0, which is why the StringObject layout above has only two fields.

  • 2 * length – two bytes per character, starting at m_FirstChar. Strings are always Unicode. This is very important to know and understand: treating a string as if it were in a different encoding is almost always a mistake. Unicode defines more than 65,536 code points, so a single 16-bit System.Char cannot represent all of them. Characters above U+FFFF are therefore stored as surrogate pairs, two char values each. Essentially, a string is a sequence of UTF-16 code units. Most developers rarely need the details, but it is worth knowing about them.

  • 2 bytes – Null terminator. From the API’s point of view strings are not null-terminated (not to be confused with the null keyword): Length does not count the terminator, and a string may contain '\0' characters. The underlying character buffer, however, always ends with a null character, which means it can be passed directly to unmanaged functions without any copying, provided that the interop declaration specifies that the string should be marshaled as Unicode.
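The difference between UTF-16 code units and Unicode code points is easy to see in code. This is a small sketch with invented names (SurrogateDemo, CodeUnits, CodePoints); it only uses char.IsHighSurrogate and char.IsLowSurrogate from the base class library.

```csharp
using System;

public static class SurrogateDemo
{
    // Number of UTF-16 code units, which is what String.Length reports.
    public static int CodeUnits(string s) => s.Length;

    // Number of Unicode code points: a well-formed surrogate pair counts once.
    public static int CodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            bool isPair = char.IsHighSurrogate(s[i])
                          && i + 1 < s.Length
                          && char.IsLowSurrogate(s[i + 1]);
            if (isPair)
            {
                i++; // skip the low half of the pair
            }
            count++;
        }
        return count;
    }
}
```

For the string "a\U0001F600" (the letter a followed by U+1F600), CodeUnits gives 3 and CodePoints gives 2, so the string occupies three 2-byte slots in the inline buffer even though a reader sees two characters.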

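The main difference between x86 and x64 systems is the size of a pointer: 4 bytes on 32-bit systems and 8 bytes on 64-bit systems. The sync block header and the method table pointer are each pointer-sized, so they shrink accordingly on x86. A DWORD, such as the length field, is 4 bytes on both.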

But how does the GC allocate memory for an object whose size it cannot derive from the type alone? The answer is simple: it doesn’t. Usually the runtime allocates memory first, and then the instance constructor is called. With String everything is different: when an instance is created, the construction code itself asks for a block of the right size, passing in the required length. Let’s look at an example.

To build strings, StringBuilder or string.Format (which eventually relies on a string builder as well) is most commonly used. In the end, StringBuilder.ToString() is called, which internally calls String.FastAllocateString:

public override string ToString()
{
    // Rest of the code
    string result = string.FastAllocateString(Length);
    // Rest of the code
}

Link to code.
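FastAllocateString is internal, but the same "allocate with a known length, then fill in place" pattern is available to user code through the public string.Create method. The sketch below is only an illustration of that pattern (StringCreateDemo and Reverse are invented names), and whether string.Create goes through the same allocation helper is a runtime implementation detail that this example does not depend on.

```csharp
using System;

public static class StringCreateDemo
{
    // Builds the reversed string by writing directly into the buffer of the
    // newly allocated result. Works on UTF-16 code units, so surrogate pairs
    // are not kept together; good enough for an illustration.
    public static string Reverse(string source) =>
        string.Create(source.Length, source, (span, src) =>
        {
            for (int i = 0; i < span.Length; i++)
            {
                span[i] = src[src.Length - 1 - i];
            }
        });
}
```

The length has to be known before the call, exactly as in ToString() above: the size of the object is fixed at allocation time, and only then are the characters written.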

Let’s look at FastAllocateString in more detail.

// This class is marked EagerStaticClassConstruction because it's nice to have this
// eagerly constructed to avoid the cost of defered ctors. I can't imagine any app that doesn't use string
[EagerStaticClassConstruction]
public partial class String
{
  [Intrinsic]
  public static readonly string Empty = "";

  internal static string FastAllocateString(int length)
  {
      // We allocate one extra char as an interop convenience so that our strings are null-
      // terminated, however, we don't pass the extra +1 to the string allocation because the base
      // size of this object includes the _firstChar field.
      string newStr = RuntimeImports.RhNewString(EETypePtr.EETypePtrOf<string>(), length);
      Debug.Assert(newStr._stringLength == length);
      return newStr;
  }
}

Link to code.

It turns out that this is just a wrapper. Let’s find RhNewString:

[MethodImpl(MethodImplOptions.InternalCall)]
[RuntimeImport(RuntimeLibrary, "RhNewString")]
internal static extern unsafe string RhNewString(MethodTable* pEEType, int length);

internal static unsafe string RhNewString(EETypePtr pEEType, int length)
            => RhNewString(pEEType.ToPointer(), length);

Link to code.

This method is marked extern and carries the [MethodImpl(MethodImplOptions.InternalCall)] attribute, which means it is implemented by the runtime in unmanaged code. The call ends up in a hand-written assembly function:

;; Allocate a new string.
;;  ECX == MethodTable
;;  EDX == element count
FASTCALL_FUNC   RhNewString, 8

        push        ecx
        push        edx

        ;; Make sure computing the aligned overall allocation size won't overflow
        cmp         edx, MAX_STRING_LENGTH
        ja          StringSizeOverflow

        ; Compute overall allocation size (align(base size + (element size * elements), 4)).
        lea         eax, [(edx * STRING_COMPONENT_SIZE) + (STRING_BASE_SIZE + 3)]
        and         eax, -4

        ; ECX == MethodTable
        ; EAX == allocation size
        ; EDX == scratch

        INLINE_GETTHREAD    edx, ecx        ; edx = GetThread(), TRASHES ecx

        ; ECX == scratch
        ; EAX == allocation size
        ; EDX == thread

        mov         ecx, eax
        add         eax, [edx + OFFSETOF__Thread__m_alloc_context__alloc_ptr]
        jc          StringAllocContextOverflow
        cmp         eax, [edx + OFFSETOF__Thread__m_alloc_context__alloc_limit]
        ja          StringAllocContextOverflow

        ; ECX == allocation size
        ; EAX == new alloc ptr
        ; EDX == thread

        ; set the new alloc pointer
        mov         [edx + OFFSETOF__Thread__m_alloc_context__alloc_ptr], eax

        ; calc the new object pointer
        sub         eax, ecx

        pop         edx
        pop         ecx

        ; set the new object's MethodTable pointer and element count
        mov         [eax + OFFSETOF__Object__m_pEEType], ecx
        mov         [eax + OFFSETOF__String__m_Length], edx
        ret

Link to code.

This confirms what we discussed earlier: the assembly code computes the allocation size from the length passed in by the caller, reserves exactly that much memory for the string, and writes the method table pointer and the length into the new object.
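The fast path of the assembly listing can be paraphrased in C# to make the arithmetic easier to follow. This is a rough model, not the runtime's code: BumpAllocatorSketch is an invented class, addresses are plain numbers, and the constant values (a base size of 14 bytes, 2 bytes per element, a maximum length of 0x3FFFFFDF) are assumptions for a 32-bit target, since the listing only shows their symbolic names.

```csharp
using System;

public sealed class BumpAllocatorSketch
{
    // Assumed values; the real ones come from the runtime's headers.
    private const uint StringBaseSize = 14;         // STRING_BASE_SIZE
    private const uint StringComponentSize = 2;     // STRING_COMPONENT_SIZE
    private const uint MaxStringLength = 0x3FFFFFDF; // MAX_STRING_LENGTH

    private uint _allocPtr;             // models alloc_ptr of the thread's context
    private readonly uint _allocLimit;  // models alloc_limit

    public BumpAllocatorSketch(uint start, uint limit)
    {
        _allocPtr = start;
        _allocLimit = limit;
    }

    // align(base size + element size * elements, 4), like the lea/and pair.
    public static uint AllocationSize(uint length) =>
        (length * StringComponentSize + StringBaseSize + 3) & ~3u;

    // Returns the address of the new object, or null where the assembly
    // would jump to one of its slow-path labels.
    public uint? TryAllocateString(uint length)
    {
        if (length > MaxStringLength) return null;  // StringSizeOverflow

        uint size = AllocationSize(length);
        uint newPtr = unchecked(_allocPtr + size);
        if (newPtr < _allocPtr) return null;        // carry set: address wrapped
        if (newPtr > _allocLimit) return null;      // context has no room left

        _allocPtr = newPtr;                         // set the new alloc pointer
        return newPtr - size;                       // start of the new object
    }
}
```

With a context spanning addresses 1000 to 1040, a 5-character string needs 24 bytes and lands at 1000; a second 5-character request no longer fits and falls off the fast path, while an empty string (16 bytes) still fits at 1024. In the real runtime the slow path asks the GC for a new allocation context instead of returning nothing.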

On the one hand, strings are a fundamental type, which is why they should be optimized as much as possible. On the other hand, for such a basic type, strings (and text data in general) are more complex than you might initially think. The information in this article is not exhaustive, but it should help you better understand what happens under the hood of your projects.
