Text strings in programming languages

Long gone are the days when text strings in programming languages were nothing more than byte strings with no support for the characters of national alphabets, and in some cases were even limited to 255 characters. Nowadays, on the contrary, it is hard to find a programming language that does NOT “support” Unicode in its text strings.

As you may have noticed, the word “support” is in quotation marks, and, as Winnie the Pooh used to say, that is no accident: with the advent of Unicode, the concept of a “character” in a text string has become rather ambiguous.

There is an old article about Unicode support issues in different programming languages: “The importance of language-level abstract Unicode strings” by Matt Giuca.

Its main point is to urge the designers of programming languages to abstract away from Unicode encoding schemes (and from access to individual bytes) and to leave programmers only the ability to work with a sequence of characters, in order to prevent most Unicode errors, because with the advent of the Unicode era the very concepts of “character” and “text string” have changed!

The Unicode Consortium has provided us with an excellent standard for representing and transmitting characters from all the scripts of the world, but most modern languages unnecessarily expose the details of how those characters are encoded. This means that every programmer has to become a Unicode expert in order to write high-quality internationalized software.

Next-generation languages should provide only character-oriented string operations (unless the programmer explicitly asks for a specific text encoding). Then the rest of us can get back to programming instead of worrying about the encoding of Unicode strings.

Terminology

A code point is roughly what we used to call a character, but not quite. For example, the letter “ё” can be either one code point or two: the letter “е” plus the combining character “two dots above the previous letter”.

A code unit is the minimal unit of a particular encoding form (UTF-8, UTF-16 or UTF-32).
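
A minimal sketch of the difference, assuming a UTF-8 execution character set (the default for gcc and clang): the same visible letter “ё” can be one code point stored in two UTF-8 code units, or two code points stored in four.

    #include <iostream>
    #include <string>

    int main() {
        // Precomposed form: one code point U+0451, two UTF-8 code units (bytes)
        std::string composed   = "\u0451";       // "ё"
        // Decomposed form: two code points U+0435 + U+0308, four UTF-8 code units
        std::string decomposed = "\u0435\u0308"; // "е" + combining diaeresis

        std::cout << composed.size()   << '\n';  // prints 2 (bytes, not "characters")
        std::cout << decomposed.size() << '\n';  // prints 4, yet both render as "ё"
    }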

Accordingly, with the advent of Unicode, the following problems appeared:

  • Text strings can have different code units (UTF-8: code unit = 8 bits or 1 byte; UTF-16: code unit = 16 bits or 2 bytes; UTF-32: code unit = 32 bits or 4 bytes)

  • A single code point can occupy a different number of bytes (from 1 to 4 bytes in UTF-8)

  • Some characters can be encoded with a different number of code points, for example “е” + combining diaeresis (U+0308) == “ё”. The two-code-point form takes up more space, yet still represents a single character.

  • Problems with indexing strings (by byte or by character). Accessing the n-th character of a Unicode string becomes O(n) instead of O(1) as with an array, because you have to scan the string and count the Unicode characters (see the sketch after this list).

  • String data has to be validated during serialization / deserialization (handling encoding-conversion errors and checking the validity of code points)
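
To illustrate the indexing problem, here is a minimal sketch (the helper name is hypothetical, and the input is assumed to be valid UTF-8): to find out how many characters a UTF-8 byte string contains, you have to walk the whole string.

    #include <cstddef>
    #include <string>

    // Counts code points in a UTF-8 byte string by skipping continuation bytes.
    // Finding the N-th character this way is O(n), not O(1).
    std::size_t count_code_points(const std::string& utf8) {
        std::size_t count = 0;
        for (unsigned char byte : utf8) {
            // Continuation bytes look like 10xxxxxx; every other byte starts a code point.
            if ((byte & 0xC0) != 0x80) {
                ++count;
            }
        }
        return count;
    }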

And these are just the most basic problems of using Unicode! There are also grapheme clusters, code points that are not characters, searching and sorting (collation), and, for example, modifiers.

Modifiers

The zero-width joiner (ZWJ) is a non-printing character used in the computer representation of some complex scripts, such as Arabic or the Indic scripts. When placed between two characters that would otherwise not be connected, ZWJ causes them to be printed in their connected (joined) form.

The zero-width non-joiner (ZWNJ) is a non-printing character used in the computer representation of scripts with ligatures. When placed between two characters that would otherwise be joined into a ligature, ZWNJ causes them to be printed in their final and initial forms, respectively. It behaves like a space while keeping the words adjacent, and is used, for example, to attach a morpheme to a word without joining the two.
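
ZWJ also shows up in emoji sequences, which is a handy illustration of how far a “character” can drift from a code point. A small sketch, assuming a UTF-8 execution character set for the narrow string literal:

    #include <iostream>
    #include <string>

    int main() {
        // The "family" emoji is MAN + ZWJ (U+200D) + WOMAN + ZWJ + GIRL:
        // one visible glyph on supporting systems, but five code points.
        std::u32string family32 = U"\U0001F468\u200D\U0001F469\u200D\U0001F467";
        std::string    family8  =  "\U0001F468\u200D\U0001F469\u200D\U0001F467";

        std::cout << family32.size() << '\n'; // 5  (UTF-32 code units == code points)
        std::cout << family8.size()  << '\n'; // 18 (UTF-8 code units, i.e. bytes)
    }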

Nevertheless, the vast majority of programming languages can still treat character strings as byte arrays. And since there are many ways to encode Unicode characters, and, accordingly, many kinds of literals for such strings, it has become common practice to use different literal modifiers (prefixes and suffixes) for the different encoding forms of text strings.

Here are examples of defining the different kinds of strings in C++:

    #include <string>                     // for std::string and the s-suffix literals used below
    using namespace std::string_literals; // enables the "..."s suffix

    // Character literals
    auto c0 =   'A'; // char
    auto c1 = u8'A'; // char before C++20, char8_t in C++20
    auto c2 =  L'A'; // wchar_t
    auto c3 =  u'A'; // char16_t
    auto c4 =  U'A'; // char32_t

    // Multicharacter literals
    auto m0 = 'abcd'; // int, value 0x61626364

    // String literals
    auto s0 =   "hello"; // const char*
    auto s1 = u8"hello"; // const char* before C++20, encoded as UTF-8,
                         // const char8_t* in C++20
    auto s2 =  L"hello"; // const wchar_t*
    auto s3 =  u"hello"; // const char16_t*, encoded as UTF-16
    auto s4 =  U"hello"; // const char32_t*, encoded as UTF-32

    // Raw string literals containing unescaped \ and "
    auto R0 =   R"("Hello \ world")"; // const char*
    auto R1 = u8R"("Hello \ world")"; // const char* before C++20, encoded as UTF-8,
                                      // const char8_t* in C++20
    auto R2 =  LR"("Hello \ world")"; // const wchar_t*
    auto R3 =  uR"("Hello \ world")"; // const char16_t*, encoded as UTF-16
    auto R4 =  UR"("Hello \ world")"; // const char32_t*, encoded as UTF-32

    // Combining string literals with standard s-suffix
    auto S0 =   "hello"s; // std::string
    auto S1 = u8"hello"s; // std::string before C++20, std::u8string in C++20
    auto S2 =  L"hello"s; // std::wstring
    auto S3 =  u"hello"s; // std::u16string
    auto S4 =  U"hello"s; // std::u32string

    // Combining raw string literals with standard s-suffix
    auto S5 =   R"("Hello \ world")"s; // std::string from a raw const char*
    auto S6 = u8R"("Hello \ world")"s; // std::string from a raw const char* before C++20, encoded as UTF-8,
                                       // std::u8string in C++20
    auto S7 =  LR"("Hello \ world")"s; // std::wstring from a raw const wchar_t*
    auto S8 =  uR"("Hello \ world")"s; // std::u16string from a raw const char16_t*, encoded as UTF-16
    auto S9 =  UR"("Hello \ world")"s; // std::u32string from a raw const char32_t*, encoded as UTF-32


  // ASCII smiling face
  const char*     s1 = ":-)";

  // UTF-16 (on Windows) encoded WINKING FACE (U+1F609)
  const wchar_t*  s2 = L"😉 = \U0001F609 is ;-)";

  // UTF-8  encoded SMILING FACE WITH HALO (U+1F607)
  const char*     s3a = u8"😇 = \U0001F607 is O:-)"; // Before C++20
  const char8_t*  s3b = u8"😇 = \U0001F607 is O:-)"; // C++20

  // UTF-16 encoded SMILING FACE WITH OPEN MOUTH (U+1F603)
  const char16_t* s4 = u"😃 = \U0001F603 is :-D";

  // UTF-32 encoded SMILING FACE WITH SUNGLASSES (U+1F60E)
  const char32_t* s5 = U"😎 = \U0001F60E is B-)";

Everyone probably remembers the story about the connection between spaceships and the width of a horse’s croup?

(By the way, the first search result that comes up is a rebuttal of this story.)

About space and horses:


The spacecraft standing at the Kennedy launch site is flanked by two engines five feet wide. The ship’s designers would have liked to make these engines even wider, but could not. Why?

The reason is that these engines were delivered by rail, and the line passes through a narrow tunnel. The gauge is standard, 4 feet 8.5 inches, so the designers could make the engines only 5 feet wide.

The question arises: why is the distance between the rails 4 feet 8.5 inches? Where did this number come from? It turns out that the railroads in the States were built the same way as in England; in England railway carriages were built on the same pattern as trams, and the first trams in England were made in the image and likeness of the horsecar. And the axle length of a horsecar was exactly 4 feet 8.5 inches!

But why? Because horse-drawn carriages were built so that their wheels fell into the ruts on English roads and wore out less, and the distance between the ruts in England is exactly 4 feet 8.5 inches! Why so? Simply because roads in Great Britain were first laid by the Romans, who sized them to fit their war chariots, and the axle length of a standard Roman chariot was … that’s right, 4 feet 8.5 inches!

Well, now we have gotten to the bottom of where this size came from, but still, why did the Romans make their chariots with axles of exactly this length? Here is why: two horses were usually harnessed to such a chariot, and 4 feet 8.5 inches was just the size of two horse asses! Making the chariot axle longer was inconvenient, as it would upset the chariot’s balance.

So here is the answer to the very first question: even now, when humans have gone into space, their highest technical achievements still depend directly on the SIZE OF THE HORSE’S ASS.

So it seems to me that for text strings in programming languages the story likewise starts from its own horse’s ass: the original assumption that a text string and a byte string are the same thing. For UTF-8 that is almost true, but for Unicode strings in general it no longer holds!

But since the syntax for writing text strings in programming languages grew out of that original assumption, and byte strings and character strings are still treated as a single entity, at the moment we have what we have.

The main idea

Since the concept of a “character string” had already taken shape by the time the Unicode era arrived, programming language developers had no choice but to try to adapt to the new reality. Some languages managed this better, some worse, but in most cases the problems of conversion, validation and the other delights of text processing landed on the shoulders of programmers.

Yet it seems to me that the simplest solution would be to add a new data type to the language: a Unicode string that can be accessed only by character, in order to physically separate the two representations of text, a byte array and a sequence of Unicode characters.

Conversion between the two string types would then always be explicit (which removes the problems of handling encoding-conversion errors and validating code points), and the indexing questions would be separated as well: byte strings are indexed by bytes, character strings by characters.
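
A minimal sketch of what such a type might look like in C++ (the class name, the API and the simplified decoder below are purely illustrative assumptions, not an existing library): bytes get in only through an explicit, validating conversion, and indexing works over code points.

    #include <cstddef>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Hypothetical "character-only" string type: stores a sequence of code points
    // and can be built from bytes only through an explicit, validating conversion.
    // (A real implementation would also reject overlong forms and surrogate code points.)
    class unicode_string {
    public:
        // Explicit construction from UTF-8 bytes: decoding and validation happen here.
        static unicode_string from_utf8(const std::string& bytes) {
            unicode_string result;
            for (std::size_t i = 0; i < bytes.size();) {
                unsigned char lead = bytes[i];
                std::size_t len;
                if      (lead < 0x80)         len = 1;   // 0xxxxxxx
                else if ((lead >> 5) == 0x06) len = 2;   // 110xxxxx
                else if ((lead >> 4) == 0x0E) len = 3;   // 1110xxxx
                else if ((lead >> 3) == 0x1E) len = 4;   // 11110xxx
                else throw std::invalid_argument("invalid UTF-8 lead byte");
                if (i + len > bytes.size())
                    throw std::invalid_argument("truncated UTF-8 sequence");

                char32_t cp = (len == 1) ? lead : char32_t(lead & (0x7F >> len));
                for (std::size_t k = 1; k < len; ++k) {
                    unsigned char cont = bytes[i + k];
                    if ((cont & 0xC0) != 0x80)
                        throw std::invalid_argument("invalid UTF-8 continuation byte");
                    cp = (cp << 6) | (cont & 0x3F);
                }
                result.code_points_.push_back(cp);
                i += len;
            }
            return result;
        }

        // Character-oriented access: O(1) indexing by code point, no byte-level view.
        char32_t    operator[](std::size_t i) const { return code_points_[i]; }
        std::size_t length() const                  { return code_points_.size(); }

    private:
        std::vector<char32_t> code_points_;
    };

With such a separation, the explicit conversion is the only place where an invalid byte sequence can surface, and indexing always means “the i-th code point”, never “the i-th byte”.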

What do you think: do programming languages need a separate Unicode string type, or is it superfluous, and byte strings with manual data conversion and all the related problems are enough?
