The development team proposes to switch to UTF-8
The proposal sparked an active discussion, so we decided to look into the situation and review the arguments of IT experts, including IBM engineers and W3C specialists.
Photo – Raphael Schaller – Unsplash
In 1988, Joe Becker introduced the first draft of the Unicode standard. The document assumed that 16 bits would be enough to store any character. It soon became clear that this was not enough, so new encodings appeared, including UTF-8 and UTF-16. However, the variety of formats and the lack of strict recommendations on their use led to confusion in the IT industry, including in terminology.
The internal format of Windows is UTF-16. The authors of the manifesto discussed on Hacker News say that Microsoft at one time used the terms Unicode and widechar as synonyms for UTF-16 and UCS-2 (which is considered the predecessor of UTF-16). In the Linux ecosystem, UTF-8 is the customary choice. This variety of encodings sometimes causes files to be corrupted when they are transferred between computers running different operating systems.
Industry-wide standardization could be a solution: a move to UTF-8 for storing text strings in memory or on disk and for exchanging packets over the network.
Why UTF-8 is considered better than UTF-16
One of the main arguments is that UTF-8 reduces the memory occupied by characters of the Latin alphabet (which many programming languages use). Latin letters, digits, and common punctuation take only one byte each in UTF-8. Moreover, their codes match the ASCII codes, which provides backward compatibility.
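The one-byte claim and the ASCII compatibility are easy to check. The following is a minimal sketch using Python's built-in codecs; the sample string and the variable names are our own illustration, not something taken from the manifesto.

```python
# Characters in the ASCII range: Latin letters, digits, common punctuation.
sample = "Hello, world! 123"

utf8_bytes = sample.encode("utf-8")
ascii_bytes = sample.encode("ascii")
# An explicit byte order ("-le") keeps Python from prepending a byte order mark.
utf16_bytes = sample.encode("utf-16-le")

# For these characters the UTF-8 output is byte-for-byte identical to ASCII.
same_as_ascii = utf8_bytes == ascii_bytes

print("characters:  ", len(sample))
print("UTF-8 bytes: ", len(utf8_bytes))
print("UTF-16 bytes:", len(utf16_bytes))
print("UTF-8 equals ASCII:", same_as_ascii)
```

For text of this kind, UTF-8 needs one byte per character, while UTF-16 needs two.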
Specialists from IBM also say that UTF-8 is better suited for interfacing with systems that do not expect multibyte data. Other Unicode encodings contain numerous null bytes, which utilities may interpret as the end of the file. For example, in UTF-16 the character A looks like this: 00000000 01000001. In a C string, this sequence can be truncated. In UTF-8, a zero byte only ever represents NUL. In this encoding the first letter of the Latin alphabet is represented as 01000001, so there is no problem with unexpected truncation.
For the same reason, engineers from the W3C recommend using UTF-8 when developing external interfaces. This helps avoid difficulties with the operation of network devices.
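The truncation problem can be imitated without writing any C. In the sketch below, the helper `c_string_view` is a hypothetical stand-in for any routine that treats the first zero byte as the end of a string; it is an illustration under that assumption, not the behavior of a specific utility.

```python
def c_string_view(data: bytes) -> bytes:
    """Return the bytes a NUL-terminated reader would see (up to the first 0x00)."""
    return data.split(b"\x00", 1)[0]


def bits(data: bytes) -> str:
    """Format bytes as space-separated groups of eight bits."""
    return " ".join(f"{b:08b}" for b in data)


text = "ABC"
utf16_be = text.encode("utf-16-be")  # big-endian, no byte order mark
utf8 = text.encode("utf-8")

bits_utf16 = bits("A".encode("utf-16-be"))
bits_utf8 = bits("A".encode("utf-8"))

seen_utf16 = c_string_view(utf16_be)
seen_utf8 = c_string_view(utf8)

print("A in UTF-16:", bits_utf16)
print("A in UTF-8: ", bits_utf8)
print("NUL-terminated view of UTF-16:", seen_utf16)
print("NUL-terminated view of UTF-8: ", seen_utf8)
```

With big-endian UTF-16 the very first byte is zero, so such a reader sees an empty string, while the UTF-8 bytes pass through intact.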
Photo – Kristian Strand – Unsplash
A Hacker News user noted that UTF-8 lets you catch encoding errors at an early stage. Bytes are read sequentially, and service bits in the leading byte indicate how many bytes the sequence contains. The code point value is therefore computed unambiguously, and application developers do not need to worry about little-endian versus big-endian byte order.
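To make the role of the service bits concrete, here is a simplified sketch. The function `sequence_length` is our own illustrative helper: it inspects only the prefix bits of the leading byte, whereas a full decoder performs additional checks (for example, it rejects overlong forms). The last part contrasts the two byte orders of UTF-16 with the single byte order of UTF-8.

```python
def sequence_length(lead: int) -> int:
    """Length of a UTF-8 sequence judged only by the prefix bits of its first byte."""
    if lead & 0b1000_0000 == 0b0000_0000:
        return 1  # 0xxxxxxx
    if lead & 0b1110_0000 == 0b1100_0000:
        return 2  # 110xxxxx
    if lead & 0b1111_0000 == 0b1110_0000:
        return 3  # 1110xxxx
    if lead & 0b1111_1000 == 0b1111_0000:
        return 4  # 11110xxx
    # 10xxxxxx is a continuation byte and cannot start a sequence.
    raise ValueError(f"not a valid leading byte: {lead:#04x}")


# Latin A, Greek lambda, the character from this article, and an emoji (U+1F600).
chars = ["A", "\u03bb", "語", "\U0001F600"]
lengths = [sequence_length(ch.encode("utf-8")[0]) for ch in chars]

# A sequence cut off after two of its three bytes is rejected by the strict decoder.
truncated = "語".encode("utf-8")[:2]
try:
    truncated.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

# UTF-16 comes in two byte orders; UTF-8 has only one.
utf16_le = "語".encode("utf-16-le")
utf16_be = "語".encode("utf-16-be")

print(lengths)
print("truncated input rejected:", rejected)
print(utf16_le.hex(), utf16_be.hex())
```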
Where UTF-16 has the advantage
Latin letters and punctuation take up less memory in UTF-8 than in UTF-16. Some code points require the same number of bytes in both encodings; this is true of Greek and Hebrew, for example.
The situation is different for Asian characters, which need more space in UTF-8. For example, the Chinese character 語 is represented by three bytes: 11101000 10101010 10011110. The same character in UTF-16 looks like this: 10001010 10011110.
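The size comparison for the scripts mentioned here can be reproduced with a short script. This is a sketch: the sample characters for Latin, Greek, and Hebrew are our own picks, and UTF-16 is encoded as big-endian without a byte order mark so that the bits match the notation used above.

```python
def bits(data: bytes) -> str:
    """Format bytes as space-separated groups of eight bits."""
    return " ".join(f"{b:08b}" for b in data)


samples = {
    "Latin": "A",
    "Greek": "\u03bb",   # lambda
    "Hebrew": "\u05d0",  # alef
    "Chinese": "語",
}

sizes = {}
for name, ch in samples.items():
    u8 = ch.encode("utf-8")
    u16 = ch.encode("utf-16-be")
    sizes[name] = (len(u8), len(u16))  # (bytes in UTF-8, bytes in UTF-16)
    print(f"{name:8} U+{ord(ch):04X}  UTF-8: {bits(u8)}  UTF-16: {bits(u16)}")
```

For these samples, the Latin letter is shorter in UTF-8, the Greek and Hebrew letters take two bytes in both encodings, and the Chinese character takes three bytes in UTF-8 against two in UTF-16.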
What is the result
The debate over adopting a single encoding has been going on for a long time. The question came up almost eleven years ago in a thread on Stack Overflow, in which Pavel Radzivilovsky, one of the authors of the manifesto, took part. Since then, UTF-8 has become one of the most popular encodings on the Internet. It has also been recognized as mandatory for “all situations” by WHATWG, a community of HTML and API experts that develops standards.
Recently, Microsoft also started recommending UTF-8 for developing web applications. Perhaps in the future this practice will extend to other utilities.