The development team proposes to switch to UTF-8

Two weeks ago, a manifesto by a group of programmers from Tel Aviv was posted on Hacker News. The engineers propose making UTF-8 the default encoding for storing text strings in memory and for exchanging data over the network. They also suggest abandoning UTF-16 in all library APIs, with the exception of specialized Unicode libraries such as ICU.

The Israeli developers' point of view was supported by some HN users, as well as by engineers from IBM and from the W3C, the consortium that develops technology standards for the World Wide Web.

We decided to look into the situation and consider the arguments on both sides.


Photo – Raphael Schaller – Unsplash

Benefits of implementing UTF-8

Backward compatibility. ASCII characters are represented in UTF-8 by a single byte whose value is identical to the ASCII code, which simplifies interaction with old software. In addition, UTF-8 text contains no zero bytes (unlike UTF-16) other than the encoding of the null character U+0000 itself, so it can be stored safely in null-terminated strings.

This avoids errors in which a utility interprets a zero byte as the end of the data. For instance, if UTF-16 data is passed in a C string, it may be truncated at the very first ASCII character, because the zero byte that pads that character is treated as the string terminator.
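The two points above can be checked with a short Python sketch. It is only an illustration: the sample word is our own choice, and `ctypes.c_char_p` is used here merely to mimic how C code reads a buffer up to the first zero byte.

```python
import ctypes

text = "tree"

# ASCII text is byte-for-byte identical in ASCII and in UTF-8.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# UTF-16 (little-endian, no BOM) pads every ASCII character with a zero byte.
utf16_bytes = text.encode("utf-16-le")

# c_char_p.value reads the buffer the way C string functions do:
# it stops at the first zero byte.
seen_by_c_utf8 = ctypes.c_char_p(utf8_bytes).value
seen_by_c_utf16 = ctypes.c_char_p(utf16_bytes).value

print(ascii_bytes == utf8_bytes)  # same bytes in both encodings
print(utf16_bytes)                # zero bytes between the letters
print(seen_by_c_utf8)             # the whole word survives
print(seen_by_c_utf16)            # cut off after the first letter
```

The UTF-8 buffer reaches the C side intact, while the UTF-16 buffer is cut after its first byte.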

W3C experts also recommend using UTF-8 when developing front-end interfaces. According to representatives of the organization, other encodings can cause problems in network devices when ASCII bytes are transmitted.

Integrity check. A Hacker News user noted that the 8-bit Unicode Transformation Format makes it possible to catch encoding errors at an early stage. In UTF-8, the input stream is read byte by byte and interpreted sequentially, and the code point value is computed unambiguously. As a result, application developers do not need to worry about byte order (little-endian or big-endian).

Interestingly, the question of UTF-8 was raised on Stack Overflow ten years ago. At that time Pavel Radzivilovsky, one of the authors of the manifesto, noted that his company, which develops photogrammetry software, had switched entirely to this encoding. One of the reasons was the simplicity of interacting with code that passes strings as char *.
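A minimal Python sketch of both properties follows. The helper `is_valid_utf8` and the sample strings are our own illustrative assumptions, not something taken from the manifesto or the HN thread.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


word = "caf\u00e9"  # "café"

# Bytes produced by a legacy single-byte encoding are rejected right away.
legacy_bytes = word.encode("latin-1")
print(is_valid_utf8(legacy_bytes))

# Correctly encoded text passes the check.
print(is_valid_utf8(word.encode("utf-8")))

# UTF-16 has two byte orders for the same text; UTF-8 has a single form.
utf16_le = "A".encode("utf-16-le")
utf16_be = "A".encode("utf-16-be")
print(utf16_le, utf16_be)
```

A strict decoder reports the mismatch as soon as it meets the malformed sequence, instead of silently producing wrong characters.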

Modest memory savings. Most Unicode code points occupy the same number of bytes in UTF-8 and UTF-16. This is the case, for example, for Russian, Greek, and Hebrew letters, which take two bytes in either encoding. Latin letters, punctuation, and other ASCII characters, however, need less memory in UTF-8.
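The sizes are easy to compare in Python. The sketch below uses the word for "tree" in each language as sample text (our own choice) and a hypothetical helper `encoded_sizes`; UTF-16 is encoded without a byte order mark so that only the characters themselves are counted.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (bytes in UTF-8, bytes in UTF-16 without a BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = {
    "Russian": "дерево",
    "Greek": "\u03b4\u03ad\u03bd\u03c4\u03c1\u03bf",  # Greek word for "tree"
    "Hebrew": "\u05e2\u05e5",                          # Hebrew word for "tree"
    "Latin": "tree",
}

sizes = {language: encoded_sizes(text) for language, text in samples.items()}

for language, (utf8_size, utf16_size) in sizes.items():
    print(f"{language}: UTF-8 = {utf8_size} bytes, UTF-16 = {utf16_size} bytes")
```

For the Cyrillic, Greek, and Hebrew samples the two numbers match; only the Latin sample is smaller in UTF-8.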

But there are arguments against it

Some Hacker News users called the idea of a “universal” transition to UTF-8 a mistake. Representing public names (for example, folder paths) in this encoding can create cybersecurity risks: it allows identifiers such as file names that look almost identical to one another to be created, which opens up new phishing attack vectors.
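The look-alike problem can be illustrated in a few lines of Python. The file names and the helper `describe_non_ascii` are hypothetical examples of ours; the point is only that two names can render the same way while consisting of different code points.

```python
import unicodedata

latin_name = "report.txt"
# Looks the same on screen, but the "o" is a Cyrillic letter (U+043E).
lookalike_name = "rep\u043ert.txt"


def describe_non_ascii(name: str) -> list[tuple[str, str]]:
    """List every non-ASCII character in a name with its Unicode name."""
    return [(ch, unicodedata.name(ch)) for ch in name if ord(ch) > 127]


print(latin_name == lookalike_name)        # the names are different strings
print(describe_non_ascii(lookalike_name))  # reveals the Cyrillic letter
print(len(latin_name.encode("utf-8")), len(lookalike_name.encode("utf-8")))
```

A check of this kind, applied to user-supplied names, is one possible way to flag mixed-script identifiers before they are displayed.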


Photo – Gemma Evans – Unsplash

It is also argued that memory savings are not achieved for all languages. In particular, Asian characters require more space in UTF-8. However, engineers from IBM say that even in this case the increase in text file size compared with UTF-16 is insignificant. For example, the Japanese ideogram for “tree” takes 3 bytes in UTF-8, whereas encoding the English word tree takes 4 bytes.

In general, the discussion about adopting a single encoding has been going on for a long time. For an encoding to become widespread, large IT companies will have to take the first steps in this direction. There is already progress: Microsoft recently recommended using UTF-8 when developing web applications.
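The byte counts from this example can be reproduced in Python. We assume here that the ideogram in question is 木 (U+6728); UTF-16 is again encoded without a byte order mark.

```python
ideogram = "\u6728"  # 木, the character for "tree"
word = "tree"

# The ideogram costs one byte more in UTF-8 than in UTF-16.
ideogram_utf8 = len(ideogram.encode("utf-8"))
ideogram_utf16 = len(ideogram.encode("utf-16-le"))

# The English word is half the size in UTF-8.
word_utf8 = len(word.encode("utf-8"))
word_utf16 = len(word.encode("utf-16-le"))

print(f"ideogram: UTF-8 = {ideogram_utf8}, UTF-16 = {ideogram_utf16}")
print(f"word:     UTF-8 = {word_utf8}, UTF-16 = {word_utf16}")
```

The ideogram carries the meaning of a whole word in three bytes, which is why the overhead for Asian text tends to be smaller than a per-character comparison suggests.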

