Emoji under the hood
Over the past few weeks, Nikita Prokopov implemented emoji support for Skija. He decided to share a few details of how this “greatest innovation in human communication since the invention of the letter” works under the hood.
Translator's note: Habr does not support emoji, so I had to work around that and replace the emoji with pictures.
Unicode
Each character on a computer is encoded with a number. The most popular character set is Unicode, and its two most common encodings are UTF-8 and UTF-16.
Unicode allocates 2^21 (about 2 million) values called “codepoints”, although strictly only U+0000..U+10FFFF (about 1.1 million) are valid. Of these, only ~150k are currently defined. All languages, living and dead, plus assorted symbols and decorations, were crammed into these 150,000 characters. There are letters that look like different fonts, letters written backwards and upside down: and even “GHz” as a single glyph: …
A rightwards two-headed arrow with a tail and a double vertical stroke: or a seven-eyed monster: … And a duck:
Take a look at the block of Egyptian hieroglyphs (U+13000–U+1342F), there are many interesting things in it:
Basic emoji
Emoji are just Unicode characters, located here: U+1F300-1F6FF, and here: U+1F900-1FAFF:
Emoji behave like ordinary letters: you can do everything with them that you can do with letters (translator's note: just not on Habr!). When you type “A”, the computer sees U+0041. When you type a cactus emoji, the computer sees U+1F335.
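To see this for yourself, here is a minimal Python sketch (Python is my choice for illustration; the article itself does not use it). It prints the numbers behind “A” and the cactus, and shows how UTF-8 and UTF-16 store the same codepoint differently. The helper name codepoint is made up for this example.

```python
# A letter and an emoji are both just numbers (codepoints).
letter = "A"
cactus = "\U0001F335"  # U+1F335 CACTUS, written as an escape so the source stays ASCII

def codepoint(ch):
    """Format a single character the way Unicode charts do, e.g. U+0041."""
    return f"U+{ord(ch):04X}"

print(codepoint(letter))  # U+0041
print(codepoint(cactus))  # U+1F335

# The same codepoint is stored differently by different encodings.
print(cactus.encode("utf-8").hex())      # f09f8cb5 (four bytes)
print(cactus.encode("utf-16-be").hex())  # d83cdf35 (a surrogate pair)
```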
Emoji are fonts
Why are they displayed as pictures? Because of bitmap fonts: a font can ship colorful PNG images for its glyphs instead of boring black-and-white vectors.
Each OS comes with a pre-installed emoji font: Apple Color Emoji on macOS and iOS, Segoe UI Emoji on Windows, Noto Color Emoji on Android.
Emojis, like fonts, look different on different devices. Some applications have their own emoji: WhatsApp, Twitter, Facebook.
Fallback fonts
You write text in some font, so how do emoji fit in? And why does Russian text look poor in Clubhouse or on Medium?
Say you type the character U+1F419 (an octopus) and your font is, for example, San Francisco. San Francisco does not have a glyph for U+1F419, so the OS starts looking for another font that does.
U+1F419 is only available in Apple Color Emoji. So you see this:…
This is why emoji look the same whichever font you use.
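The fallback lookup can be sketched roughly like this. It is a toy model under loud assumptions: the coverage sets below are invented for the example (real fonts expose their coverage through a character map table), and pick_font is a hypothetical helper, not an OS API.

```python
# Hypothetical coverage tables, invented for this example only.
FONTS = [
    ("San Francisco", {0x0041, 0x0042, 0x0043}),
    ("Apple Color Emoji", {0x1F419, 0x1F335}),
]

def pick_font(ch, fonts=FONTS):
    """Return the name of the first font in the list that has a glyph for ch."""
    for name, covered in fonts:
        if ord(ch) in covered:
            return name
    return None  # no font covers it: a renderer would draw a placeholder box

print(pick_font("A"))           # San Francisco
print(pick_font("\U0001F419"))  # Apple Color Emoji
```

The order of the list matters: the font you asked for is tried first, and the emoji font is only consulted when it has no glyph.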
Variation selector-16
Some emoji started out as plain icons back in 1993, in the Miscellaneous Symbols (U+2600-26FF) and Dingbats (U+2700-27FF) blocks:
These glyphs are black and white, just like letters. Many fonts have their own version of, say, U+2702 BLACK SCISSORS:
Apple Color Emoji has its own version:
How does the OS know whether to display the text glyph or the emoji glyph if both have the same codepoint U+2702?
Meet U+FE0F, also known as VARIATION SELECTOR-16. It is a hint for the text renderer to switch to the emoji presentation.
Simple, elegant, and no need to allocate new codepoints. The two forms have the same meaning but a slightly different presentation style.
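In code, the emoji form is simply the same character with one extra codepoint appended. A small Python sketch (the variable names are mine):

```python
import unicodedata

scissors = "\u2702"                    # the plain codepoint
emoji_scissors = scissors + "\uFE0F"   # same codepoint plus the selector

print(unicodedata.name(scissors))          # BLACK SCISSORS
print(unicodedata.name("\uFE0F"))          # VARIATION SELECTOR-16
print(len(scissors), len(emoji_scissors))  # 1 2
```

Whether the second string is actually drawn in color depends on the renderer and the installed fonts; the selector is only a request.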
Grapheme clusters
Here we run into another problem: our emoji is now not one codepoint but two. This means we need a way to define character boundaries.
Grapheme clusters will help us. A grapheme cluster is a sequence of codepoints that a human reads as a single character.
Grapheme clusters were not invented just for emoji; they apply to regular alphabets as well. Ü is a single grapheme cluster even when it consists of two codepoints: U+0055 LATIN CAPITAL LETTER U followed by U+0308 COMBINING DIAERESIS.
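A quick Python check of the two-codepoint Ü. Note that Python's len counts codepoints, and that Unicode also has a precomposed one-codepoint form (U+00DC) that normalization maps to:

```python
import unicodedata

decomposed = "U\u0308"   # U+0055 followed by U+0308 COMBINING DIAERESIS
precomposed = "\u00DC"   # U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS

print(len(decomposed), len(precomposed))   # 2 1
print(decomposed == precomposed)           # False: different codepoints
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Both strings look identical on screen and both are one grapheme cluster, yet a naive comparison says they differ.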
Grapheme clusters create a lot of complexity for programmers. You can't just call substring(0, 10) to take the first 10 characters: you might split an emoji in half.
Reversing a string has to be done carefully too. U+263A U+FE0F makes sense, but U+FE0F U+263A doesn't.
Finally, you can't just call .length on a string. Well, you can, but the result will surprise you. If you are a developer, try it in your browser console.
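If you don't have a browser console at hand, the same three traps can be reproduced in Python. This is only an illustration: Python's len counts codepoints, so the hypothetical helper utf16_length is added to mimic what JavaScript's .length reports (UTF-16 code units).

```python
def utf16_length(text):
    """Number of UTF-16 code units, which is what JavaScript's .length counts."""
    return len(text.encode("utf-16-le")) // 2

smiley = "\u263A\uFE0F"        # U+263A followed by VARIATION SELECTOR-16
wave = "\U0001F44B\U0001F3FD"  # waving hand followed by a skin tone modifier

# Length: one visible character, but two codepoints and four UTF-16 units.
print(len(wave), utf16_length(wave))   # 2 4

# Substring: taking "the first character" silently drops the skin tone.
print(wave[:1] == "\U0001F44B")        # True

# Reverse: the selector now comes before the character it was modifying.
print(smiley[::-1] == "\uFE0F\u263A")  # True
```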
Programmer tip: if you work with text, get a library that understands grapheme clusters. For C, C++ and the JVM, that could be ICU. Swift does everything right by default. For other languages, look for a library yourself.
This thing has a length of 65 and cannot be split. Live with it.
Skin Tone Modifier
Most human emoji depict an abstract yellow person. When skin tone was added in 2015, instead of adding a new codepoint for each combination of emoji and skin tone, only five new codepoints were added: U+1F3FB..U+1F3FF.
They are not meant to be used on their own but appended to existing emoji, with which they form a ligature. Take (U+1F44B WAVING HAND SIGN), follow it with (U+1F3FD MEDIUM SKIN TONE MODIFIER), and we get:
The result does not have its own codepoint (it is a sequence of two: U+1F44B U+1F3FD), but it has its own unique look. In total, five modifiers turned ~280 human emoji into 1680 variations (280 × 6, counting the unmodified yellow version). Here are some dancers:
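A short Python sketch of how the variants are built and where the 1680 figure comes from. The function name with_skin_tones is made up for this example.

```python
WAVING_HAND = "\U0001F44B"
# The five modifiers U+1F3FB..U+1F3FF.
SKIN_TONES = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]

def with_skin_tones(base):
    """Return the unmodified emoji followed by its five skin tone variants."""
    return [base] + [base + tone for tone in SKIN_TONES]

variants = with_skin_tones(WAVING_HAND)
print(len(variants))               # 6
print([len(v) for v in variants])  # [1, 2, 2, 2, 2, 2] codepoints each
print(280 * len(variants))         # 1680
```

The function appends a modifier to anything you pass in; only emoji that the font designers drew with skin tones will actually merge into one picture.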
Zero-width Joiner
Let's say your friend just sent you a photo of an apple she is growing in her garden. You need to reply, but how? You can send U+1F469 WOMAN followed by U+1F33E EAR OF RICE (a sheaf of rice). On their own they stay two separate pictures: but if you stick U+200D between them, you get a farmer:
U+200D is called ZERO WIDTH JOINER, or ZWJ for short. It works similarly to what we saw with skin tone, but this time it combines two self-contained emoji into one. Not all combinations work, but many do, sometimes in surprising ways!
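Here is the farmer built by hand in Python. The constant names are mine; the codepoints are the ones given above.

```python
ZWJ = "\u200D"
WOMAN = "\U0001F469"
EAR_OF_RICE = "\U0001F33E"

two_emoji = WOMAN + EAR_OF_RICE          # stays two separate pictures
farmer = ZWJ.join([WOMAN, EAR_OF_RICE])  # one picture, where the font supports it

print([f"U+{ord(ch):04X}" for ch in farmer])
# ['U+1F469', 'U+200D', 'U+1F33E']

# Splitting on ZWJ recovers the self-contained parts.
print(farmer.split(ZWJ) == [WOMAN, EAR_OF_RICE])  # True
```

On a system whose emoji font lacks the combined glyph, the sequence falls back to showing its parts side by side.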
Some examples:
One weird inconsistency I noticed is that hair color is done through ZWJ, while skin tone is just an emoji modifier without ZWJ. Why? I have no idea.
Unfortunately, some emojis are not implemented as combinations with ZWJ. I consider these missed opportunities:
How do you type ZWJ? You can't. But you can copy it from here: “”. Note: this is a special character, so expect it to behave strangely. You can't see it, but it is there. (translator's note: the original article has it here, but Habr does not allow it)
Another big area where ZWJ shines is the configuration of families and relationships. Here’s a short story to illustrate:
Flags
Country flags are part of the Unicode standard, but for some reason they are not implemented on Windows. If you are reading this in a browser on Windows, sorry!
Flags do not have dedicated codepoints. Instead, they are two-letter ligatures.
Left – Windows, right – Mac
They don't use regular letters, though. Instead, a separate alphabet of “regional indicator symbol letters” (U+1F1E6..1F1FF) is used. These letters serve no purpose other than composing flags.
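Because the regional indicators are laid out in the same order as A to Z, turning a two-letter country code into a flag is a small offset calculation. A minimal Python sketch, with a made-up helper name flag:

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A

def flag(country_code):
    """Map a two-letter code such as "DE" to its pair of regional indicators."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected two ASCII letters")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)

print([f"U+{ord(ch):04X}" for ch in flag("DE")])  # ['U+1F1E9', 'U+1F1EA']
```

The function does not check that the code belongs to a real country, so it will happily produce pairs that no font draws as a flag.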
What happens if you put two random letters together? Not much: (except that text editing starts to behave strangely).
If you want to experiment, feel free to copy and combine from this alphabet:
There are 258 valid two-letter combinations. Can you find them all?
A fun side effect of the two-letter ligature:
Tag sequences
Two-letter ligatures are cool, but don't you want to be cooler? How about 32-letter ligatures? Meet tag sequences.
A tag sequence is a regular emoji followed by yet another variant of Latin letters (U+E0020..E007E) and terminated by U+E007F CANCEL TAG.
They are currently used only for three flags: England, Scotland and Wales:
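A Python sketch of how such a sequence is assembled. Two things here are not from the article and are my additions: the base emoji U+1F3F4 (a black flag) and the tag text "gbeng" for England, both taken from Unicode's emoji data. The helper name tag_sequence is hypothetical.

```python
BLACK_FLAG = "\U0001F3F4"  # base emoji (not named in the article, see lead-in)
CANCEL_TAG = "\U000E007F"

def tag_sequence(base, tag):
    """Append ASCII text as invisible tag characters, then add CANCEL TAG."""
    if not all(0x20 <= ord(c) <= 0x7E for c in tag):
        raise ValueError("tags must be printable ASCII")
    # Each tag character is its ASCII counterpart shifted up by 0xE0000.
    return base + "".join(chr(0xE0000 + ord(c)) for c in tag) + CANCEL_TAG

england = tag_sequence(BLACK_FLAG, "gbeng")
print(len(england))                   # 7 codepoints, one visible flag
print(f"U+{ord(england[1]):04X}")     # U+E0067, the tag version of "g"
```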
Keycaps
Not super exciting, but necessary for completeness: keycap sequences use yet another convention.
It works like this: take a digit, * or #, turn it into an emoji with U+FE0F, and wrap it in a square with U+20E3 COMBINING ENCLOSING KEYCAP.
There are 12 of them:
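All twelve can be generated in one line of Python (constant names are mine):

```python
VS16 = "\uFE0F"
ENCLOSING_KEYCAP = "\u20E3"

# Ten digits plus * and #, each followed by the selector and the keycap.
keycaps = [base + VS16 + ENCLOSING_KEYCAP for base in "0123456789*#"]

print(len(keycaps))                       # 12
print(all(len(k) == 3 for k in keycaps))  # True: three codepoints each
```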
Unicode updates
Unicode is updated every year and emoji are a core part of every release. For example, in Unicode 13 (March 2020) 55 new emojis were added.
At the time of writing, neither the latest macOS (11.2.3) nor iOS (14.4.1) supports emoji such as these from Unicode 13:
Here’s what I see in March 2021:
But thanks to the magic of ZWJ, I can still figure out what’s going on, just not in the most optimal way.
Conclusion
To summarize, there are seven ways to encode emoji:
- Single codepoint
- Single codepoint + variation selector-16
- Skin tone modifier
- Zero-width joiner sequence
- Flags
- Tag sequence
- Keycap sequence
The first four methods can be combined to build a rather complex message:
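As one possible illustration (my own example, not necessarily the one pictured in the original), here is a single emoji that uses all four at once: a man, a skin tone modifier, a ZWJ, and a medical symbol switched to emoji presentation with VARIATION SELECTOR-16. Together they form a health worker.

```python
MAN = "\U0001F468"               # single codepoint
MEDIUM_SKIN_TONE = "\U0001F3FD"  # skin tone modifier
ZWJ = "\u200D"                   # zero-width joiner
STAFF_OF_AESCULAPIUS = "\u2695"  # medical symbol, a text glyph by default
VS16 = "\uFE0F"                  # switches the symbol to emoji presentation

health_worker = MAN + MEDIUM_SKIN_TONE + ZWJ + STAFF_OF_AESCULAPIUS + VS16

print(" ".join(f"U+{ord(ch):04X}" for ch in health_worker))
# U+1F468 U+1F3FD U+200D U+2695 U+FE0F
print(len(health_worker))  # 5 codepoints, one visible character
```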
If you are a programmer, remember to always use the ICU library for:
- extracting substrings
- measuring string length
- reversing strings
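To make the idea concrete, here is a deliberately simplified cluster splitter in Python. It is a teaching sketch only: it covers the emoji mechanisms described in this article plus basic combining marks, it does not implement the real Unicode segmentation rules (Indic and Korean text, for example, are not handled), and all function names are made up. For real work, use ICU or an equivalent library.

```python
import unicodedata

ZWJ = "\u200D"

def is_regional(ch):
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def is_extender(ch):
    """Codepoints that attach to whatever comes before them."""
    cp = ord(ch)
    return (
        unicodedata.category(ch) in ("Mn", "Me")  # U+0308, U+FE0F, U+20E3, ...
        or 0x1F3FB <= cp <= 0x1F3FF               # skin tone modifiers
        or 0xE0020 <= cp <= 0xE007F               # tag characters and CANCEL TAG
    )

def simplified_clusters(text):
    """Split text into approximate grapheme clusters (simplified rules)."""
    clusters = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        joins = bool(prev) and (
            is_extender(ch)
            or ch == ZWJ
            or prev.endswith(ZWJ)
            # A flag is exactly two regional indicators.
            or (is_regional(ch) and len(prev) == 1 and is_regional(prev))
        )
        if joins:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

text = (
    "A"
    + "\U0001F44B\U0001F3FD"        # waving hand + skin tone
    + "\U0001F469\u200D\U0001F33E"  # woman + ZWJ + ear of rice
    + "\U0001F1E9\U0001F1EA"        # two regional indicators
    + "U\u0308"                     # U + combining diaeresis
)
clusters = simplified_clusters(text)
print(len(text), len(clusters))               # 10 5
first_two = "".join(clusters[:2])             # substring by clusters
reversed_text = "".join(reversed(clusters))   # reverse by clusters
```

Substring, length and reverse all operate on the list of clusters rather than on individual codepoints, which is the same approach a real library takes with a far more complete rule set.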
The keyword to google is “grapheme cluster”. This applies to emoji, Western diacritics, and Indic and Korean scripts, so please be careful.