Emoji under the hood

image

Over the past few weeks, Nikita Prokopov implemented emoji support for Skija. He decided to share a few small details of how “the greatest innovation in human communication since the invention of the letter” works under the hood.

Translator’s note: Habr does not support emoji, so I had to work around it and replace the emoji with pictures.

Unicode

Each character on a computer is encoded with a number. The most popular encoding is Unicode, and the two most common subvariants are UTF-8 and UTF-16.
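As a small illustration of the difference between a character's number and its encodings, the sketch below measures how many UTF-8 bytes and UTF-16 code units one character takes. The helper name encoded_sizes is made up for this example.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 code unit count) for a string."""
    utf8_bytes = len(ch.encode("utf-8"))
    # "utf-16-le" avoids the byte order mark; every code unit is 2 bytes.
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units


# Characters are written as escapes so the example survives emoji-hostile tools.
for label, ch in [("Latin A", "A"),
                  ("Cyrillic Ya", "\u042f"),
                  ("an emoji", "\U0001F335")]:
    print(label, hex(ord(ch)), encoded_sizes(ch))
```

The number stays the same in every encoding; only the byte representation changes.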

Unicode allocates 2^21 (about 2 million) values for characters, called “codepoints” (strictly speaking, the range ends at U+10FFFF, which is what needs 21 bits). Of these, only ~150k characters are currently defined. All languages, living and dead, plus assorted decorations, were crammed into these 150,000 symbols. You can use different fonts, write backwards and upside down: image and also display “GHz” as a single glyph: image

There is a rightwards two-headed arrow with a tail and a double vertical stroke: image, a seven-eyed monster: image, and a duck:

image

Take a look at the Egyptian hieroglyphs block (U+13000-U+1342F); there are many interesting things in there:

image

Basic emoji

Emoji are just Unicode characters. Most of them live in the ranges U+1F300-1F6FF and U+1F900-1FAFF:

image

Emoji behave like ordinary letters, and you can do everything with them that you can do with letters (translator’s note: just not on Habr!). When you type “A”, the computer sees U+0041. When you type a cactus emoji (image), the computer sees U+1F335.
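A quick way to check this yourself is a sketch like the one below. The helper codepoint_label is a name invented for this example, and the emoji is written as an escape sequence.

```python
def codepoint_label(ch: str) -> str:
    """Format a single character as U+XXXX."""
    return f"U+{ord(ch):04X}"


cactus = "\U0001F335"

print(codepoint_label("A"))      # the letter A
print(codepoint_label(cactus))   # the cactus emoji

# Ordinary string operations work on emoji just as they do on letters.
text = "I " + cactus * 3
print(cactus in text, text.count(cactus), text.index(cactus))
```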

Emoji are fonts

Why are they displayed as pictures? Bitmap fonts. A font can ship colorful PNG images as glyphs instead of boring black-and-white vectors.

image

Each OS comes with a pre-installed emoji font. On macOS and iOS it is Apple Color Emoji, on Windows it is Segoe UI Emoji, and on Android it is Noto Color Emoji.

Because emoji are fonts, they look different on different devices. Some applications ship their own emoji: WhatsApp, Twitter, Facebook.

image

Fallback fonts

You write text in some font, so how do emoji fit in? And why does Russian text look poor in Clubhouse or on Medium?

image

Say you type the character U+1F419 (an octopus), and your font is, for example, San Francisco. San Francisco does not have a glyph for U+1F419, so your OS starts looking for another font that does.

U+1F419 is only available in Apple Color Emoji, so this is what you see: image
This is also why emoji look the same whichever font you use.
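The fallback search can be pictured roughly as below. This is only a sketch: the coverage sets are invented for illustration (real fonts expose coverage through their character map tables), and pick_font is a hypothetical helper, not an OS API.

```python
from typing import List, Optional

# Hypothetical coverage tables: font name -> code points it has glyphs for.
FONTS = {
    "San Francisco": set(range(0x20, 0x250)),         # made-up Latin coverage
    "Apple Color Emoji": {0x1F419, 0x1F335, 0x2702},  # made-up emoji coverage
}


def pick_font(codepoint: int, preferred: str, fallbacks: List[str]) -> Optional[str]:
    """Return the first font in the chain that has a glyph for the code point."""
    for name in [preferred] + fallbacks:
        if codepoint in FONTS.get(name, set()):
            return name
    return None  # no font has it; a renderer would typically draw a placeholder box


chain = ["Apple Color Emoji"]
print(pick_font(ord("A"), "San Francisco", chain))
print(pick_font(0x1F419, "San Francisco", chain))
```

The same idea explains mixed-looking Russian text: if the chosen font lacks Cyrillic glyphs, those letters come from a different fallback font.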

image

Variation selector-16

Some emoji originated as icons back in 1993, in the Miscellaneous Symbols (U+2600-26FF) and Dingbats (U+2700-27FF) blocks:

image

These glyphs are black and white, just like letters. Many fonts have their own version of U+2702 BLACK SCISSORS:

image

Apple Color Emoji has its own version:

image

How does the OS know whether to display the black-and-white glyph or the color one if they share the same codepoint, U+2702?

Meet U+FE0F, also known as VARIATION SELECTOR-16. This is a hint for the text renderer to switch to emoji.

image

Simple, elegant, and no new codepoints need to be allocated. The two forms have the same meaning but a slightly different style.
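In code, asking for the emoji style is just a matter of appending one codepoint. A minimal sketch, where as_emoji is a helper name made up for this example:

```python
import unicodedata

VS16 = "\ufe0f"  # VARIATION SELECTOR-16


def as_emoji(ch: str) -> str:
    """Append VS16 to request emoji presentation (no-op if already present)."""
    return ch if ch.endswith(VS16) else ch + VS16


scissors = "\u2702"
emoji_scissors = as_emoji(scissors)

print(unicodedata.name(scissors))
print(unicodedata.name(VS16))
print([f"U+{ord(c):04X}" for c in emoji_scissors])
```

Whether the renderer actually honors the hint depends on the fonts installed.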

Grapheme clusters

This brings us to another problem: our emoji is now two codepoints, not one. This means we need a way to define the boundaries of a character.

Grapheme clusters help here. A grapheme cluster is a sequence of codepoints that a reader perceives as a single character.

Grapheme clusters were not invented just for emoji; they apply to regular alphabets as well. U with a diaeresis is a single grapheme cluster even when it consists of two codepoints: U+0055 LATIN CAPITAL LETTER U followed by U+0308 COMBINING DIAERESIS.
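The two-codepoint form and the single precomposed codepoint can be compared with the standard unicodedata module. A short sketch:

```python
import unicodedata

decomposed = "U\u0308"  # U+0055 followed by U+0308 COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)  # precomposed U+00DC

print(len(decomposed), len(composed))    # code point counts differ
print(decomposed == composed)            # so a naive comparison fails
print(unicodedata.name(decomposed[1]))
print(unicodedata.normalize("NFD", composed) == decomposed)
```

Both strings render as one character, which is exactly what a grapheme cluster captures.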

Grapheme clusters add a lot of complexity for programmers. You can’t just call substring(0, 10) to take the first 10 characters, because you might split an emoji in half.

Reversing a string also has to be done carefully: U+263A U+FE0F makes sense, but U+FE0F U+263A doesn’t.
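Both pitfalls are easy to reproduce with codepoint-based slicing and reversal, shown here in Python as an illustration:

```python
smile = "\u263a\ufe0f"  # U+263A WHITE SMILING FACE + U+FE0F VARIATION SELECTOR-16
text = "ok" + smile

# Taking "the first 3 characters" by code point cuts the emoji in half:
cut = text[:3]
print([f"U+{ord(c):04X}" for c in cut])             # the U+FE0F is gone

# Naive reversal puts the selector before the character it should modify:
reversed_naive = text[::-1]
print([f"U+{ord(c):04X}" for c in reversed_naive])  # starts with a dangling U+FE0F
```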

image

Finally, you cannot just call .length on a string. Well, you can, but the result will surprise you. If you are a developer, try calling .length on an emoji string (image) in your browser console.
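To see why the number is surprising, the sketch below counts one emoji in three different units. The emoji (a man facepalming with a skin tone) is my own choice of example, since the one pictured in the article did not survive; lengths is a made-up helper name.

```python
def lengths(s: str) -> dict:
    """Count a string in code points, UTF-16 code units, and UTF-8 bytes."""
    return {
        "code points": len(s),
        "utf-16 units": len(s.encode("utf-16-le")) // 2,  # what JavaScript's .length counts
        "utf-8 bytes": len(s.encode("utf-8")),
    }


# FACE PALM + skin tone modifier + ZWJ + MALE SIGN + VS16: one visible character.
facepalm = "\U0001F926\U0001F3FC\u200d\u2642\ufe0f"
print(lengths(facepalm))
```

None of the three numbers is 1, which is what a person looking at the screen would answer.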

Programmer tip: if you work with text, get a library that understands grapheme clusters. For C, C++, and the JVM, that would be ICU. Swift does everything right by default. For other languages, you will have to find one yourself.
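To give a feel for what such a library does, here is a deliberately simplified splitter. It is not the real Unicode segmentation algorithm (UAX #29) and should not replace a proper library; it only covers combining marks, variation selectors, skin tone modifiers, tag characters, ZWJ, and pairs of regional indicators. All function names are invented for this sketch.

```python
import unicodedata

ZWJ = "\u200d"


def _is_regional_indicator(ch: str) -> bool:
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def _extends_previous(ch: str) -> bool:
    """True if ch attaches to the character before it (simplified rules)."""
    cp = ord(ch)
    return (
        unicodedata.category(ch) in ("Mn", "Me", "Mc")  # combining marks, VS16, U+20E3
        or 0x1F3FB <= cp <= 0x1F3FF                     # skin tone modifiers
        or 0xE0020 <= cp <= 0xE007F                     # tag characters, CANCEL TAG
        or ch == ZWJ
    )


def clusters(text: str) -> list:
    """Split text into approximate grapheme clusters."""
    result = []
    for ch in text:
        if not result:
            result.append(ch)
        elif _extends_previous(ch) or result[-1].endswith(ZWJ):
            result[-1] += ch  # simplification: anything after a ZWJ is glued on
        elif (_is_regional_indicator(ch) and len(result[-1]) == 1
              and _is_regional_indicator(result[-1])):
            result[-1] += ch  # second half of a flag
        else:
            result.append(ch)
    return result


def take(text: str, n: int) -> str:
    """First n clusters, instead of the first n code points."""
    return "".join(clusters(text)[:n])


def reverse_clusters(text: str) -> str:
    """Reverse cluster by cluster, keeping each cluster intact."""
    return "".join(reversed(clusters(text)))


print(len(clusters("\U0001F926\U0001F3FC\u200d\u2642\ufe0f")))
print(reverse_clusters("ok\u263a\ufe0f") == "\u263a\ufe0fko")
```

With this in place, substring, length, and reverse all operate on what the reader sees rather than on raw codepoints.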

image

This thing has a length of 65 and cannot be split. Live with it.

Skin Tone Modifier

Most human emoji depict an abstract yellow person. When skin tones were added in 2015, instead of adding a new codepoint for each combination of emoji and skin tone, only five new codepoints were added: U+1F3FB..U+1F3FF.

They are not meant to be used on their own; they are appended to existing emoji, and together the two form a ligature. Take a waving hand (U+1F44B WAVING HAND SIGN), follow it with U+1F3FD (the medium skin tone modifier), and you get image

The result does not have its own codepoint (it is a sequence of two: U+1F44B U+1F3FD), but it has its own unique look. In total, five modifiers turned ~280 human emoji into 1680 variations (280 × 6, counting the default yellow). Here are some dancers:
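Building these sequences by hand is straightforward. A minimal sketch, with with_skin_tone as a made-up helper name:

```python
# The five skin tone modifiers, U+1F3FB..U+1F3FF.
SKIN_TONES = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]


def with_skin_tone(emoji: str, tone_index: int) -> str:
    """Append one of the five modifiers (index 0..4) to a base emoji."""
    return emoji + SKIN_TONES[tone_index]


wave = "\U0001F44B"                 # WAVING HAND SIGN
medium = with_skin_tone(wave, 2)    # index 2 is U+1F3FD
print([f"U+{ord(c):04X}" for c in medium])

# The default form plus five tones gives six variants per emoji.
variants = [wave] + [wave + tone for tone in SKIN_TONES]
print(len(variants), 280 * len(variants))
```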

image

Zero-width Joiner

Let’s say your friend just sent you a photo of an apple she is growing in her garden. You need to reply, but how? You can send image WOMAN (U+1F469) followed by image SHEAF OF RICE (U+1F33E). Side by side they render as image, but if you put U+200D between them, you get a farmer: image

U+200D is called the Zero-width Joiner, or ZWJ for short. It works in a similar way to what we saw with skin tone, but this time you can combine two self-contained emoji into one. Not all combinations work, but many do, sometimes in surprising ways!
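The farmer example can be assembled like this. The helper zwj_join is a name invented for the sketch; whether the result renders as one picture depends on the emoji font.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER


def zwj_join(*parts: str) -> str:
    """Glue emoji together with ZWJ between each pair."""
    return ZWJ.join(parts)


woman = "\U0001F469"
rice = "\U0001F33E"

side_by_side = woman + rice       # two separate emoji
farmer = zwj_join(woman, rice)    # one emoji, if the font supports the sequence

print([f"U+{ord(c):04X}" for c in side_by_side])
print([f"U+{ord(c):04X}" for c in farmer])
print(farmer.split(ZWJ) == [woman, rice])  # the parts can be recovered
```

The last line also hints at how unsupported sequences degrade: a renderer can fall back to drawing the parts one after another.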

Some examples:

image

One weird inconsistency I noticed is that hair color is done through ZWJ, while skin tone is just an emoji modifier without ZWJ. Why? I have no idea.

image

Unfortunately, some emojis are not implemented as combinations with ZWJ. I consider these missed opportunities:

image

How do you type ZWJ? You can’t. But you can copy it from here: “”. Note: this is a special character, so expect it to behave strangely. You can’t see it, but it is there. (Translator’s note: the original article includes the character, but Habr does not allow it.)

Another big area where ZWJ shines is families and relationships. Here’s a short story to illustrate:

image

Flags

Country flags are part of the Unicode standard, but for some reason they are not implemented on Windows. If you are reading this in a browser on Windows, sorry!

Flags do not have dedicated codepoints. Instead, they are two-letter ligatures.

image

Left: Windows. Right: Mac.

Admittedly, they don’t use real letters. Instead, they use the “regional indicator symbol letter” alphabet (U+1F1E6..1F1FF). These letters are not used for anything other than composing flags.
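Since the regional indicators run from A to Z in order, converting a two-letter country code into a flag is simple arithmetic. A sketch, with flag and country_code as made-up helper names:

```python
REGIONAL_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(code: str) -> str:
    """Turn a two-letter code such as 'DE' into regional indicator symbols."""
    code = code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected two ASCII letters")
    return "".join(chr(REGIONAL_A + ord(c) - ord("A")) for c in code)


def country_code(flag_text: str) -> str:
    """Inverse of flag(): recover the two letters."""
    return "".join(chr(ord(c) - REGIONAL_A + ord("A")) for c in flag_text)


germany = flag("de")
print([f"U+{ord(c):04X}" for c in germany])
print(country_code(germany))
```

The function happily builds pairs that are not real flags; whether a pair renders as a flag is up to the font.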

What happens if you put two random letters together? Not much: image (except that text editing starts to behave strangely).

If you want to experiment, feel free to copy and combine from this alphabet: image

There are 258 valid two-letter combinations. Can you find them all?

A fun side effect of the two-letter ligature: image

Sequences of tags

Two-letter ligatures are cool, but don’t you want to go further? How about 32-letter ligatures? Meet tag sequences.

A tag sequence is a regular emoji, followed by another variant of the Latin letters (the tag characters, U+E0020..E007E), and terminated by U+E007F CANCEL TAG.

They are currently only used for these three flags: England, Scotland and Wales:

image
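As a rough sketch of how such a sequence is put together: the tag characters mirror printable ASCII shifted by 0xE0000, so the flag of England is a black flag followed by the tag letters "gbeng" and the terminator. The base emoji U+1F3F4 and the subdivision codes are not stated in the text above; they are filled in here from the emoji specification. The helper name tag_sequence is invented.

```python
BLACK_FLAG = "\U0001F3F4"   # WAVING BLACK FLAG, the base emoji for these flags
CANCEL_TAG = "\U000E007F"


def tag_sequence(base: str, tag: str) -> str:
    """Append ASCII text as tag characters (U+E0020..E007E), then CANCEL TAG."""
    if not all(0x20 <= ord(c) <= 0x7E for c in tag):
        raise ValueError("tags must be printable ASCII")
    return base + "".join(chr(0xE0000 + ord(c)) for c in tag) + CANCEL_TAG


england = tag_sequence(BLACK_FLAG, "gbeng")
scotland = tag_sequence(BLACK_FLAG, "gbsct")
wales = tag_sequence(BLACK_FLAG, "gbwls")

print([f"U+{ord(c):04X}" for c in england])
```

On a system without support for the sequence, the tag characters are invisible and only the plain black flag shows up.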

Keycaps

Not super exciting, but necessary for completeness: keycap sequences use yet another convention.

It looks like this: take a digit, * or #, turn it into an emoji with U+FE0F, then wrap it in a square with U+20E3 COMBINING ENCLOSING KEYCAP.

image

There are 12 of them:

image
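The recipe above is short enough to write out directly. A sketch, where keycap is a made-up helper name:

```python
VS16 = "\ufe0f"     # VARIATION SELECTOR-16
KEYCAP = "\u20e3"   # COMBINING ENCLOSING KEYCAP
KEYS = "0123456789*#"


def keycap(ch: str) -> str:
    """Build a keycap sequence from a digit, '*' or '#'."""
    if len(ch) != 1 or ch not in KEYS:
        raise ValueError("expected a single digit, '*' or '#'")
    return ch + VS16 + KEYCAP


all_keycaps = [keycap(ch) for ch in KEYS]
print(len(all_keycaps))
print([f"U+{ord(c):04X}" for c in keycap("1")])
```

Ten digits plus the two symbols give the twelve sequences.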

Unicode updates

Unicode is updated every year and emoji are a core part of every release. For example, in Unicode 13 (March 2020) 55 new emojis were added.

At the time of writing, neither the latest macOS (11.2.3) nor iOS (14.4.1) supports emoji from Unicode 13, such as: image

Here’s what I see in March 2021: image

But thanks to the magic of ZWJ, I can still figure out what’s going on, just not in the most optimal way.

Conclusion

To summarize, there are seven ways to encode emoji:

  1. Single codepoint image
  2. Single codepoint + variation selector-16 image
  3. Skin Tone Modifier image
  4. Zero-width joiner sequence image
  5. Flags image
  6. Tag sequence image
  7. Keycap sequence image

Methods 1-4 can be combined to build a rather complex sequence:

image
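One way to take such a combined sequence apart is to print the name of every codepoint in it. The sequence below (a man health worker with a medium skin tone) is my own example, since the pictured one did not survive; describe is a made-up helper name.

```python
import unicodedata


def describe(sequence: str) -> list:
    """List (U+XXXX, Unicode name) for every code point in the string."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>")) for c in sequence]


# Base emoji + skin tone modifier + ZWJ + a symbol switched to emoji style by VS16.
health_worker = "\U0001F468\U0001F3FD\u200d\u2695\ufe0f"

for codepoint, name in describe(health_worker):
    print(codepoint, name)
```

Five codepoints, four of the mechanisms above, one visible character.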

If you are a programmer, remember to always use a grapheme-aware library such as ICU for:

  • extracting substrings
  • measuring string length
  • reversing strings

The keyword to google is “grapheme cluster”. This applies to emoji, Western diacritics, and Indic and Korean scripts, so please be careful.
image
