Decoding Emoji: Unicode, Variants, and JavaScript Handling

This article explains how emojis are represented in Unicode, covering basic code points, variation selectors, skin‑tone modifiers, zero‑width joiners, flag ligatures, tag sequences and keycap symbols, and shows how JavaScript can correctly process them using grapheme‑cluster techniques.

Alipay Experience Technology

Apr 23, 2023

Emoji Encoding Basics

Each emoji corresponds to one or more Unicode code points. For example, the moon emoji 🌔 is U+1F314 (JavaScript UTF‑16 "\uD83C\uDF14"), while the rare Chinese character 𠇔 is U+201D4 ("\uD840\uDDD4").
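The relationship between a code point and its UTF‑16 surrogate pair can be checked directly. The following is a minimal sketch; the helper names `toSurrogatePair` and `describe` are made up for illustration and do not come from the article.

```javascript
// Split a supplementary-plane code point (above U+FFFF) into its
// UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Report how one character looks at the code-point and code-unit level.
function describe(char) {
  return {
    codePoint: 'U+' + char.codePointAt(0).toString(16).toUpperCase(),
    utf16Units: char.length,      // .length counts 16-bit code units
    codePoints: [...char].length, // string iteration walks code points
  };
}

console.log(describe('\uD83C\uDF14')); // the moon emoji from the text
console.log(toSurrogatePair(0x201D4).map((u) => u.toString(16)));
```

Both characters from the paragraph above occupy two code units but only one code point, which is why `.length` reports 2 for each.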

Special Rules

Variation Selectors VS‑15 (U+FE0E) and VS‑16 (U+FE0F) follow a character and select its presentation: VS‑15 requests text style, while VS‑16 requests emoji style.
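As a rough illustration, a variation selector is an extra code point appended to the base character. The helper `withPresentation` below is a hypothetical name, and whether the two results look different depends on the font and platform.

```javascript
const VS15 = '\uFE0E'; // requests text presentation
const VS16 = '\uFE0F'; // requests emoji presentation

// Replace any existing variation selector on `base` with the requested one.
function withPresentation(base, style) {
  const bare = base.replace(/[\uFE0E\uFE0F]/g, '');
  return bare + (style === 'emoji' ? VS16 : VS15);
}

const heart = '\u2764'; // HEAVY BLACK HEART
const emojiHeart = withPresentation(heart, 'emoji');
const textHeart = withPresentation(emojiHeart, 'text');

console.log(emojiHeart.length); // 2: base character plus selector
console.log(textHeart === '\u2764\uFE0E'); // true
```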

Skin‑tone Modifiers (U+1F3FB–U+1F3FF) are appended to a base emoji to create different skin tones, e.g., 👋 + U+1F3FD = 👋🏽.
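A sketch of appending a modifier in code follows. The `SKIN_TONES` table and `applySkinTone` are illustrative names. The check uses the Unicode property `Emoji_Modifier_Base`, which JavaScript regular expressions expose with the `u` flag, because only certain base emojis accept a skin tone.

```javascript
// The five modifiers, U+1F3FB through U+1F3FF.
const SKIN_TONES = {
  light: 0x1F3FB,
  mediumLight: 0x1F3FC,
  medium: 0x1F3FD,
  mediumDark: 0x1F3FE,
  dark: 0x1F3FF,
};

function applySkinTone(base, tone) {
  // A modifier takes the place of a trailing VS-16, so drop it first.
  const bare = base.replace(/\uFE0F$/, '');
  if (!/^\p{Emoji_Modifier_Base}$/u.test(bare)) {
    throw new RangeError('This emoji does not accept a skin-tone modifier');
  }
  const modifier = SKIN_TONES[tone];
  if (modifier === undefined) {
    throw new RangeError('Unknown tone: ' + tone);
  }
  return bare + String.fromCodePoint(modifier);
}

const wave = applySkinTone('\u{1F44B}', 'medium'); // waving hand + U+1F3FD
console.log(wave.length); // 4: two surrogate pairs
```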

Zero‑Width Joiner (ZWJ) (U+200D) combines multiple base emojis into a single glyph, such as 👩 + ZWJ + 🌾 = 👩‍🌾.
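In code, a ZWJ sequence is an ordinary string with U+200D between its parts. The sketch below uses a hypothetical `joinWithZwj` helper; note that only sequences a platform's font supports render as one glyph, while others fall back to separate emojis.

```javascript
const ZWJ = '\u200D';

// Join emoji components with zero-width joiners.
function joinWithZwj(...parts) {
  return parts.join(ZWJ);
}

// Woman + ZWJ + sheaf of rice, the farmer example from the text.
const farmer = joinWithZwj('\u{1F469}', '\u{1F33E}');

console.log([...farmer].length); // 3 code points: woman, ZWJ, sheaf of rice
console.log(farmer.length);      // 5 UTF-16 code units: 2 + 1 + 2
console.log(farmer.split(ZWJ).length); // 2 components
```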

Flag Symbols are formed by pairing two regional indicator symbols (U+1F1E6–U+1F1FF, one for each letter A–Z) that spell a two‑letter country code, e.g., 🇨 + 🇳 = 🇨🇳.
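Because the regional indicators map one‑to‑one onto the letters A–Z, a flag can be built from a two‑letter code with simple arithmetic. This is a minimal sketch; `countryFlag` is an illustrative name, and the function does not check that the code belongs to a real country.

```javascript
const REGIONAL_INDICATOR_A = 0x1F1E6; // regional indicator symbol letter A

function countryFlag(code) {
  if (!/^[A-Za-z]{2}$/.test(code)) {
    throw new RangeError('Expected a two-letter code such as "CN"');
  }
  return [...code.toUpperCase()]
    .map((ch) => {
      const index = ch.charCodeAt(0) - 'A'.charCodeAt(0); // 0 for A, 25 for Z
      return String.fromCodePoint(REGIONAL_INDICATOR_A + index);
    })
    .join('');
}

const china = countryFlag('cn');
console.log(china.length); // 4: two surrogate pairs
```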

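Tag Sequences append invisible tag characters to a base emoji and end with the cancel tag U+E007F. Tag characters mirror ASCII at an offset of U+E0000, so the tag letters a–z are U+E0061–U+E007A. In practice they encode subdivision flags: 🏴 + the tag letters "gbeng" + U+E007F is the flag of England.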
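Reading a tag sequence back is the reverse operation: strip the base flag and the cancel tag, then shift each tag character down by 0xE0000. The function name `decodeTagSequence` is hypothetical and the sketch is not taken from the article.

```javascript
const BLACK_FLAG = 0x1F3F4; // base emoji of every subdivision flag
const CANCEL_TAG = 0xE007F; // terminates the sequence

// Returns the hidden ASCII payload (e.g. "gbeng"), or null when the
// input is not a well-formed flag tag sequence.
function decodeTagSequence(text) {
  const codePoints = Array.from(text, (ch) => ch.codePointAt(0));
  if (codePoints.length < 3) return null;
  if (codePoints[0] !== BLACK_FLAG) return null;
  if (codePoints[codePoints.length - 1] !== CANCEL_TAG) return null;

  const tags = codePoints.slice(1, -1);
  // Tag characters occupy U+E0020 to U+E007E.
  if (!tags.every((cp) => cp >= 0xE0020 && cp <= 0xE007E)) return null;
  return String.fromCodePoint(...tags.map((cp) => cp - 0xE0000));
}

const england = '\u{1F3F4}\u{E0067}\u{E0062}\u{E0065}\u{E006E}\u{E0067}\u{E007F}';
console.log(decodeTagSequence(england)); // "gbeng"
```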

Keycap Sequences combine a base character (#, *, or a digit 0‑9) with VS‑16 and the combining enclosing keycap U+20E3 to produce #️⃣, *️⃣, etc.
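A keycap sequence is therefore always three code points long. A minimal sketch, with `keycap` as an illustrative name:

```javascript
const VS16 = '\uFE0F';             // emoji presentation selector
const ENCLOSING_KEYCAP = '\u20E3'; // combining enclosing keycap

function keycap(symbol) {
  // Only these twelve base characters have keycap sequences.
  if (!/^[0-9#*]$/.test(symbol)) {
    throw new RangeError('Keycaps exist only for 0-9, # and *');
  }
  return symbol + VS16 + ENCLOSING_KEYCAP;
}

const hashKey = keycap('#');
console.log(hashKey.length); // 3 code units, each in the BMP
```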

Unicode Grapheme Clusters

Unicode treats a sequence of code points that displays as a single character as a grapheme cluster. As a result, the visual length of a string (its number of grapheme clusters) can differ from its technical length (its number of UTF‑16 code units). For instance, "工作中👨‍💻" has a visual length of 4 but a technical length of 8: three single‑unit Chinese characters, plus 👨 (2 units), ZWJ (1 unit), and 💻 (2 units).

Accurate string manipulation therefore requires splitting a string into grapheme clusters rather than using .length or .substring directly.
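One way to split by grapheme cluster in modern JavaScript is the built‑in `Intl.Segmenter`. The article does not say which technique its library uses, so treat this as one possible approach; `graphemes` is an illustrative helper name.

```javascript
// A segmenter with grapheme granularity follows the Unicode
// text-segmentation rules for extended grapheme clusters.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

function graphemes(text) {
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

// "工作中" followed by man + ZWJ + laptop, the example from the text.
const text = '工作中\u{1F468}\u200D\u{1F4BB}';

console.log(text.length);            // 8 UTF-16 code units
console.log([...text].length);       // 6 code points
console.log(graphemes(text).length); // 4 grapheme clusters
```

Iterating by code point with the spread operator already avoids cutting surrogate pairs in half, but it still breaks the ZWJ sequence into three pieces; only the grapheme split keeps it whole.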

JavaScript Utility Example

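// Builds a subdivision flag (an emoji tag sequence) from a code such as "gbeng".
function flag(letterStr) {
  if (!/^[a-z0-9]+$/i.test(letterStr)) {
    throw new RangeError('Expected ASCII letters or digits, e.g. "gbeng"');
  }
  const BASE_FLAG = '🏴';
  const TAG_CANCEL = String.fromCodePoint(0xE007F); // cancel tag ends the sequence
  // Each tag character is its ASCII counterpart shifted up by 0xE0000.
  const tagStr = Array.from(letterStr.toLowerCase(), (ch) =>
    String.fromCodePoint(0xE0000 + ch.codePointAt(0))
  ).join('');
  return BASE_FLAG + tagStr + TAG_CANCEL;
}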

The library described in the original article provides functions to:

Detect whether a string contains any emoji.

Split a string into grapheme clusters, each exposing isEmoji and physicalLength properties.

Perform visual‑length‑aware substring operations, optionally limiting the physical byte length.
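The three capabilities above could be approximated as follows. This is a sketch under stated assumptions, not the library's actual API: the function names and signatures are invented for illustration, emoji detection is a regex heuristic, and `physicalLength` is assumed to mean UTF‑16 code units.

```javascript
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Heuristic: pictographs, regional indicators (flags) and the keycap mark.
// Text-style symbols such as the copyright sign also match Extended_Pictographic.
const EMOJI_HINT = /\p{Extended_Pictographic}|\p{Regional_Indicator}|\u20E3/u;

function containsEmoji(text) {
  return EMOJI_HINT.test(text);
}

function splitGraphemes(text) {
  return Array.from(graphemeSegmenter.segment(text), ({ segment }) => ({
    text: segment,
    isEmoji: EMOJI_HINT.test(segment),
    physicalLength: segment.length, // assumption: UTF-16 code units
  }));
}

// Take clusters [start, end) by visual index, stopping early if the next
// cluster would push the result past maxPhysicalLength code units.
function visualSubstring(text, start, end = Infinity, maxPhysicalLength = Infinity) {
  let result = '';
  for (const cluster of splitGraphemes(text).slice(start, end)) {
    if (result.length + cluster.physicalLength > maxPhysicalLength) break;
    result += cluster.text;
  }
  return result;
}

const sample = '工作中\u{1F468}\u200D\u{1F4BB}';
console.log(containsEmoji(sample));         // true
console.log(splitGraphemes(sample).length); // 4
console.log(visualSubstring(sample, 0, 4, 4) === '工作中'); // true
```

Stopping before a cluster that would exceed the limit, rather than truncating inside it, means the result never contains half of an emoji.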

Compatibility Considerations

If a receiving system supports only an older Unicode version, complex emojis may render as their separate components (e.g., 👨‍💻 may appear as 👨💻). The library nevertheless always treats the sequence as a single grapheme cluster, according to the latest Unicode standard.
