Decoding Emoji: Unicode, Variants, and JavaScript Handling
This article explains how emojis are represented in Unicode, covering basic code points, variation selectors, skin‑tone modifiers, zero‑width joiners, flag ligatures, tag sequences and keycap symbols, and shows how JavaScript can correctly process them using grapheme‑cluster techniques.
Emoji Encoding Basics
Each emoji corresponds to one or more Unicode code points. For example, the moon emoji 🌔 is U+1F314 (JavaScript UTF‑16 "\uD83C\uDF14"), while the rare Chinese character 𠇔 is U+201D4 ("\uD840\uDDD4").
Special Rules
Variation Selectors (VS‑15, VS‑16) modify the presentation of a character; VS‑16 (U+FE0F) forces emoji style.
Skin‑tone Modifiers (U+1F3FB–U+1F3FF) are appended to a base emoji to create different skin tones, e.g., 👋 + U+1F3FD = 👋🏽.
Zero‑Width Joiner (ZWJ) (U+200D) combines multiple base emojis into a single glyph, such as 👩 + ZWJ + 🌾 = 👩🌾.
Flag Symbols are formed by pairing two regional indicator symbols (U+1F1E6–U+1F1FF), e.g., 🇨 + 🇳 = 🇨🇳.
Tag Sequences use invisible tag characters (U+E0061–U+E007A) followed by U+E007F to create custom symbols, like 🏴 + gbeng + U+E007F.
Keycap Sequences combine a symbol (e.g., #, *, 0‑9) with VS‑16 and U+20E3 to produce #️⃣, *️⃣, etc.
Unicode Grapheme Clusters
Unicode treats a sequence of code points that form a single displayed character as a grapheme cluster . This means the visual length of a string can differ from its technical length (the number of UTF‑16 code units). For instance, "工作中👨💻" has a visual length of 4 but a technical length of 8.
Accurate string manipulation therefore requires splitting a string into grapheme clusters rather than using .length or .substring directly.
JavaScript Utility Example
function flag(letterStr) {
const BASE_FLAG = '🏴';
const TAG_CANCEL = String.fromCodePoint(0xE007F);
const tagLatinStr = letterStr.toLowerCase().split('').map(letter => {
const codePoint = letter.charCodeAt(0) - 'a'.charCodeAt(0) + 0xE0061;
return String.fromCodePoint(codePoint);
}).join('');
return BASE_FLAG + tagLatinStr + TAG_CANCEL;
}The library described provides functions to:
Detect whether a string contains any emoji.
Split a string into grapheme clusters, each exposing isEmoji and physicalLength properties.
Perform visual‑length‑aware substring operations, optionally limiting the physical byte length.
Compatibility Considerations
If a receiving system supports an older Unicode version, complex emojis may render as separate components (e.g., 👨💻 may appear as 👨💻). The library always treats the sequence as a single grapheme cluster according to the latest Unicode standard.
References
Full Emoji List v15.0 – unicode.org
Unicode Emoji Data – unicode.org/Public/emoji/15.0/
Emoji under the hood (translation) – taoshu.in
Unicode Emoji Regex – unicode.org
Alipay Experience Technology
Exploring ultimate user experience and best engineering practices
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
