
Why Do Emoji Lengths Differ in JavaScript? Understanding Unicode, UTF‑8 & UTF‑16

This article explains why strings containing emojis report different lengths in JavaScript, covering Unicode fundamentals, code points, UTF‑8 and UTF‑16 encodings, surrogate pairs, grapheme clusters, zero‑width joiners, and modern ES2015‑ESNext features that help handle Unicode correctly.

ELab Team

Many developers wonder why a string with emojis can have different length results in JavaScript, why "😀".charAt(0) equals "👍".charAt(0), and how to determine the length of characters like 'a', '嗨', '𠮷', '💩', or '🤦🏻‍♂️'.

Unicode

Unicode is the industry standard that assigns a unique code point (e.g., U+4E25 for the Chinese character 严) to every symbol, but it does not dictate how those code points are stored in memory.

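Before Unicode, different regions used incompatible character sets and encodings such as ASCII or GB2312, so text produced under one could not be reliably read under another. Unicode replaces them with a single character set that covers all of these scripts.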

Key Terminology

Code point : The abstract number assigned to a symbol, written as U+xxxx.

Script : A collection of letters and symbols used by one or more writing systems.

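Plane : Unicode is divided into 17 planes, each containing 65,536 code points. The Basic Multilingual Plane (BMP, U+0000 to U+FFFF) holds most common characters; the 16 supplementary planes (U+10000 to U+10FFFF) hold the rest. The Supplementary Multilingual Plane (SMP, plane 1) is the first of these and contains most emoji.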

Code unit : The fixed-size unit an encoding uses to store code points in memory: 8 bits in UTF‑8, 16 bits in UTF‑16. A single code point may need several code units.
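As a rough illustration of these terms, the plane of a code point is its value divided by 0x10000. The helper name planeOf below is hypothetical and exists only for this sketch; codePointAt is the standard String method.

```javascript
// Hypothetical helper: plane number (0 to 16) of the first code point in `str`.
function planeOf(str) {
  const codePoint = str.codePointAt(0);   // full code point, not a single UTF-16 code unit
  return Math.floor(codePoint / 0x10000); // each plane holds 0x10000 (65,536) code points
}

console.log(planeOf("a"));         // 0 (BMP)
console.log(planeOf("\u4E25"));    // 0 (BMP, U+4E25)
console.log(planeOf("\u{1F4A9}")); // 1 (SMP, U+1F4A9)
console.log(planeOf("\u{20BB7}")); // 2 (Supplementary Ideographic Plane, U+20BB7)
```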

UTF‑8

UTF‑8 uses 8‑bit code units and encodes each code point in one to four bytes, following two rules:

If the symbol fits in one byte, the first bit is 0 and the remaining 7 bits hold the Unicode value (identical to ASCII).

For multi‑byte symbols (n > 1), the first byte starts with n 1‑bits followed by a 0, and each continuation byte starts with 10.

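For example, the code point U+4E25 falls into the three‑byte range (U+0800 to U+FFFF), so its UTF‑8 template is 1110xxxx 10xxxxxx 10xxxxxx. Filling the x positions with the 16 bits of 0x4E25 (0100 1110 0010 0101) gives 11100100 10111000 10100101, which is E4 B8 A5 in hexadecimal.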
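The byte layout can be checked with the standard TextEncoder API, which always emits UTF‑8. This is a minimal sketch; the variable names are illustrative only.

```javascript
// TextEncoder always produces UTF-8 bytes.
const bytes = new TextEncoder().encode("\u4E25"); // U+4E25

// Render each byte in binary and in hexadecimal.
const binary = Array.from(bytes, (b) => b.toString(2).padStart(8, "0"));
const hex = Array.from(bytes, (b) => b.toString(16).toUpperCase());

console.log(binary.join(" ")); // 11100100 10111000 10100101
console.log(hex.join(" "));    // E4 B8 A5
```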

UTF‑16

JavaScript strings are sequences of 16‑bit code units, interpreted as UTF‑16. A BMP code point is stored in a single code unit; a code point in any supplementary plane (U+10000 and above) requires a surrogate pair, that is, two code units.

Thus, the emoji "💩" (U+1F4A9) is stored as '\uD83D\uDCA9', making its .length equal to 2.
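The surrogate pair follows from simple arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10‑bit halves. The helper toSurrogatePair below is a hypothetical name used only for this sketch.

```javascript
// Hypothetical helper: split a supplementary code point (U+10000 to U+10FFFF)
// into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;        // 20-bit value
  const high = 0xd800 + (offset >> 10);      // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);     // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f4a9);
console.log(high.toString(16), low.toString(16)); // d83d dca9
console.log("\uD83D\uDCA9" === "\u{1F4A9}");      // true
console.log("\u{1F4A9}".length);                  // 2
```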

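According to MDN, charAt() returns a new string consisting of the single UTF‑16 code unit at the given index. This is why "😀".charAt(0) equals "👍".charAt(0): both emojis begin with the same high surrogate, \uD83D.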
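A short sketch of the difference between reading a code unit and reading a code point, using only standard String methods (the variable names are illustrative):

```javascript
const grin = "\u{1F600}";   // stored as \uD83D \uDE00
const thumbs = "\u{1F44D}"; // stored as \uD83D \uDC4D

// charAt(0) and charCodeAt(0) see only the first code unit, the shared high surrogate.
console.log(grin.charAt(0) === thumbs.charAt(0)); // true
console.log(grin.charCodeAt(0).toString(16));     // d83d

// codePointAt(0) reads the whole surrogate pair.
console.log(grin.codePointAt(0).toString(16));    // 1f600
console.log(thumbs.codePointAt(0).toString(16));  // 1f44d
```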

Grapheme Clusters

A grapheme is the smallest visual unit of text, roughly what a reader perceives as a single character. Some graphemes consist of multiple code points (a grapheme cluster), such as the letter é formed by 'e' (U+0065) plus a combining acute accent (U+0301).
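The two forms of é can be compared directly. The sketch below uses the standard String.prototype.normalize method; NFC composes the pair into the single precomposed code point U+00E9.

```javascript
const decomposed = "e\u0301";                 // 'e' followed by a combining acute accent
const composed = decomposed.normalize("NFC"); // precomposed form, U+00E9

console.log(decomposed.length);       // 2
console.log(composed.length);         // 1
console.log(decomposed === composed); // false: same grapheme, different code points
console.log(composed === "\u00E9");   // true
```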

Emoji Representation

Complex emojis like "🤦🏻‍♂️" involve a zero‑width joiner (ZWJ) and modifiers. The ZWJ (U+200D) joins separate emoji characters into a single displayed glyph.

Emoji modifiers change skin tone, and sequences can be built by concatenating code points with ZWJ.
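A ZWJ sequence can be assembled by hand with String.fromCodePoint. The sketch below assumes the facepalm emoji used in this article is the standard five‑code‑point sequence listed in the comments; the variable name is illustrative.

```javascript
const facepalm = String.fromCodePoint(
  0x1f926, // person facepalming
  0x1f3fb, // emoji modifier: light skin tone
  0x200d,  // zero-width joiner
  0x2642,  // male sign
  0xfe0f   // variation selector-16 (requests emoji presentation)
);

console.log(facepalm);             // one glyph where the font supports the sequence
console.log(facepalm.length);      // 7 UTF-16 code units
console.log([...facepalm].length); // 5 code points
```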

Unicode in JavaScript

ES2015

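String iteration (e.g., [...str] or Array.from(str)) walks a string by code point, so it handles every supplementary‑plane character correctly. It still splits grapheme clusters such as combined emojis into their individual code points. Libraries such as Punycode.js offer code‑point‑level helpers for older environments, but grapheme‑level splitting needs a dedicated segmentation algorithm.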
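A small sketch of where iteration helps and where it stops helping (the variable names are illustrative):

```javascript
const poo = "\u{1F4A9}"; // one supplementary-plane code point

console.log(poo.length);                       // 2 code units
console.log([...poo].length);                  // 1 code point
console.log(Array.from("a\u{20BB7}b").length); // 3 code points

// Iteration still splits a grapheme cluster into its code points.
const accented = "e\u0301"; // 'e' + combining acute accent, displayed as one character
console.log([...accented].length);             // 2
```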

Regular Expressions

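Without the u flag, the dot in a pattern like /foo.bar/ matches a single UTF‑16 code unit, so the pattern fails on "foo💩bar", where the emoji occupies two code units. Adding the u flag makes the regex operate on code points, so the dot matches the whole emoji.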
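The effect of the flag is easy to observe. In this sketch the emoji is written as an escape so the code units are explicit:

```javascript
const text = "foo\u{1F4A9}bar"; // "foo", one emoji, "bar"

console.log(/foo.bar/.test(text));      // false: "." consumes only one of two code units
console.log(/foo.bar/u.test(text));     // true: "." consumes the whole code point
console.log(/^.$/.test("\u{1F4A9}"));   // false
console.log(/^.$/u.test("\u{1F4A9}"));  // true
```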

ES2018 – Unicode Property Escapes

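ES2018 introduces the \p{...} syntax, used together with the u flag, which lets regexes match characters by Unicode property, such as script (\p{Script=Han}) or emoji (\p{Emoji}). Note that \p{Emoji} also matches ASCII digits, # and *, because those characters carry the Emoji property.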

const input = `I'm Chinese!我是中国人`;
// Without the g flag, match() returns the first run of Han characters with its index.
console.log(input.match(/\p{Script=Han}+/u));
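A slightly broader sketch of property escapes follows. The sample string is made up for illustration; Script=Han, Emoji and Extended_Pictographic are all properties supported by ES2018 regexes.

```javascript
const sample = "I'm Chinese!我是中国人 \u{1F4A9} 123";

// With the g flag, match() returns every run of Han characters.
const hanRuns = sample.match(/\p{Script=Han}+/gu);
console.log(hanRuns.length); // 1

// \p{Emoji} is broader than it sounds: ASCII digits carry the Emoji property.
console.log(/\p{Emoji}/u.test("1"));                         // true
console.log(/\p{Extended_Pictographic}/u.test("1"));         // false
console.log(/\p{Extended_Pictographic}/u.test("\u{1F4A9}")); // true
```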

ESNext – Intl.Segmenter

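The Intl.Segmenter API, now part of ECMA‑402 and available in current browsers and Node.js, can segment strings into graphemes, words, or sentences. Using granularity: "grapheme" treats a ZWJ sequence such as the facepalm emoji in the following example as a single segment.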

// "zh" is the BCP 47 language tag for Chinese; "cn" is a country code, not a language.
let segmenter = new Intl.Segmenter("zh", {granularity: "grapheme"});
let input = "有几个字?🤦🏻‍♂️";
// Segment data exposes `segment` and `index`; `isWordLike` exists only for word granularity.
for (let {segment, index} of segmenter.segment(input)) {
  console.log(`segment at [${index}, ${index + segment.length}): «${segment}»`);
}
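To tie the three units together, the following sketch measures the same strings in code units, code points, and graphemes. The names graphemeSegmenter and measure are hypothetical, and the facepalm sequence is assumed to be the standard five‑code‑point form; grapheme counts come from the runtime's Unicode data, so very old runtimes may differ.

```javascript
// Locale `undefined` means "use the default locale"; grapheme rules are largely locale-independent.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical helper: report the three different "lengths" of a string.
function measure(str) {
  return {
    codeUnits: str.length,                                  // UTF-16 code units
    codePoints: [...str].length,                            // Unicode code points
    graphemes: [...graphemeSegmenter.segment(str)].length,  // user-perceived characters
  };
}

const samples = [
  "a",
  "嗨",
  "\u{20BB7}", // supplementary-plane ideograph
  "\u{1F4A9}", // pile of poo emoji
  String.fromCodePoint(0x1f926, 0x1f3fb, 0x200d, 0x2642, 0xfe0f), // facepalm ZWJ sequence
];

for (const s of samples) {
  console.log(s, measure(s));
}
```

Under these assumptions the first two strings measure 1 in every unit, the next two measure 2 code units but 1 code point and 1 grapheme, and the ZWJ sequence measures 7 code units, 5 code points, and 1 grapheme.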

Practical Takeaways

Understand the difference between code points, code units, and grapheme clusters.

Know that JavaScript strings are sequences of UTF‑16 code units, so every code point outside the BMP counts as 2 in .length.

Use the u regex flag, Unicode property escapes, or Intl.Segmenter for accurate Unicode processing.

Figures: Unicode planes diagram; UTF‑8 encoding example; zero‑width joiner example; emoji modifiers; firefighter emoji composition; Intl.Segmenter output.
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
