
Why Does JavaScript .length Miscount Emoji? A Deep Dive into UTF‑16 and Unicode

This article explains why JavaScript's string length property returns unexpected values for Unicode characters such as emoji, walks through the UTF‑16 encoding rules, and demonstrates ES6 techniques for handling Unicode strings correctly, including for‑of loops, spread syntax, the \u{…} escape, and the regex /u flag.

Programmer DD

Source: juejin.cn/post/7025400771982131236

During development you may encounter encoding, Unicode, and Emoji issues; this article clarifies why string length behaves unexpectedly and how to handle it correctly.

In JavaScript, the .length property counts UTF‑16 code units, so characters outside the Basic Multilingual Plane (BMP) occupy two units. For example:

'吉'.length // 1
'𠮷'.length // 2
'❤'.length // 1
'💩'.length // 2

ECMAScript strings are sequences of UTF‑16 code units. A code unit is 16 bits (two bytes): a BMP character (U+0000–U+FFFF) fits in one unit, while a supplementary-plane character (U+10000–U+10FFFF) requires a surrogate pair, that is, two units or four bytes.

Encoding logic:

If the code point ≤ U+FFFF, use it directly as a single code unit.

Otherwise, compute a surrogate pair: the high surrogate is floor((cp‑65536) / 1024) + 0xD800 and the low surrogate is ((cp‑65536) % 1024) + 0xDC00.
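The rule above can be sketched in a few lines. The helper name toSurrogatePair is made up for this illustration, and the sketch assumes a valid code point as input (it does not reject values in the surrogate range or above U+10FFFF).

```javascript
// toSurrogatePair is a hypothetical helper, not a built-in.
// It returns the UTF-16 code units for one code point.
function toSurrogatePair(codePoint) {
  if (codePoint <= 0xffff) {
    return [codePoint]; // BMP: a single code unit
  }
  const offset = codePoint - 0x10000; // same as cp - 65536
  const high = Math.floor(offset / 0x400) + 0xd800; // 0x400 === 1024
  const low = (offset % 0x400) + 0xdc00;
  return [high, low];
}

const units = toSurrogatePair(0x1f4a9); // 💩
console.log(units.map((u) => u.toString(16))); // -> [ 'd83d', 'dca9' ]
console.log(String.fromCharCode(...units) === '💩'); // -> true
```

In everyday code the built-in String.fromCodePoint(0x1f4a9) does the same job; the sketch only makes the arithmetic visible.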

Example of a BMP character:

'\u0041' // -> A
'A' === '\u0041' // -> true

Example of a supplementary character (💩, U+1F4A9):

'\ud83d\udca9' // -> '💩'
'💩' === '\ud83d\udca9' // -> true

The ES6 \u{…} code point escape and the older \uXXXX escapes (a surrogate pair for supplementary characters) produce the same string:

'\u0041' === '\u{41}' // -> true
'\ud83d\udca9' === '\u{1f4a9}' // -> true

Because .length counts code units, a character like 💩 is counted as 2. To count each supplementary character once, a common pre‑ES6 workaround replaces every surrogate pair with a one‑unit placeholder before measuring:

const spRegexp = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;
// str is the input string; an empty or missing value counts as 0
const len = str ? str.replace(spRegexp, '_').length : 0;
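As a rough sketch, the workaround can be wrapped in a function and compared with .length. The name countCodePoints is an assumption made for this example. Note that the result is a count of code points, so an emoji built from several code points (here a thumbs-up plus a skin-tone modifier) still counts as more than one.

```javascript
// countCodePoints is a hypothetical helper, not a built-in.
const surrogatePair = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

function countCodePoints(str) {
  if (!str) return 0;
  // Collapse each surrogate pair into a single placeholder unit, then measure.
  return str.replace(surrogatePair, '_').length;
}

console.log('👻yo𠮷'.length); // -> 6
console.log(countCodePoints('👻yo𠮷')); // -> 4
// U+1F44D (thumbs up) followed by U+1F3FD (skin tone): two code points
console.log(countCodePoints('\u{1F44D}\u{1F3FD}')); // -> 2
```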

ES6 introduced better Unicode support:

for…of iterates over a string by code point, so a surrogate pair arrives as one value instead of being split in two as in an index‑based for loop.

var str = '👻yo𠮷';
for (var i = 0; i < str.length; i++) { console.log(str[i]); }
// -> � � y o � �

for (const char of str) { console.log(char); }
// -> 👻 y o 𠮷

Spread syntax also splits a string by code point, so it yields the expected count:

[...'💩'].length // -> 1

Methods such as slice, split, and substr work on code units, so they have the same problem and can cut a surrogate pair in half.
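To make the slicing problem concrete, here is a small sketch. The helper sliceByCodePoints is a hypothetical name; it converts the string to an array of code points first, which is simple but allocates an array on every call.

```javascript
const str = '💩abc';

// Code-unit based methods split the surrogate pair.
console.log(str.slice(0, 1) === '\ud83d'); // -> true (a lone high surrogate)
console.log(str.split('').length); // -> 5

// sliceByCodePoints is a hypothetical helper, not a built-in.
function sliceByCodePoints(s, start, end) {
  // Array.from iterates by code point, like for…of and spread.
  return Array.from(s).slice(start, end).join('');
}

console.log(sliceByCodePoints(str, 0, 1)); // -> 💩
console.log(sliceByCodePoints(str, 1, 3)); // -> ab
```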

The /u regex flag makes a pattern match by code point, so . consumes a whole surrogate pair instead of a single code unit:

/^.$/.test('👻') // -> false
/^.$/u.test('👻') // -> true

charCodeAt returns the code unit at the given index, which for a surrogate pair is only its first half, while codePointAt returns the full code point:

'羽'.charCodeAt(0) // -> 32701
'羽'.codePointAt(0) // -> 32701
'😸'.charCodeAt(0) // -> 55357
'😸'.codePointAt(0) // -> 128568
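Combining codePointAt with code point iteration gives a quick way to inspect a string. This is only a sketch; listCodePoints is a hypothetical name, and the U+XXXX formatting is a presentation choice.

```javascript
// listCodePoints is a hypothetical helper, not a built-in.
function listCodePoints(str) {
  // Array.from walks the string by code point, so ch holds a full character.
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

console.log(listCodePoints('羽😸')); // -> [ 'U+7FBD', 'U+1F638' ]

// Indexes are still code units: index 1 lands on the low surrogate.
console.log('😸'.codePointAt(1)); // -> 56888
console.log(String.fromCodePoint(128568) === '😸'); // -> true
```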

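String equality can also be affected by Unicode normalization: the same visible text can be stored as different code point sequences. The String.prototype.normalize() method converts a string to one normalization form (NFC by default), so canonically equivalent strings compare equal afterwards: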

'cafe\u0301' === 'café' // -> false
'cafe\u0301'.normalize() === 'café'.normalize() // -> true

The underlying rule, in the words of the ECMAScript specification: "Where ECMAScript operations interpret String values, each element is interpreted as a single UTF‑16 code unit."
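Since comparison works on code units, two spellings of the same visible word can differ in both length and equality. The sketch below writes both spellings with explicit escapes so the difference is visible in the source; the variable names are just for illustration.

```javascript
const decomposed = 'cafe\u0301'; // e followed by U+0301 COMBINING ACUTE ACCENT
const precomposed = 'caf\u00e9'; // single code point U+00E9

console.log(decomposed.length, precomposed.length); // -> 5 4
console.log(decomposed === precomposed); // -> false

// NFC composes, NFD decomposes; normalize() without arguments uses NFC.
console.log(decomposed.normalize('NFC') === precomposed); // -> true
console.log(precomposed.normalize('NFD') === decomposed); // -> true
console.log(decomposed.normalize().length); // -> 4
```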

In summary, JavaScript's legacy UTF‑16 string model leads to length and iteration anomalies for characters outside the BMP, but ES6 provides tools to work with Unicode by code point: for…of, spread syntax, the /u flag, codePointAt, and normalize().

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
