Why Does JavaScript .length Miscount Emoji? A Deep Dive into UTF‑16 and Unicode
This article explains why JavaScript's string length property returns unexpected values for Unicode characters such as emoji, walks through the UTF‑16 encoding rules, and demonstrates ES6 techniques (for…of loops, spread syntax, the \u{…} escape, and the /u regex flag) for handling Unicode strings correctly.
Source: juejin.cn/post/7025400771982131236
During development you may encounter encoding, Unicode, and Emoji issues; this article clarifies why string length behaves unexpectedly and how to handle it correctly.
In JavaScript, the .length property counts UTF‑16 code units, so characters outside the Basic Multilingual Plane (BMP) occupy two units. For example:
'吉'.length // 1
'𠮷'.length // 2
'❤'.length // 1
'💩'.length // 2
ECMAScript strings use UTF‑16 encoding. A UTF‑16 code unit is two bytes (16 bits); BMP characters (U+0000–U+FFFF) fit in one unit, while supplementary-plane characters (U+10000–U+10FFFF) require a surrogate pair of two units, four bytes in total.
Encoding logic:
If the code point ≤ U+FFFF, use it directly.
Otherwise, compute a surrogate pair: Math.floor((cp - 65536) / 1024) + 0xD800 for the high surrogate and ((cp - 65536) % 1024) + 0xDC00 for the low surrogate.
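The encoding rule above can be sketched as a small function. This is only an illustration: toUtf16CodeUnits is a hypothetical helper name, not a built-in, and it assumes the input is a valid code point (it does not reject values in the surrogate range or above U+10FFFF).

```javascript
// Hypothetical helper: split a Unicode code point into its UTF-16 code units.
// Assumes codePoint is a valid code point; no range validation is done.
function toUtf16CodeUnits(codePoint) {
  if (codePoint <= 0xFFFF) {
    // BMP character: one code unit, identical to the code point.
    return [codePoint];
  }
  // Supplementary character: subtract 0x10000 (65536), then split the
  // remaining 20 bits into a high half and a low half of 10 bits each.
  const offset = codePoint - 0x10000;
  const high = Math.floor(offset / 0x400) + 0xD800; // 0x400 === 1024
  const low = (offset % 0x400) + 0xDC00;
  return [high, low];
}

const units = toUtf16CodeUnits(0x1F4A9); // 💩
console.log(units.map((u) => u.toString(16))); // -> [ 'd83d', 'dca9' ]
console.log(String.fromCharCode(...units) === '💩'); // -> true
```

In practice the built-in String.fromCodePoint(0x1F4A9) performs this encoding for you; the sketch only makes the arithmetic visible.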
Example of a BMP character:
'\u0041' // -> A
'A' === '\u0041' // -> true
Example of a supplementary character (💩, U+1F4A9):
'\ud83d\udca9' // -> '💩'
'💩' === '\ud83d\udca9' // -> true
Both the \u{...} notation and surrogate pairs represent the same character:
'\u0041' === '\u{41}' // -> true
'\ud83d\udca9' === '\u{1f4a9}' // -> true
Because .length counts code units, a character like 💩 is counted as 2. To count each such character once, many developers replace every surrogate pair with a single placeholder character before measuring:
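// Matches one high surrogate followed by one low surrogate
const spRegexp = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;
// str is the input string; val receives its length with each pair counted once
if (str) { val = str.replace(spRegexp, '_').length; }
ES6 introduced better Unicode support: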
for…of iterates by code point rather than by code unit, so it does not split surrogate pairs the way indexing in a traditional for loop does.
var str = '👻yo𠮷';
for (var i = 0; i < str.length; i++) { console.log(str[i]); }
// -> � � y o � �
for (const char of str) { console.log(char); }
// -> 👻 y o 𠮷
Spread syntax also yields the correct character count:
[...'💩'].length // -> 1
Methods such as slice, split(''), and substr operate on code units, so they have the same problem and can cut a surrogate pair in half.
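As a rough sketch of how the iterator-based approach can stand in for length and slice, the helpers below convert the string to an array of code points first. The names codePointLength and codePointSlice are hypothetical, chosen for this example only. They count code points, so a character built from several code points (for example a base letter plus a combining mark) still counts as more than one.

```javascript
// Hypothetical helpers; both rely on the string iterator, which yields
// one array element per code point instead of one per UTF-16 code unit.
function codePointLength(str) {
  return Array.from(str).length;
}

function codePointSlice(str, start, end) {
  // start and end are code point indexes, not code unit indexes.
  return Array.from(str).slice(start, end).join('');
}

const sample = '👻yo𠮷';
console.log(sample.length); // -> 6 (code units)
console.log(codePointLength(sample)); // -> 4 (code points)
console.log(codePointSlice(sample, 0, 1)); // -> 👻
// The built-in slice works on code units and returns a lone high surrogate:
console.log(sample.slice(0, 1) === '\ud83d'); // -> true
```

Building the array costs time and memory proportional to the string length, which is usually acceptable for short user-facing strings but worth keeping in mind for large inputs.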
The /u regex flag makes a pattern match by code point instead of by code unit, so . can match a whole supplementary character:
/^.$/.test('👻') // -> false
/^.$/u.test('👻') // -> true
charCodeAt returns the first code unit of a surrogate pair, while codePointAt returns the full code point:
'羽'.charCodeAt(0) // -> 32701
'羽'.codePointAt(0) // -> 32701
'😸'.charCodeAt(0) // -> 55357
'😸'.codePointAt(0) // -> 128568
String equality can be affected by different Unicode normalizations. The String.prototype.normalize() method makes visually identical strings compare equal:
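'cafe\u0301' === 'caf\u00e9' // -> false ('e' + combining acute accent vs. precomposed 'é')
'cafe\u0301'.normalize() === 'caf\u00e9'.normalize() // -> true
The root cause of all of these behaviors is stated in the ECMAScript specification: where ECMAScript operations interpret String values, each element is interpreted as a single UTF‑16 code unit.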
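To make the normalization comparison concrete, here is a minimal sketch that writes both spellings of "café" with explicit escapes, so the result does not depend on how an editor saved the literal. The variable names are illustrative only.

```javascript
// 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
const decomposed = 'cafe\u0301';
// U+00E9 LATIN SMALL LETTER E WITH ACUTE (precomposed form)
const precomposed = 'caf\u00e9';

console.log(decomposed.length); // -> 5
console.log(precomposed.length); // -> 4
console.log(decomposed === precomposed); // -> false

// normalize() defaults to NFC, which composes 'e' + U+0301 into U+00E9.
console.log(decomposed.normalize('NFC') === precomposed); // -> true
// NFD goes the other way and decomposes U+00E9.
console.log(precomposed.normalize('NFD') === decomposed); // -> true

// Spread syntax counts code points, so the combining mark is still
// counted separately until the string is normalized.
console.log([...decomposed].length); // -> 5
console.log([...decomposed.normalize()].length); // -> 4
```

Normalizing both sides to the same form before comparing or measuring is the usual way to keep these two spellings from being treated as different strings.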
In summary, JavaScript's legacy UTF‑16 handling leads to length and iteration anomalies for characters outside the BMP, but ES6 provides tools—such as for…of, spread syntax, the /u flag, codePointAt, and normalize() —to work with Unicode correctly.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
