Understanding Unicode, UTF-16, and String Length Issues in JavaScript
This article explains why JavaScript string length behaves unexpectedly with Unicode characters, describes UTF‑16 encoding and surrogate pairs, and demonstrates ES6 techniques such as for‑of loops, spread syntax, the u regex flag, codePointAt, and normalize to handle Unicode correctly.
During development, the author encountered inconsistencies in JavaScript string length when dealing with Unicode characters such as emojis and rare CJK characters.
JavaScript strings are sequences of UTF‑16 code units, each two bytes wide. Characters in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) use one code unit, while supplementary‑plane characters (U+10000 and above) are represented by a surrogate pair of two code units. Because length counts code units rather than characters, '𠮷'.length returns 2.
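To make the surrogate-pair arithmetic concrete, here is a minimal sketch of how a supplementary-plane code point is split into two code units. The helper name toSurrogatePair is hypothetical and used only for illustration; the arithmetic itself follows the UTF‑16 definition.

```javascript
// Split a supplementary-plane code point into its UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical helper name, not a built-in.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('expected a code point in U+10000..U+10FFFF');
  }
  const offset = codePoint - 0x10000;      // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x20BB7);          // U+20BB7 is '𠮷'
console.log(high.toString(16), low.toString(16));      // d842 dfb7
console.log(String.fromCharCode(high, low) === '𠮷');  // true
console.log('𠮷'.length);                              // 2
```

The two resulting code units are exactly what length counts, which is why a single visible character reports a length of 2.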
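The article shows how to count each surrogate pair as a single character by replacing it with a placeholder before measuring, e.g. const spRegexp = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g; let val = 0; if (str) { val = str.replace(spRegexp, '_').length; } . The result is a count of code points, which matches the visual length only when the string contains no combining marks or multi‑code‑point emoji sequences.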
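As a sketch, the placeholder technique can be wrapped in a function and compared against the spread-based count. The function names countByReplace and countBySpread are assumptions made for this example, not names from the original article.

```javascript
// Matches one high surrogate followed by one low surrogate.
const spRegexp = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

// Hypothetical helper: collapse each surrogate pair to one placeholder,
// then measure the remaining code units.
function countByReplace(str) {
  return str ? str.replace(spRegexp, '_').length : 0;
}

// Hypothetical helper: the string iterator yields one item per code point.
function countBySpread(str) {
  return str ? [...str].length : 0;
}

console.log('👻yo𠮷'.length);             // 6 (code units)
console.log(countByReplace('👻yo𠮷'));    // 4
console.log(countBySpread('👻yo𠮷'));     // 4

// Both helpers count code points, so a combining mark still adds one.
console.log(countByReplace('e\u0301'));   // 2
```

Both approaches agree on well-formed strings; neither accounts for combining marks, so the count can still exceed the number of characters a reader sees.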
It also compares traditional for loops, which iterate over code units, with for...of, which iterates over Unicode code points. For example, const str = '👻yo𠮷'; for (const ch of str) { console.log(ch); } prints 👻, y, o, 𠮷.
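The difference between the two loops can be seen side by side in the following sketch. The array names units and points are chosen for this illustration only.

```javascript
const str = '👻yo𠮷';

// Index-based loop: visits every UTF-16 code unit, splitting surrogate pairs.
const units = [];
for (let i = 0; i < str.length; i++) {
  units.push(str.charCodeAt(i).toString(16));
}
console.log(units.length); // 6
console.log(units);        // [ 'd83d', 'dc7b', '79', '6f', 'd842', 'dfb7' ]

// for...of: the string iterator yields whole code points.
const points = [];
for (const ch of str) {
  points.push(ch);
}
console.log(points.length); // 4
console.log(points);        // [ '👻', 'y', 'o', '𠮷' ]
```

Indexing with str[i] inside the first loop would return lone surrogates for the emoji and the CJK character, which render as replacement glyphs rather than the intended characters.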
Other ES6 features that handle Unicode correctly are presented: spread syntax ( [...'💩'].length // 1 ), the u regex flag ( /^.$/.test('👻') // false vs /^.$/u.test('👻') // true ), String.prototype.codePointAt versus charCodeAt ( '😸'.charCodeAt(0) // 55357 vs '😸'.codePointAt(0) // 128568 ), and String.prototype.normalize() for canonical equivalence ( 'cafe\u0301'.normalize() === 'café'.normalize() // true ).
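The following sketch gathers these features into one runnable snippet. The variable names are assumptions for illustration; every method used is a standard built-in.

```javascript
// Without the u flag, "." matches a single code unit, so a surrogate pair fails.
const withoutFlag = /^.$/.test('👻');   // false
const withFlag = /^.$/u.test('👻');     // true

// charCodeAt reads one code unit; codePointAt decodes the whole pair.
const unit = '😸'.charCodeAt(0);        // 55357 (0xD83D, the high surrogate)
const point = '😸'.codePointAt(0);      // 128568 (0x1F638)
const roundTrip = String.fromCodePoint(point) === '😸'; // true

// Two encodings of the same visible text: "e" + combining acute vs precomposed "é".
const decomposed = 'cafe\u0301';
const composed = 'caf\u00E9';
const rawEqual = decomposed === composed;                                // false
const normalizedEqual = decomposed.normalize() === composed.normalize(); // true

console.log(withoutFlag, withFlag);
console.log(unit, point, roundTrip);
console.log(decomposed.length, composed.length); // 5 4
console.log(rawEqual, normalizedEqual);
```

Calling normalize() with no argument uses the NFC form, which composes the base letter and the combining mark into a single code point, so comparisons and length checks behave consistently after normalization.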
Finally, the author notes that the article is a personal learning note and includes a free book giveaway, but the technical content serves as a concise reference for Unicode handling in JavaScript.
Architecture Digest