
Understanding Unicode, UTF-16, and String Length Issues in JavaScript

This article explains why JavaScript string length behaves unexpectedly with Unicode characters, describes UTF‑16 encoding and surrogate pairs, and demonstrates ES6 techniques such as for‑of loops, spread syntax, the u regex flag, codePointAt, and normalize to handle Unicode correctly.

Architecture Digest

During development, the author encountered inconsistencies in JavaScript string length when dealing with Unicode characters such as emojis and rare CJK characters.

JavaScript strings are sequences of UTF‑16 code units, each 16 bits (two bytes) wide. Characters in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) take one code unit, while supplementary‑plane characters (U+10000 and above) are represented by a surrogate pair of two code units. Because length counts code units rather than characters, '𠮷'.length evaluates to 2.
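To make the surrogate-pair mechanics concrete, here is a minimal sketch that splits '𠮷' (U+20BB7) into its two code units and recombines them with the standard UTF‑16 formula. The variable names are illustrative and do not come from the original article.

```javascript
// '𠮷' is U+20BB7, which lies outside the BMP and needs a surrogate pair.
const rareChar = '𠮷';

// length counts UTF-16 code units, not characters.
const codeUnitCount = rareChar.length; // 2

// The two code units: a high surrogate followed by a low surrogate.
const highSurrogate = rareChar.charCodeAt(0); // 0xD842
const lowSurrogate = rareChar.charCodeAt(1);  // 0xDFB7

// Recombine the pair into a code point using the UTF-16 decoding formula.
const combined =
  (highSurrogate - 0xD800) * 0x400 + (lowSurrogate - 0xDC00) + 0x10000;

console.log(codeUnitCount);
console.log(highSurrogate.toString(16), lowSurrogate.toString(16));
console.log(combined.toString(16)); // 20bb7
```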

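The article shows how to count characters rather than code units by replacing each surrogate pair with a single placeholder before measuring, e.g. const spRegexp = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g; const val = str ? str.replace(spRegexp, '_').length : 0;. This yields the number of code points, so a base letter followed by a combining mark still counts as two.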
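The placeholder technique can be wrapped in a small helper, sketched below. The function name countCodePoints is hypothetical, and the sketch assumes well-formed input (lone surrogates are counted as one unit each).

```javascript
// Matches one high surrogate immediately followed by one low surrogate.
const surrogatePairRegexp = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

// Hypothetical helper: counts code points by collapsing each
// surrogate pair into a single placeholder character.
function countCodePoints(str) {
  if (!str) {
    return 0; // empty, null, or undefined input
  }
  return str.replace(surrogatePairRegexp, '_').length;
}

console.log(countCodePoints('👻yo𠮷')); // 4
console.log('👻yo𠮷'.length);           // 6
```

The helper counts code points only; a string such as 'cafe\u0301' still reports 5 because the combining accent is its own code point.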

It also compares traditional for loops, which iterate over UTF‑16 code units, with for...of, which iterates over Unicode code points. For example, var str = '👻yo𠮷'; for (const ch of str) { console.log(ch); } prints 👻, y, o, 𠮷, whereas an index-based loop over the same string visits six code units and splits each surrogate pair in half.
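The difference between the two loop styles can be seen side by side in the following sketch, which collects what each loop visits. The array names are illustrative.

```javascript
const sample = '👻yo𠮷';

// An index-based loop walks UTF-16 code units, so each emoji or rare
// character is seen as two separate halves of a surrogate pair.
const codeUnits = [];
for (let i = 0; i < sample.length; i++) {
  codeUnits.push(sample[i]);
}

// for...of uses the string iterator, which yields whole code points.
const codePoints = [];
for (const ch of sample) {
  codePoints.push(ch);
}

console.log(codeUnits.length); // 6
console.log(codePoints);       // [ '👻', 'y', 'o', '𠮷' ]
```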

Other ES6 features that handle Unicode correctly are presented: spread syntax ( [...'💩'].length // 1 ), the u regex flag ( /^.$/.test('👻') // false vs /^.$/u.test('👻') // true ), String.prototype.codePointAt versus charCodeAt ( '😸'.charCodeAt(0) // 55357 vs '😸'.codePointAt(0) // 128568 ), and String.prototype.normalize() for canonical equivalence ( 'cafe\u0301'.normalize() === 'café'.normalize() // true ).
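The features listed above can be exercised together in one short script. This is a sketch with illustrative variable names; the precomposed form of "café" is written with an explicit \u00E9 escape so the comparison does not depend on how the source file was saved.

```javascript
// Spread syntax uses the string iterator, so it splits by code point.
const spreadLength = [...'💩'].length; // 1

// Without the u flag, "." matches a single code unit, so a
// surrogate pair cannot match /^.$/. With the u flag it can.
const matchesWithoutU = /^.$/.test('👻'); // false
const matchesWithU = /^.$/u.test('👻');   // true

// charCodeAt returns the first code unit (the high surrogate);
// codePointAt returns the full code point.
const firstCodeUnit = '😸'.charCodeAt(0);   // 55357 (0xD83D)
const firstCodePoint = '😸'.codePointAt(0); // 128568 (0x1F638)

// Two canonically equivalent spellings of "café".
const decomposed = 'cafe\u0301';  // "e" followed by a combining acute accent
const precomposed = 'caf\u00E9';  // single precomposed "é"
const equalRaw = decomposed === precomposed; // false
const equalNormalized =
  decomposed.normalize() === precomposed.normalize(); // true (NFC by default)

console.log(spreadLength, matchesWithoutU, matchesWithU);
console.log(firstCodeUnit, firstCodePoint);
console.log(equalRaw, equalNormalized);
```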

Finally, the author notes that the article is a personal learning note and includes a free book giveaway, but the technical content serves as a concise reference for Unicode handling in JavaScript.

Tags: JavaScript, Unicode, programming fundamentals, UTF-16, ES6, string length
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
