Handling Unicode and Supplementary Characters in JavaScript
This article explains how JavaScript processes Unicode characters, demonstrates the limitations of legacy APIs like charCodeAt and fromCharCode with supplementary characters, and introduces the ES2015 methods codePointAt and fromCodePoint, along with Unicode escape syntax, surrogate pairs, and polyfills for full Unicode support.
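Working with Chinese text, emoji, and other non‑ASCII characters in JavaScript often means converting between strings and Unicode code points, and the built‑in APIs for doing so differ in how much of Unicode they cover.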
Early JavaScript provided String.prototype.charCodeAt and String.fromCharCode to convert strings to UTF‑16 code units and back. For example:
const str = '中文';
console.log([...str].map(char => char.charCodeAt(0))); // [20013, 25991]
These methods work for BMP characters but fail for supplementary characters. Consider the Mahjong tile "🀄":
const str = '🀄';
console.log(str.charCodeAt(0)); // 55356
console.log(String.fromCharCode(55356)); // �
Supplementary characters require two UTF‑16 code units. The correct conversion is:
const str = '🀄';
console.log(str.charCodeAt(0), str.charCodeAt(1)); // 55356 56324
console.log(String.fromCharCode(55356, 56324)); // 🀄
Unicode defines 17 planes of 65,536 code points each. Plane 0 (U+0000–U+FFFF) is the Basic Multilingual Plane (BMP); planes 1 to 16 (U+10000–U+10FFFF) contain the supplementary characters.
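Because every plane covers 0x10000 code points, the plane of a character can be read from the bits above the low 16. The following is a minimal sketch of that arithmetic; getPlane is a hypothetical helper name used only for illustration, not a built‑in.

```javascript
// Hypothetical helper (not a built-in): returns the Unicode plane (0-16)
// that a code point belongs to.
function getPlane(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Invalid code point: ' + codePoint);
  }
  // Each plane spans 0x10000 code points, so dropping the low 16 bits
  // leaves the plane number.
  return codePoint >> 16;
}

console.log(getPlane(0x4E2D));   // 0  (中, in the BMP)
console.log(getPlane(0x1F004));  // 1  (🀄, supplementary)
console.log(getPlane(0x10FFFF)); // 16 (last code point of the last plane)
```

Any code point whose plane is not 0 needs a surrogate pair in UTF‑16, which is why the legacy code‑unit APIs return only half of it.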
Since ES2015, JavaScript offers String.prototype.codePointAt and String.fromCodePoint which handle full code points:
const str = '🀄';
console.log(str.codePointAt(0)); // 126980
console.log(String.fromCodePoint(126980)); // 🀄
Unicode escape sequences use \uXXXX for BMP characters. A supplementary character needs either the ES2015 curly‑brace form \u{1F004} or a pair of \uXXXX escapes for its two surrogates (\uD83C\uDC04).
console.log('\u4e2d\u6587'); // 中文
console.log('\u{1F004}'); // 🀄
Supplementary characters are represented by surrogate pairs: a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF). The article provides a manual getCodePoint implementation using charCodeAt to decode surrogate pairs:
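function getCodePoint(str, idx = 0) {
  const code = str.charCodeAt(idx);
  // A high surrogate only forms a pair when a low surrogate follows it.
  if (code >= 0xD800 && code <= 0xDBFF) {
    const low = str.charCodeAt(idx + 1);
    if (low >= 0xDC00 && low <= 0xDFFF) {
      return ((code - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
    }
  }
  // BMP character or unpaired surrogate: return the code unit itself.
  return code;
}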
console.log(getCodePoint('中')); // 20013
console.log(getCodePoint('🀄')); // 126980
A corresponding fromCodePoint polyfill builds a string from code points, handling both BMP and supplementary ranges.
function fromCodePoint(...codePoints) {
  let str = '';
  for (let i = 0; i < codePoints.length; i++) {
    const cp = codePoints[i];
    if (cp <= 0xFFFF) {
      str += String.fromCharCode(cp);
    } else {
      let point = cp - 0x10000;
      const high = (point >> 10) + 0xD800;
      const low = (point % 0x400) + 0xDC00;
      str += String.fromCharCode(high) + String.fromCharCode(low);
    }
  }
  return str;
}
console.log(fromCodePoint(126980, 20013)); // 🀄中
To count Unicode characters correctly, one can spread the string or use a RegExp with the u flag:
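// Option 1: the string iterator yields one element per code point.
function getCodePointCount(str) {
  return [...str].length;
}
// Option 2: with the u flag, a RegExp matches whole code points.
// [\s\S] is used instead of . so that line terminators are counted too.
function getCodePointCountByRegExp(str) {
  const result = str.match(/[\s\S]/gu);
  return result ? result.length : 0;
}
The article also shows how UTF‑8 encodes Unicode characters and demonstrates extracting the original code point from UTF‑8 bytes.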
const buffer = Buffer.from('中'); // <Buffer e4 b8 ad>
const byte1 = parseInt('E4', 16); // 228
const byte2 = parseInt('B8', 16); // 184
const byte3 = parseInt('AD', 16); // 173
const codePoint = (byte1 & 0xf) << 12 | (byte2 & 0x3f) << 6 | (byte3 & 0x3f);
console.log(codePoint); // 20013
Finally, a UTF‑8 based getCodePoint implementation and its counterpart fromCodePoint are provided, illustrating how to handle Unicode without relying on ES2015 APIs.
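Those UTF‑8 based implementations are not reproduced in this text. What follows is only a minimal sketch of the underlying bit manipulation, generalizing the three‑byte example above to one‑ to four‑byte sequences. The names codePointToUtf8 and utf8ToCodePoint are hypothetical, and the sketch assumes well‑formed input: it does not reject surrogate code points, overlong forms, or malformed continuation bytes.

```javascript
// Hypothetical helper: encode a single code point as an array of UTF-8 bytes.
// Assumes 0 <= cp <= 0x10FFFF and that cp is not a surrogate.
function codePointToUtf8(cp) {
  if (cp <= 0x7F) {
    return [cp]; // 0xxxxxxx
  }
  if (cp <= 0x7FF) {
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]; // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xFFFF) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

// Hypothetical helper: decode the code point whose first byte is bytes[idx].
// Assumes well-formed UTF-8; continuation bytes are not validated.
function utf8ToCodePoint(bytes, idx = 0) {
  const b0 = bytes[idx];
  if (b0 < 0x80) {
    return b0;
  }
  if ((b0 & 0xE0) === 0xC0) {
    return ((b0 & 0x1F) << 6) | (bytes[idx + 1] & 0x3F);
  }
  if ((b0 & 0xF0) === 0xE0) {
    return ((b0 & 0x0F) << 12) |
      ((bytes[idx + 1] & 0x3F) << 6) |
      (bytes[idx + 2] & 0x3F);
  }
  return ((b0 & 0x07) << 18) |
    ((bytes[idx + 1] & 0x3F) << 12) |
    ((bytes[idx + 2] & 0x3F) << 6) |
    (bytes[idx + 3] & 0x3F);
}

console.log(codePointToUtf8(0x4E2D));                // [ 228, 184, 173 ]
console.log(codePointToUtf8(0x1F004));               // [ 240, 159, 128, 132 ]
console.log(utf8ToCodePoint([0xE4, 0xB8, 0xAD]));    // 20013
console.log(utf8ToCodePoint([0xF0, 0x9F, 0x80, 0x84])); // 126980
```

The leading byte alone determines how many continuation bytes follow, which is why the decoder only needs to test the high bits of bytes[idx] before masking out the payload bits.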
ByteFE
Cutting‑edge tech, article sharing, and practical insights from the ByteDance frontend team.