Handling Unicode and Supplementary Characters in JavaScript

This article explains how JavaScript processes Unicode characters, demonstrates the limitations of the legacy APIs charCodeAt and fromCharCode with supplementary characters, and covers the ES2015 methods codePointAt and fromCodePoint, the Unicode escape syntax, surrogate pairs, and polyfills for full Unicode support.

ByteFE

Feb 10, 2021

When dealing with Chinese and other non‑ASCII text in JavaScript, developers rely on the string APIs that convert between characters and their numeric Unicode values.

Early JavaScript provided String.prototype.charCodeAt and String.fromCharCode to convert strings to UTF‑16 code units and back. For example:

const str = '中文';
console.log([...str].map(char => char.charCodeAt(0))); // [20013, 25991]

These methods work for BMP characters but fail for supplementary characters. Consider the Mahjong tile "🀄":

const str = '🀄';
console.log(str.charCodeAt(0)); // 55356
console.log(String.fromCharCode(55356)); // �

Supplementary characters require two UTF‑16 code units. The correct conversion is:

const str = '🀄';
console.log(str.charCodeAt(0), str.charCodeAt(1)); // 55356 56324
console.log(String.fromCharCode(55356, 56324)); // 🀄

Unicode defines 17 planes of 65,536 code points each. Plane 0 (U+0000–U+FFFF) is the Basic Multilingual Plane (BMP); planes 1 to 16 (U+10000–U+10FFFF) contain the supplementary characters.
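
Because every plane spans 0x10000 code points, the plane number can be read from the bits above the low 16. A minimal sketch, assuming a hypothetical helper named getPlane that is not part of the original article:

```javascript
// Hypothetical helper: returns the Unicode plane (0-16) of a code point.
function getPlane(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError(`Invalid code point: ${codePoint}`);
  }
  // Each plane holds 0x10000 code points, so the plane is the value shifted right by 16 bits.
  return codePoint >> 16;
}

console.log(getPlane(0x4E2D));   // 0  (中, in the BMP)
console.log(getPlane(0x1F004));  // 1  (🀄, supplementary)
console.log(getPlane(0x10FFFF)); // 16 (last valid code point)
```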

Since ES2015, JavaScript offers String.prototype.codePointAt and String.fromCodePoint, which operate on full code points rather than on single code units:

const str = '🀄';
console.log(str.codePointAt(0)); // 126980
console.log(String.fromCodePoint(126980)); // 🀄
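
One detail worth keeping in mind: the index passed to codePointAt still counts UTF‑16 code units, so an index that lands in the middle of a pair returns the low surrogate. The sketch below illustrates this and collects code points with for...of, which iterates by code point; toCodePoints is a hypothetical helper name.

```javascript
const text = '🀄中';

console.log(text.length);         // 3 (two code units for 🀄, one for 中)
console.log(text.codePointAt(0)); // 126980
console.log(text.codePointAt(1)); // 56324 (the low surrogate of 🀄)
console.log(text.codePointAt(2)); // 20013

// Hypothetical helper: the string iterator yields whole code points,
// so each `ch` is either one BMP character or one surrogate pair.
function toCodePoints(str) {
  const out = [];
  for (const ch of str) {
    out.push(ch.codePointAt(0));
  }
  return out;
}

console.log(toCodePoints(text)); // [126980, 20013]
```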

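Unicode escape sequences use \uXXXX for BMP characters. The four‑digit form cannot express a supplementary character directly, so it has to be written either with the ES2015 curly‑brace form \u{1F004} or as an explicit surrogate pair such as \uD83C\uDC04.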

console.log('\u4e2d\u6587'); // 中文
console.log('\u{1F004}'); // 🀄
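
As a quick check, the curly‑brace escape and the surrogate‑pair escape produce the same string. The toEscape helper below is a hypothetical name, used only to show how the escape text can be derived from a code point.

```javascript
const braceForm = '\u{1F004}';
const pairForm = '\uD83C\uDC04';

console.log(braceForm === pairForm); // true
console.log(braceForm.length);       // 2 (still two UTF-16 code units)

// Hypothetical helper: builds the curly-brace escape text for a code point.
function toEscape(codePoint) {
  return '\\u{' + codePoint.toString(16).toUpperCase() + '}';
}

console.log(toEscape(126980)); // \u{1F004}
console.log(toEscape(20013));  // \u{4E2D}
```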

Supplementary characters are represented by surrogate pairs: a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF). Each surrogate carries 10 bits of the code point's offset from U+10000, so the pair can be decoded manually with charCodeAt:

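function getCodePoint(str, idx = 0) {
  const code = str.charCodeAt(idx);
  if (code >= 0xD800 && code <= 0xDBFF) {
    const high = code;
    const low = str.charCodeAt(idx + 1); // NaN when idx + 1 is past the end
    // Only combine when a real low surrogate follows; otherwise the high
    // surrogate is unpaired and is returned as-is, like codePointAt does.
    if (low >= 0xDC00 && low <= 0xDFFF) {
      return ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
    }
  }
  return code;
}
console.log(getCodePoint('中')); // 20013
console.log(getCodePoint('🀄')); // 126980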
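
Manual decoding like this only makes sense when the surrogates are correctly paired. A minimal sketch of a check for unpaired surrogates, using the ranges given above; hasLoneSurrogate is a hypothetical helper name, not something the original article defines.

```javascript
// Hypothetical helper: true when the string contains a high surrogate that is
// not followed by a low surrogate, or a low surrogate with no high surrogate before it.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      const next = str.charCodeAt(i + 1); // NaN at the end of the string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // well-formed pair, skip its low surrogate
        continue;
      }
      return true; // high surrogate without a partner
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return true; // low surrogate that did not follow a high surrogate
    }
  }
  return false;
}

console.log(hasLoneSurrogate('🀄中')); // false
console.log(hasLoneSurrogate('\uD83C')); // true
console.log(hasLoneSurrogate('\uDC04\uD83C')); // true
```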

A corresponding fromCodePoint polyfill builds a string from code points, handling both BMP and supplementary ranges.

function fromCodePoint(...codePoints) {
  let str = '';
  for (let i = 0; i < codePoints.length; i++) {
    const cp = codePoints[i];
    if (cp <= 0xFFFF) {
      // BMP: a single UTF-16 code unit is enough.
      str += String.fromCharCode(cp);
    } else {
      // Supplementary: split the 20-bit offset into two 10-bit halves.
      const point = cp - 0x10000;
      const high = (point >> 10) + 0xD800;   // upper 10 bits
      const low = (point % 0x400) + 0xDC00;  // lower 10 bits
      str += String.fromCharCode(high) + String.fromCharCode(low);
    }
  }
  return str;
}
console.log(fromCodePoint(126980, 20013)); // 🀄中
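
The split into a high and a low surrogate can also be looked at in isolation. The sketch below works through the arithmetic for U+1F004 and, unlike the polyfill above, rejects values outside the supplementary range; toSurrogatePair is a hypothetical helper name.

```javascript
// Hypothetical helper: returns [high, low] for a supplementary code point.
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a supplementary code point: ' + codePoint);
  }
  const offset = codePoint - 0x10000;    // 20-bit value, 0x0F004 for U+1F004
  const high = 0xD800 + (offset >> 10);  // upper 10 bits: 0x3C  -> 0xD83C
  const low = 0xDC00 + (offset & 0x3FF); // lower 10 bits: 0x004 -> 0xDC04
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F004);
console.log(high.toString(16), low.toString(16));     // d83c dc04
console.log(String.fromCharCode(high, low) === '🀄'); // true
```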

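To count Unicode code points correctly, one can either spread the string or use a RegExp with the u flag. The pattern uses [\s\S] rather than a dot, because a dot does not match line terminators and would undercount multi‑line strings:

// Option 1: the string iterator walks code points, not code units.
function getCodePointCountBySpread(str) {
  return [...str].length;
}
// Option 2: with the u flag, each match is one whole code point.
function getCodePointCountByRegExp(str) {
  const result = str.match(/[\s\S]/gu);
  return result ? result.length : 0;
}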
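
To make the difference from String.prototype.length concrete, here is a small sketch that counts with for...of, which avoids building an intermediate array. The name countCodePoints is hypothetical.

```javascript
// Hypothetical helper: counts code points by iterating the string.
function countCodePoints(str) {
  let count = 0;
  for (const _ch of str) {
    count++;
  }
  return count;
}

const sample = '🀄中\n';
console.log(sample.length);           // 4 (code units)
console.log(countCodePoints(sample)); // 3 (code points, newline included)
```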

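The article also shows how UTF‑8 encodes Unicode characters and demonstrates extracting the original code point from UTF‑8 bytes. A character in the range U+0800–U+FFFF, such as '中', takes three bytes of the form 1110xxxx 10xxxxxx 10xxxxxx, so the payload is the low 4 bits of the first byte and the low 6 bits of the other two.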

const buffer = Buffer.from('中'); // <Buffer e4 b8 ad> (new Buffer() is deprecated)
const byte1 = buffer[0]; // 0xE4 = 228
const byte2 = buffer[1]; // 0xB8 = 184
const byte3 = buffer[2]; // 0xAD = 173
// Strip the marker bits and reassemble the 4 + 6 + 6 payload bits.
const codePoint = ((byte1 & 0xf) << 12) | ((byte2 & 0x3f) << 6) | (byte3 & 0x3f);
console.log(codePoint); // 20013
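
The opposite direction, from a code point to its UTF‑8 bytes, follows the same bit layout. A minimal sketch covering the one to four byte forms; encodeUtf8 and toHex are hypothetical names, and this is an illustration rather than the article's own implementation.

```javascript
// Hypothetical helper: encodes one code point as an array of UTF-8 bytes.
function encodeUtf8(codePoint) {
  const isSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF || isSurrogate) {
    throw new RangeError('Not a Unicode scalar value: ' + codePoint);
  }
  if (codePoint < 0x80) {
    return [codePoint]; // 0xxxxxxx
  }
  if (codePoint < 0x800) {
    return [0xC0 | (codePoint >> 6), 0x80 | (codePoint & 0x3F)]; // 110xxxxx 10xxxxxx
  }
  if (codePoint < 0x10000) {
    return [ // 1110xxxx 10xxxxxx 10xxxxxx
      0xE0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3F),
      0x80 | (codePoint & 0x3F),
    ];
  }
  return [ // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    0xF0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3F),
    0x80 | ((codePoint >> 6) & 0x3F),
    0x80 | (codePoint & 0x3F),
  ];
}

const toHex = bytes => bytes.map(b => b.toString(16)).join(' ');
console.log(toHex(encodeUtf8(0x4E2D)));  // e4 b8 ad
console.log(toHex(encodeUtf8(0x1F004))); // f0 9f 80 84
```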

Finally, a UTF‑8 based getCodePoint implementation and its counterpart fromCodePoint are provided, illustrating how to handle Unicode without relying on ES2015 APIs.
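
Those implementations are not reproduced in this summary. As a rough illustration of the idea only, the sketch below decodes the first code point of a string from its UTF‑8 bytes. It assumes a runtime with a global TextEncoder (modern browsers and current Node.js versions), assumes well‑formed input without validating continuation bytes, and uses the hypothetical names decodeFirstCodePoint and getCodePointViaUtf8.

```javascript
// Hypothetical helper: decodes the first code point from well-formed UTF-8 bytes.
function decodeFirstCodePoint(bytes) {
  if (bytes.length === 0) return undefined;
  const b0 = bytes[0];
  if (b0 < 0x80) {
    return b0; // 1 byte: 0xxxxxxx
  }
  if (b0 >= 0xC2 && b0 <= 0xDF) {
    return ((b0 & 0x1F) << 6) | (bytes[1] & 0x3F); // 2 bytes
  }
  if (b0 >= 0xE0 && b0 <= 0xEF) {
    return ((b0 & 0x0F) << 12) | ((bytes[1] & 0x3F) << 6) | (bytes[2] & 0x3F); // 3 bytes
  }
  if (b0 >= 0xF0 && b0 <= 0xF4) {
    return ((b0 & 0x07) << 18) | ((bytes[1] & 0x3F) << 12) |
           ((bytes[2] & 0x3F) << 6) | (bytes[3] & 0x3F); // 4 bytes
  }
  throw new RangeError('Invalid UTF-8 leading byte: 0x' + b0.toString(16));
}

// Hypothetical helper: assumes TextEncoder is available as a global.
function getCodePointViaUtf8(str) {
  return decodeFirstCodePoint(new TextEncoder().encode(str));
}

console.log(getCodePointViaUtf8('中')); // 20013
console.log(getCodePointViaUtf8('🀄')); // 126980
```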
