Why Does '🐰'.length = 2? Unraveling Unicode, UTF-8, and JavaScript String Mysteries
This article explains the fundamentals of character sets and encodings (ASCII, GB2312, GBK, Unicode, UTF-8, UTF-16, and UCS-2), shows how they affect JavaScript string length and emoji handling, and covers common pitfalls such as combining characters, endianness, and BOM issues.
This article was authored by members of the ByteDance Education Adult & Innovation Front-End Team and is reproduced with permission.
Background
Common questions arise in daily development: why does '🐰'.length equal 2, what is the relationship between Unicode and UTF-8, how many bytes does a Chinese character occupy, why do we have big-endian and little-endian byte orders, why should UTF-8 files be stored without a BOM, and what are the mysterious "锟斤拷" and "烫烫烫" strings that appear when data is garbled?
Character Sets and Encodings
According to the Wikipedia definition, character encoding is the process of assigning numbers to graphical characters so they can be stored, transmitted, and transformed by computers. The numeric values are called "code points" and together form a code space, code page, or character map.
ASCII Character Set
ASCII (American Standard Code for Information Interchange) uses a single-byte scheme in which the most-significant bit is 0 and the remaining 7 bits represent 128 characters (0x00-0x7F). Control characters occupy 0-31 and 127, while printable characters occupy 32-126.
Extended ASCII (EASCII) also uses the high bit, giving 256 possible values, but this still cannot cover many Latin-based languages, which led to the ISO-8859 family (Latin-1, Latin-2, etc.).
Chinese Character Sets
Because Chinese, Japanese, and Korean require thousands of characters, double-byte character sets (DBCS) were introduced. The GB series (GB2312, GBK, GB18030) comprises the main Chinese encodings.
GB/T 2312
GB2312 defines 6,763 Chinese characters plus 682 Latin, Greek, and Japanese kana symbols. Characters are organized into 94 zones, each containing 94 positions. Each character occupies two bytes: a high byte derived from the zone number and a low byte derived from the position within the zone. Both bytes are offset by 0xA0 to avoid the ASCII range, resulting in the EUC-CN internal encoding.
For example, the zone-position code of "节" is 29-58, so its EUC-CN encoding is <0xBD, 0xDA> (29 + 0xA0 = 0xBD, 58 + 0xA0 = 0xDA).
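The zone-position arithmetic above can be sketched in a few lines of JavaScript (eucCN is an illustrative helper name, not a standard API):

```javascript
// Convert a GB2312 zone-position code to its EUC-CN byte pair.
// Both bytes are offset by 0xA0 to stay clear of the ASCII range.
function eucCN(zone, position) {
  return [zone + 0xA0, position + 0xA0];
}

const [high, low] = eucCN(29, 58);
console.log(high.toString(16), low.toString(16)); // bd da
```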
GBK / GB18030
GBK expands into the unused double-byte space of GB2312, encoding 23,940 code points. GB18030 further extends this to a variable-length 1/2/4-byte scheme whose four-byte space (roughly 1.6 million code points) is large enough to cover the entire Unicode range, adding support for additional CJK characters, minority scripts, and emoji.
Big5
Big5 is the encoding used in Taiwan for traditional Chinese characters.
Unicode Character Set
Unicode defines a single universal code space of code points U+0000-U+10FFFF, divided into 17 planes of 65,536 code points each. Plane 0 is the Basic Multilingual Plane (BMP), covering most common characters. Supplementary planes (1-16) contain less-used symbols, historic scripts, and emoji.
UTF-32
UTF-32 stores each code point in four bytes, which is simple but wasteful for ASCII-only text.
UTF-8
UTF-8 is a variable-length encoding using 1-4 bytes per character. The first byte indicates the total length (leading bits 0, 110, 1110, or 11110) and every continuation byte starts with 10. It is backward compatible with ASCII, self-synchronizing, and byte-order independent, so it needs no BOM.
Example encoding of the character "兔" (U+5154):
Code point: U+5154
Binary: 101000101010100
Grouped for the three-byte template (4 + 6 + 6 bits): 0101 000101 010100
Prefix with 1110 and continuation 10 bits → 11100101 10000101 10010100
Hexadecimal: E5 85 94

Example encoding of the emoji "🐰" (U+1F430):
Code point: U+1F430
Binary: 11111010000110000
Grouped for the four-byte template (3 + 6 + 6 + 6 bits): 000 011111 010000 110000
Prefix with 11110 and continuation 10 bits → 11110000 10011111 10010000 10110000
Hexadecimal: F0 9F 90 B0

In JavaScript, encodeURI('兔') yields %E5%85%94 and encodeURI('🐰') yields %F0%9F%90%B0.
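The bit manipulation above can be reproduced with a small hand-written encoder (utf8Bytes is an illustrative name; production code would normally use TextEncoder instead):

```javascript
// Encode a single code point into UTF-8 bytes by hand,
// following the 1/2/3/4-byte templates described above.
function utf8Bytes(cp) {
  if (cp < 0x80) return [cp]; // ASCII: one byte, high bit 0
  if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  if (cp < 0x10000)
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

const hex = (bytes) => bytes.map((b) => b.toString(16).toUpperCase()).join(' ');
console.log(hex(utf8Bytes(0x5154)));  // E5 85 94
console.log(hex(utf8Bytes(0x1F430))); // F0 9F 90 B0
```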
UTF-16
UTF-16 also uses a variable-length scheme, with 2-byte units for BMP characters and 4-byte surrogate pairs for supplementary characters. Surrogate pairs consist of a high surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF).
Algorithm for a code point > 0xFFFF:
H = Math.floor((c - 0x10000) / 0x400) + 0xD800;
L = (c - 0x10000) % 0x400 + 0xDC00;

Example for "🐰" (U+1F430): high surrogate 0xD83D, low surrogate 0xDC30 → byte sequence (big-endian) D8 3D DC 30.
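The formulas can be exercised directly, together with the inverse mapping (toSurrogates and fromSurrogates are illustrative names):

```javascript
// Split a supplementary-plane code point into a UTF-16 surrogate pair.
function toSurrogates(c) {
  const H = Math.floor((c - 0x10000) / 0x400) + 0xD800; // high surrogate
  const L = ((c - 0x10000) % 0x400) + 0xDC00;           // low surrogate
  return [H, L];
}

// Recombine a surrogate pair into the original code point.
function fromSurrogates(H, L) {
  return (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000;
}

const [H, L] = toSurrogates(0x1F430);
console.log(H.toString(16), L.toString(16));    // d83d dc30
console.log(fromSurrogates(H, L).toString(16)); // 1f430
```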
UCS-2
UCS-2 is the predecessor of UTF-16, using a fixed 2-byte representation and covering only BMP characters. UTF-16 is a superset of UCS-2.
Big-Endian and Little-Endian
When storing UTF-16 or UTF-32 data, byte order matters. Big-endian stores the most-significant byte first, while little-endian stores the least-significant byte first. The Byte Order Mark (BOM) is the code point U+FEFF placed at the start of the data: it serializes as FE FF in big-endian and FF FE in little-endian, letting a reader detect the order.
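In JavaScript, DataView makes the two byte orders explicit; the sketch below writes the BOM code point U+FEFF both ways and inspects the resulting bytes:

```javascript
// Write U+FEFF as a 16-bit unit in both byte orders and inspect the bytes.
const buf = new ArrayBuffer(2);
const view = new DataView(buf);

view.setUint16(0, 0xFEFF, false); // big-endian (the default)
console.log([...new Uint8Array(buf)].map((b) => b.toString(16))); // ['fe', 'ff']

view.setUint16(0, 0xFEFF, true); // little-endian
console.log([...new Uint8Array(buf)].map((b) => b.toString(16))); // ['ff', 'fe']
```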
JavaScript and Unicode
Schrödinger's length
JavaScript strings are stored as sequences of UTF-16 code units. The length property counts code units, not user-perceived characters. Characters outside the BMP (e.g., "🐰") are represented by a surrogate pair and therefore have length === 2.
var s = "🐰";
s.length // 2
s.charAt(0) // '\uD83D'
s.charAt(1) // '\uDC30'

String Handling in ES6
ES6 introduces helpers that handle Unicode code points correctly: [...str].length or Array.from(str).length gives the code-point count (surrogate pairs counted once). String.fromCodePoint() creates a string from a code point, and String.prototype.codePointAt() returns the code point at a given position.
The u flag in regular expressions enables full Unicode support.
Composite Characters and Length
Some emoji are formed from multiple code points joined with a Zero-Width Joiner (ZWJ, U+200D). For example, the family emoji "👨‍👩‍👧‍👦" consists of four base emoji plus three ZWJ characters; each base emoji is a surrogate pair (2 code units) and each ZWJ is 1 code unit, giving length === 11.
Combining characters (e.g., "e" followed by U+0301, the combining acute accent) are also represented by multiple code points. String.prototype.normalize() can convert such a sequence to a canonical composed form, but it does not collapse ZWJ sequences.
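A minimal sketch of both cases:

```javascript
const composed = '\u00E9';    // 'é' as a single precomposed code point
const decomposed = 'e\u0301'; // 'e' + combining acute accent

console.log(decomposed.length);                        // 2
console.log(composed === decomposed);                  // false
console.log(composed === decomposed.normalize('NFC')); // true

// normalize() leaves ZWJ sequences such as family emoji untouched:
const family = '👨\u200D👩\u200D👧\u200D👦';
console.log(family.normalize('NFC').length); // still 11
```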
Practical Pitfalls
When setting maxlength on an <input>, browsers differ: Chrome counts UTF-16 code units, while Safari counts grapheme clusters. Libraries such as grapheme-splitter can provide consistent results.
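In modern runtimes, Intl.Segmenter (available in recent browsers and Node.js) can count grapheme clusters without a library; treat this as a sketch, since runtime support and ICU data vary:

```javascript
// Count user-perceived characters (grapheme clusters) in a string.
function graphemeCount(str) {
  const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
  return [...seg.segment(str)].length;
}

console.log('🐰'.length);        // 2 code units
console.log(graphemeCount('🐰')); // 1 grapheme
```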
Regular expressions that need to match letters with diacritics can use Unicode property escapes (\p{L}, \p{M}, which require the u flag) or the XRegExp library for broader compatibility.
Databases must use utf8mb4 (or equivalent) to store emojis correctly.
Conclusion
Unicode encoding can be viewed in four layers:
Abstract Character Repertoire (the set of characters to encode).
Coded Character Set (mapping characters to code points).
Character Encoding Form (mapping code points to code units, e.g., UTF-8, UTF-16).
Character Encoding Scheme (mapping code units to byte sequences, handling endianness and BOM).
Understanding these layers helps avoid common bugs related to string length, emoji handling, and data storage.
