Why Does '🐰'.length = 2? Unraveling Unicode, UTF-8, and JavaScript String Mysteries
This article explains the fundamentals of character sets and encodings (ASCII, GB2312, GBK, Unicode, UTF-8, UTF-16, and UCS-2), shows how they affect JavaScript string length and emoji handling, and covers common pitfalls such as combining characters, endianness, and BOM issues.
This article was authored by members of the ByteDance Education Adult & Innovation Front-End Team and is reproduced with permission.
Background
Common questions arise in daily development: why does '🐰'.length equal 2, what is the relationship between Unicode and UTF-8, how many bytes does a Chinese character occupy, why do we have big-endian and little-endian byte orders, why should UTF-8 files be stored without a BOM, and what are the mysterious "锟斤拷" and "烫烫烫" strings that appear when data is garbled?
Character Sets and Encodings
According to the Wikipedia definition, character encoding is the process of assigning numbers to graphical characters so they can be stored, transmitted, and transformed by computers. The numeric values are called "code points" and together form a code space, code page, or character map.
ASCII Character Set
ASCII (American Standard Code for Information Interchange) uses a single-byte scheme in which the most-significant bit is 0 and the remaining 7 bits represent 128 characters (0x00-0x7F). Control characters occupy 0-31 and 127, while printable characters occupy 32-126.
Extended ASCII (EASCII) also uses the high bit, giving 256 possible values, but this still cannot cover many Latin-based languages, which led to the ISO-8859 family (Latin-1, Latin-2, etc.).
Chinese Character Sets
Because Chinese, Japanese, and Korean require thousands of characters, double-byte character sets (DBCS) were introduced. The GB series (GB2312, GBK, GB18030) comprises the main Chinese encodings.
GB/T 2312
GB2312 defines 6,763 Chinese characters plus 682 Latin, Greek, and Japanese kana symbols. Characters are organized into 94 zones, each containing 94 positions. Each character occupies two bytes: a high byte derived from the zone number and a low byte derived from the position within the zone. Both bytes are offset by 0xA0 to avoid the ASCII range, resulting in the EUC-CN internal encoding.
For example, the zone-position code of "节" is 29-58, so its EUC-CN encoding is <0xBD, 0xDA> (29 + 0xA0 = 0xBD, 58 + 0xA0 = 0xDA).
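The zone-position arithmetic above can be sketched in a few lines of JavaScript (eucCN is an illustrative helper name, not a standard API):

```javascript
// Convert a GB2312 zone-position code to its EUC-CN byte pair.
// Both bytes are offset by 0xA0 to stay clear of the ASCII range.
function eucCN(zone, position) {
  return [zone + 0xA0, position + 0xA0];
}

const [high, low] = eucCN(29, 58);
console.log(high.toString(16), low.toString(16)); // bd da
```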
GBK / GB18030
GBK expands into the unused double-byte space of GB2312, encoding 23,940 code points. GB18030 further extends this to a variable-length 1/2/4-byte scheme whose four-byte space (roughly 1.6 million code points) is large enough to cover the entire Unicode range, adding support for additional CJK characters, minority scripts, and emoji.
Big5
Big5 is the encoding used in Taiwan for traditional Chinese characters.
Unicode Character Set
Unicode defines a single universal code space of code points U+0000-U+10FFFF, divided into 17 planes of 65,536 code points each. Plane 0 is the Basic Multilingual Plane (BMP), covering most common characters. Supplementary planes (1-16) contain less-used symbols, historic scripts, and emoji.
UTF-32
UTF-32 stores each code point in four bytes, which is simple but wasteful for ASCII-only text.
UTF-8
UTF-8 is a variable-length encoding using 1-4 bytes per character. The first byte indicates the total length (leading bits 0, 110, 1110, or 11110) and every continuation byte starts with 10. It is backward compatible with ASCII, self-synchronizing, and byte-order independent, so it needs no BOM.
Example encoding of the character "兔" (U+5154):
Code point: U+5154
Binary: 101000101010100
Grouped for the three-byte template (4 + 6 + 6 bits): 0101 000101 010100
Prefix with 1110 and continuation 10 bits → 11100101 10000101 10010100
Hexadecimal: E5 85 94

Example encoding of the emoji "🐰" (U+1F430):
Code point: U+1F430
Binary: 11111010000110000
Grouped for the four-byte template (3 + 6 + 6 + 6 bits): 000 011111 010000 110000
Prefix with 11110 and continuation 10 bits → 11110000 10011111 10010000 10110000
Hexadecimal: F0 9F 90 B0

In JavaScript, encodeURI('兔') yields %E5%85%94 and encodeURI('🐰') yields %F0%9F%90%B0.
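The bit manipulation above can be reproduced with a small hand-written encoder (utf8Bytes is an illustrative name; production code would normally use TextEncoder instead):

```javascript
// Encode a single code point into UTF-8 bytes by hand,
// following the 1/2/3/4-byte templates described above.
function utf8Bytes(cp) {
  if (cp < 0x80) return [cp]; // ASCII: one byte, high bit 0
  if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  if (cp < 0x10000)
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

const hex = (bytes) => bytes.map((b) => b.toString(16).toUpperCase()).join(' ');
console.log(hex(utf8Bytes(0x5154)));  // E5 85 94
console.log(hex(utf8Bytes(0x1F430))); // F0 9F 90 B0
```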
UTF-16
UTF-16 also uses a variable-length scheme, with 2-byte units for BMP characters and 4-byte surrogate pairs for supplementary characters. Surrogate pairs consist of a high surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF).
Algorithm for a code point > 0xFFFF:
H = Math.floor((c - 0x10000) / 0x400) + 0xD800;
L = (c - 0x10000) % 0x400 + 0xDC00;

Example for "🐰" (U+1F430): high surrogate 0xD83D, low surrogate 0xDC30 → byte sequence (big-endian) D8 3D DC 30.
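The formulas can be exercised directly, together with the inverse mapping (toSurrogates and fromSurrogates are illustrative names):

```javascript
// Split a supplementary-plane code point into a UTF-16 surrogate pair.
function toSurrogates(c) {
  const H = Math.floor((c - 0x10000) / 0x400) + 0xD800; // high surrogate
  const L = ((c - 0x10000) % 0x400) + 0xDC00;           // low surrogate
  return [H, L];
}

// Recombine a surrogate pair into the original code point.
function fromSurrogates(H, L) {
  return (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000;
}

const [H, L] = toSurrogates(0x1F430);
console.log(H.toString(16), L.toString(16));    // d83d dc30
console.log(fromSurrogates(H, L).toString(16)); // 1f430
```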
UCS-2
UCS-2 is the predecessor of UTF-16, using a fixed 2-byte representation and covering only BMP characters. UTF-16 is a superset of UCS-2.
Big-Endian and Little-Endian
When storing UTF-16 or UTF-32 data, byte order matters. Big-endian stores the most-significant byte first, while little-endian stores the least-significant byte first. The Byte Order Mark (BOM) is the code point U+FEFF placed at the start of the data: it serializes as FE FF in big-endian and FF FE in little-endian, letting a reader detect the order.
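In JavaScript, DataView makes the two byte orders explicit; the sketch below writes the BOM code point U+FEFF both ways and inspects the resulting bytes:

```javascript
// Write U+FEFF as a 16-bit unit in both byte orders and inspect the bytes.
const buf = new ArrayBuffer(2);
const view = new DataView(buf);

view.setUint16(0, 0xFEFF, false); // big-endian (the default)
console.log([...new Uint8Array(buf)].map((b) => b.toString(16))); // ['fe', 'ff']

view.setUint16(0, 0xFEFF, true); // little-endian
console.log([...new Uint8Array(buf)].map((b) => b.toString(16))); // ['ff', 'fe']
```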
JavaScript and Unicode
Schrödinger's length
JavaScript strings are stored as sequences of UTF-16 code units. The length property counts code units, not user-perceived characters. Characters outside the BMP (e.g., "🐰") are represented by a surrogate pair and therefore have length === 2.
var s = "🐰";
s.length // 2
s.charAt(0) // '\uD83D'
s.charAt(1) // '\uDC30'

String Handling in ES6
ES6 introduces helpers that handle Unicode code points correctly: [...str].length or Array.from(str).length gives the code-point count (surrogate pairs counted once). String.fromCodePoint() creates a string from a code point, and String.prototype.codePointAt() returns the code point at a given position.
The u flag in regular expressions enables full Unicode support.
Composite Characters and Length
Some emoji are formed from multiple code points joined with a Zero-Width Joiner (ZWJ, U+200D). For example, the family emoji "👨‍👩‍👧‍👦" consists of four base emoji plus three ZWJ characters; each base emoji is a surrogate pair (2 code units) and each ZWJ is 1 code unit, giving length === 11.
Combining characters (e.g., "e" followed by U+0301, the combining acute accent) are also represented by multiple code points. String.prototype.normalize() can convert such a sequence to a canonical composed form, but it does not collapse ZWJ sequences.
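A minimal sketch of both cases:

```javascript
const composed = '\u00E9';    // 'é' as a single precomposed code point
const decomposed = 'e\u0301'; // 'e' + combining acute accent

console.log(decomposed.length);                        // 2
console.log(composed === decomposed);                  // false
console.log(composed === decomposed.normalize('NFC')); // true

// normalize() leaves ZWJ sequences such as family emoji untouched:
const family = '👨\u200D👩\u200D👧\u200D👦';
console.log(family.normalize('NFC').length); // still 11
```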
Practical Pitfalls
When setting maxlength on an <input>, browsers differ: Chrome counts UTF-16 code units, while Safari counts grapheme clusters. Libraries such as grapheme-splitter can provide consistent results.
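In modern runtimes, Intl.Segmenter (available in recent browsers and Node.js) can count grapheme clusters without a library; treat this as a sketch, since runtime support and ICU data vary:

```javascript
// Count user-perceived characters (grapheme clusters) in a string.
function graphemeCount(str) {
  const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
  return [...seg.segment(str)].length;
}

console.log('🐰'.length);        // 2 code units
console.log(graphemeCount('🐰')); // 1 grapheme
```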
Regular expressions that need to match letters with diacritics can use Unicode property escapes (\p{L}, \p{M}, which require the u flag) or the XRegExp library for broader compatibility.
Databases must use utf8mb4 (or equivalent) to store emojis correctly.
Conclusion
Unicode encoding can be viewed in four layers:
Abstract Character Repertoire (the set of characters to encode).
Coded Character Set (mapping characters to code points).
Character Encoding Form (mapping code points to code units, e.g., UTF-8, UTF-16).
Character Encoding Scheme (mapping code units to byte sequences, handling endianness and BOM).
Understanding these layers helps avoid common bugs related to string length, emoji handling, and data storage.
