Understanding Character Encoding: Bits, Bytes, Unicode, UTF-8, UTF-16, and UTF-32
This article explains the origins of character sets, traces the relationships among encodings such as ASCII, GB2312, GBK, GB18030, Unicode, UTF-8, UTF-16, and UTF-32, and shows how JavaScript handles Unicode and emoji, including practical code examples and a solution for length-limited input fields.
When encountering garbled text or unexpected emoji length in a limited input field, many developers wonder why this happens, what encoding JavaScript uses, and how different character sets relate to each other.
Basic concepts: a bit is the smallest storage unit (0 or 1), a byte consists of 8 bits, and a character is an abstract entity represented by various character sets or code pages.
In computers all information is ultimately a binary sequence. One byte can represent 256 different states, allowing a direct mapping between each 8‑bit value and a symbol.
Historical Chinese encodings: ASCII was created in the 1960s for English. China extended ASCII with GB2312, later expanded it to GBK (adding about 20,000 characters), and finally published GB18030 to include minority scripts.
Unified character set: the ISO and the Unicode Consortium collaborated to create a universal character set (UCS). Unicode assigns a unique code point to each symbol; the latest version defines over 109,000 symbols across 17 planes.
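As a quick illustration (not from the original article), JavaScript exposes code points directly via the standard `codePointAt` method:

```javascript
// A code point uniquely identifies a symbol, independent of any encoding.
const s = "好"; // U+597D, the CJK character for "good"
console.log(s.codePointAt(0).toString(16)); // "597d"
console.log("A".codePointAt(0));            // 65, the same value ASCII assigns to "A"
```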
Encoding methods:
UCS‑4 uses 32 bits per code point (0–0x7FFFFFFF), but only values up to 0x10FFFF are used.
UTF‑32 is a subset of UCS‑4 limited to 0–0x10FFFF.
UTF‑16 encodes basic‑plane characters with 2 bytes and supplementary‑plane characters with 4 bytes (surrogate pairs).
UCS‑2 is an older 2‑byte fixed‑width encoding; JavaScript originally used UCS‑2 before adopting UTF‑16.
UTF‑8 encodes ASCII characters in a single byte and uses multi‑byte sequences for other symbols, making it bandwidth‑efficient.
Example conversion of a basic-plane code point:

U+597D => 0x597D

Conversion of a supplementary-plane code point (illustrated with JavaScript calculations):

H = Math.floor((c - 0x10000) / 0x400) + 0xD800
L = (c - 0x10000) % 0x400 + 0xDC00

In JavaScript, emoji are represented as surrogate pairs (UCS-2/UTF-16). When a length-limited textarea cuts an emoji in half, the result appears as garbled text.
To detect emoji (or any supplementary‑plane character) you can use a regular expression:
var patt = /[\ud800-\udbff][\udc00-\udfff]/g;

Solution: avoid relying on the native maxlength attribute for textareas; instead, validate input length with JavaScript, checking whether the cut point falls inside an emoji surrogate pair and truncating before the emoji if necessary.
Summary : Unicode defines the character set, while UTF‑8, UTF‑16, UTF‑32, UCS‑2, and UCS‑4 are encoding schemes. Understanding their differences helps prevent encoding‑related bugs such as emoji truncation.
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.