A Comprehensive History and Overview of Character Encoding and Unicode
The article traces character encoding from early telegraph and Morse code through ASCII, ISO national sets and Chinese standards, explains Unicode’s unification and its UTF‑8/‑16/‑32 forms, and shows how modern languages—especially JavaScript—handle code points, highlighting the cultural and technical significance for developers.
Every programmer should understand character encoding. After grasping the basic concepts, one can better comprehend programming languages and character handling. This article presents a detailed exploration of the evolution of character encoding, from early telegraph codes to modern Unicode and its implementation in JavaScript.
1. Origin
While researching Babel's source code, I encountered a snippet from Acorn's lexer that manipulates character codes. This led me to investigate the historical background of character encoding.
In the early days of emojis, developers noticed that some emojis have a length of 2 in JavaScript, and the infamous "𠮷" character sparked further curiosity.
2. Early History of Character Encoding
(1) The Telegraph Era
Before the invention of the telegraph, long‑distance communication relied on couriers, carrier pigeons, or signal fires, all of which were costly. In the 18th century, researchers began studying electricity and its potential for transmitting messages.
(2) Morse Code – The First Digital Communication
Samuel Morse, working with Alfred Vail, developed Morse code in the late 1830s; the first commercial Morse telegraph line (Washington–Baltimore) opened in 1844. Morse code uses dots and dashes to represent characters, and a lookup table decodes the signals.
The earliest Chinese telegraph codes used four‑digit numbers (0001‑9999), giving room for nearly ten thousand characters, digits, and symbols.
(3) ASCII – The Birth of Computer Character Encoding
In 1946 the ENIAC computer was completed, but it had no notion of character data. In 1963 the American Standards Association (ASA, later ANSI) released the ASCII encoding scheme.
All data in computers is stored and processed as binary. ASCII proper defines 128 characters using 7‑bit codes, covering English letters, digits, punctuation, and control characters; later 8‑bit extensions added another 128 positions.
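Because Unicode's first 128 code points coincide with ASCII, the mapping is easy to inspect in JavaScript, a quick sketch:

```javascript
// Each ASCII character maps to a small integer: 'H' is 72, 'i' is 105, '!' is 33.
// All of them fit in the 7-bit range 0-127.
const codes = Array.from('Hi!', ch => ch.charCodeAt(0));
console.log(codes); // [72, 105, 33]
```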
3. The Chaotic Era of Non‑English Character Sets
(1) ISO/IEC 646 – The 7‑Bit Struggle
ISO/IEC 646 (1972) allowed national variants to replace certain ASCII symbols with local diacritics, e.g., apostrophe for acute accent, backquote for grave accent, double quote for diaeresis, caret for circumflex, swung dash for tilde, and comma for cedilla.
Because only 7 bits were available, many countries had to replace standard ASCII symbols with their own versions.
(2) ISO 2022 – An 8‑Bit Compatibility Scheme
ISO 2022 (based on ECMA‑35, 1971) introduced a framework for 7‑bit and 8‑bit character sets, defining control character groups (C0, C1) and graphic character groups (G0‑G3). It enabled 94×94 double‑byte sets for CJK languages.
(3) ISO/IEC 8859 – Latin‑Based 8‑Bit Sets
From 1985 onward, ISO/IEC 8859 parts 1‑16 standardized 8‑bit extensions of ASCII for various Latin alphabets, Cyrillic, Greek, etc.
4. Chinese Character Sets
GB2312 (GB/T 2312‑80) defines 6,763 Chinese characters plus Latin, Greek, Cyrillic, and Japanese kana, organized into 94 zones of 94 characters each. Zones 01‑09 contain symbols, digits, and Latin letters; zones 16‑55 hold the 3,755 common characters; zones 56‑87 hold the 3,008 less‑common characters.
To avoid conflict with ASCII control characters, GB2312 adds 0xA0 to each byte, resulting in the so‑called “inner code”.
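The zone/position ("qu/wei") to inner‑code conversion is a simple offset; a minimal sketch (the helper name is illustrative):

```javascript
// GB2312 inner code: add 0xA0 to the zone number and to the position number,
// which pushes both bytes above the ASCII range and avoids control characters.
function gb2312InnerCode(zone, position) {
  return [zone + 0xa0, position + 0xa0];
}

// "啊" sits at zone 16, position 1, so its inner code is 0xB0 0xA1.
const [hi, lo] = gb2312InnerCode(16, 1);
console.log(hi.toString(16), lo.toString(16)); // "b0" "a1"
```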
5. The Unification Era – Unicode and ISO 10646
ISO 10646 initially defined UCS‑4, a 4‑byte code space (originally limited to 31 bits) capable of representing over two billion code points. However, hardware manufacturers preferred a 16‑bit solution. The Unicode consortium (originating from Xerox, Apple, Sun, Microsoft, etc.) proposed a 16‑bit encoding (UCS‑2) that later aligned with ISO 10646‑1 (1993).
Unicode now defines multiple planes. Plane 0 (Basic Multilingual Plane, BMP) covers U+0000‑U+FFFF. Supplementary planes extend up to U+10FFFF, requiring surrogate pairs in UTF‑16.
Unicode’s adoption has not been without controversy: merging variant glyphs, inclusion of obscure or erroneous characters, and the political implications of a universal script.
6. Unicode Implementation – UTF Transformations
UTF‑32
UTF‑32 uses a fixed 32‑bit integer for each code point, allowing constant‑time indexing but wasting space (four bytes per character).
UTF‑16
UTF‑16 encodes characters as one or two 16‑bit code units. Characters outside the BMP are represented by surrogate pairs: a high surrogate (0xD800‑0xDBFF) and a low surrogate (0xDC00‑0xDFFF). The conversion formula is:
codePoint = (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000
UTF‑16 is used by Windows APIs, Java, JavaScript, and many other platforms.
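The surrogate‑pair formula can be checked directly in code, a small sketch (the function name is illustrative):

```javascript
// Decode a UTF-16 surrogate pair back to its code point:
// subtract each surrogate's base, recombine the 10-bit halves, re-add 0x10000.
function fromSurrogatePair(high, low) {
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

// U+1F602 (😂) is stored in UTF-16 as the pair 0xD83D 0xDE02.
console.log(fromSurrogatePair(0xd83d, 0xde02).toString(16)); // "1f602"
```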
UTF‑8
UTF‑8 is a variable‑length encoding (1‑4 bytes) that is backward compatible with ASCII. It is now the dominant encoding for web pages, emails, and most modern software.
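The variable length is easy to observe with the standard `TextEncoder` API, a quick sketch:

```javascript
// UTF-8 length grows with the code point: ASCII stays at 1 byte,
// Latin letters with diacritics take 2, most CJK takes 3, emoji take 4.
const enc = new TextEncoder();
const lengths = ['A', 'é', '中', '😂'].map(ch => enc.encode(ch).length);
console.log(lengths); // [1, 2, 3, 4]
```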
7. How JavaScript Handles Characters
ECMAScript 5.1 states that implementations must conform to Unicode 3.0+ (or ISO 10646‑1) using either UCS‑2 or UTF‑16, defaulting to UTF‑16. Consequently, characters in the BMP have a length of 1, while characters in supplementary planes (e.g., 😂 U+1F602, 𠮷 U+20BB7) have a length of 2.
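The length‑of‑2 behavior, and the code‑point‑aware alternatives added in ES2015, can be seen directly:

```javascript
const s = '😂'; // U+1F602, outside the BMP
console.log(s.length);                      // 2 -- counts UTF-16 code units
console.log([...s].length);                 // 1 -- iteration is code-point aware
console.log(s.charCodeAt(0).toString(16));  // "d83d" -- only the high surrogate
console.log(s.codePointAt(0).toString(16)); // "1f602" -- the full code point
```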
Example: the Acorn lexer code that calculates a full character code point:
pp.fullCharCodeAtPos = function() {
  let code = this.input.charCodeAt(this.pos);
  // Not a high surrogate (or a lone low surrogate): return the unit as-is.
  if (code <= 0xd7ff || code >= 0xdc00) return code;
  let next = this.input.charCodeAt(this.pos + 1);
  // 0x35fdc00 === (0xd800 << 10) + 0xdc00 - 0x10000, so this is the standard
  // surrogate-pair decoding collapsed into a single expression.
  return next <= 0xdbff || next >= 0xe000 ? code : (code << 10) + next - 0x35fdc00;
};

Polyfill for String.prototype.codePointAt (Mathias Bynens):
if (!String.prototype.codePointAt) {
  (function() {
    'use strict';
    var codePointAt = function(position) {
      if (this == null) { throw TypeError(); }
      var string = String(this);
      var size = string.length;
      var index = Number(position) || 0;
      if (index != index) { index = 0; }
      if (index < 0 || index >= size) { return undefined; }
      var first = string.charCodeAt(index);
      var second;
      if (first >= 0xD800 && first <= 0xDBFF && size > index + 1) {
        second = string.charCodeAt(index + 1);
        if (second >= 0xDC00 && second <= 0xDFFF) {
          return (first - 0xD800) * 0x400 + second - 0xDC00 + 0x10000;
        }
      }
      return first;
    };
    if (Object.defineProperty) {
      Object.defineProperty(String.prototype, 'codePointAt', { value: codePointAt, configurable: true, writable: true });
    } else {
      String.prototype.codePointAt = codePointAt;
    }
  }());
}

Modern languages such as Go and Rust use UTF‑8 for their native string types, avoiding the surrogate‑pair complications.
8. Conclusion
Character encoding intertwines technology, culture, and politics. Understanding its history helps developers navigate the quirks of Unicode, UTF‑8, UTF‑16, and related standards, and write robust, internationalized code.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.