The Mystery of Character Encoding: Unicode, UTF‑8, UTF‑16, GBK and Emoji
This article explains the fundamentals of character encoding, covering Unicode’s universal character set, the structure of its planes and surrogate areas, the variable‑length UTF‑8 and UTF‑16 encodings, the Chinese‑specific GBK encoding, and practical iOS code examples for handling Unicode, emoji, and regular‑expression‑based Chinese character detection.
The article begins by introducing common character sets and encodings such as Unicode, UTF‑8, and GBK, along with emoji, and explains why a unified encoding scheme is needed beyond the original 128‑character ASCII.
It then describes the early ASCII encoding (byte → ASCII) and the limitations that led to national encodings such as GB2312, which eventually gave way to the universal Unicode standard. Unicode maps characters to code points ranging from 0 to 0x10FFFF, giving a total of 1,114,112 possible code points.
The article explains that Unicode is organized into 17 planes, each containing 65,536 code points. Plane 0 is the Basic Multilingual Plane (BMP) and contains most common characters, while code points in planes 1‑16 are represented in UTF‑16 by surrogate pairs. The surrogate range (0xD800‑0xDFFF) lies inside the BMP and is reserved for this purpose: a lead surrogate from 0xD800‑0xDBFF followed by a trail surrogate from 0xDC00‑0xDFFF together represent one character outside the BMP.
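The plane layout and the surrogate split can be sketched in a few lines of plain C, which also compiles unchanged inside an Objective‑C source file. This is a minimal illustration rather than code from the article; the names `plane_of`, `is_surrogate`, and `to_surrogate_pair` are hypothetical helpers introduced here.

```c
#include <assert.h>
#include <stdint.h>

#define UNICODE_MAX_CODE_POINT 0x10FFFFu
#define PLANE_SIZE 0x10000u /* 65,536 code points per plane */

/* Plane index (0..16) of a code point; plane 0 is the BMP. */
unsigned plane_of(uint32_t cp) {
    assert(cp <= UNICODE_MAX_CODE_POINT);
    return (unsigned)(cp >> 16);
}

/* Nonzero if cp lies in the range reserved for UTF-16 surrogates. */
int is_surrogate(uint32_t cp) {
    return cp >= 0xD800u && cp <= 0xDFFFu;
}

/* Split a code point from planes 1-16 into a UTF-16 surrogate pair. */
void to_surrogate_pair(uint32_t cp, uint16_t *lead, uint16_t *trail) {
    assert(cp >= PLANE_SIZE && cp <= UNICODE_MAX_CODE_POINT);
    uint32_t offset = cp - PLANE_SIZE;                /* 20-bit value */
    *lead  = (uint16_t)(0xD800u + (offset >> 10));    /* upper 10 bits */
    *trail = (uint16_t)(0xDC00u + (offset & 0x3FFu)); /* lower 10 bits */
}
```

With these helpers, `plane_of(0x6C49)` is 0 (the BMP), `plane_of(0x1F603)` is 1, and `to_surrogate_pair(0x1F603, ...)` yields the pair D83D DE03. The 17 planes of 65,536 code points account exactly for the 1,114,112 code points from 0 to 0x10FFFF.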
It details how UTF‑8 encodes Unicode code points using 1‑4 bytes, showing the example of the Chinese character “汉” (U+6C49) encoded as the three‑byte sequence E6 B1 89. A regular‑expression snippet for detecting Chinese characters is provided:
// MATCHES tests the whole string, so "+" is needed for names longer than one character.
NSPredicate *predicate = [NSPredicate predicateWithFormat:@"SELF MATCHES %@", @"[\u4e00-\u9fa5]+"];
if ([predicate evaluateWithObject:name]) {
    // name consists only of Chinese characters (U+4E00 to U+9FA5)
} else {
    // name is empty or contains a non-Chinese character
}
The article then explores surrogate handling with emoji, showing that the smiling face 😃 (U+1F603) is represented in UTF‑16 as the surrogate pair D83D DE03 and in UTF‑8 as the four‑byte sequence F0 9F 98 83. It explains how to reconstruct the original Unicode code point from the surrogate pair using the formula 0x10000 + (lead‑0xD800)*0x400 + (trail‑0xDC00). For this pair, that gives 0x10000 + 0x3D*0x400 + 0x203 = 0x1F603.
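The byte sequences and the surrogate formula described above can be checked with a short plain C sketch, which also compiles inside an Objective‑C file. It is an illustration under stated assumptions, not the article's own code; `utf8_encode` and `from_surrogate_pair` are hypothetical names introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp cannot be encoded. */
size_t utf8_encode(uint32_t cp, uint8_t out[4]) {
    assert(out != NULL);
    if (cp <= 0x7F) {                         /* 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                        /* 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                       /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                         /* lone surrogates are not characters */
        }
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                     /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                 /* beyond the Unicode range */
}

/* Rebuild a code point from a UTF-16 surrogate pair:
   0x10000 + (lead - 0xD800) * 0x400 + (trail - 0xDC00). */
uint32_t from_surrogate_pair(uint16_t lead, uint16_t trail) {
    assert(lead >= 0xD800 && lead <= 0xDBFF);
    assert(trail >= 0xDC00 && trail <= 0xDFFF);
    return 0x10000u + (uint32_t)(lead - 0xD800) * 0x400u
                    + (uint32_t)(trail - 0xDC00);
}
```

Encoding 0x6C49 with `utf8_encode` produces E6 B1 89, encoding 0x1F603 produces F0 9F 98 83, and `from_surrogate_pair(0xD83D, 0xDE03)` returns 0x1F603. The encoder rejects surrogate code points because they are reserved for UTF‑16 pairs and never stand for a character on their own.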
For iOS developers, the article demonstrates how to obtain the UTF‑16 and UTF‑8 byte representations of a string using NSString methods:
// NSUnicodeStringEncoding is an alias for NSUTF16StringEncoding
NSData *data = [str dataUsingEncoding:NSUnicodeStringEncoding];
NSData *utf8Data = [str dataUsingEncoding:NSUTF8StringEncoding];
It also shows how to create a GBK‑compatible encoding constant (the constant used is GB18030, which is backward compatible with GBK):
// GB18030 is backward compatible with GBK, so this encoding also handles GBK data
NSStringEncoding gbkEncoding = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
Finally, the article summarizes key points: GBK is a superset of GB2312; GB18030 extends GBK and can represent every Unicode code point; UTF‑8 is a transformation format, not a character set; and proper conversion between these encodings is essential for correct text handling across platforms.
Sohu Tech Products