Why Do Emoji Turn into Question Marks? Mastering Unicode Encoding to Fix Socket Transmission
This article explains why emoji become garbled when transmitted via sockets, covers Unicode encoding fundamentals (UTF‑8, the BMP, and high‑code‑point characters), and provides practical fixes using codePointAt, TextEncoder, and TextDecoder to ensure correct emoji handling.
Background
1. Business background The company transmits string data via sockets; emojis entered at point A appear as question marks on iOS and as garbled text on PC.
2. Emoji garbling Emoji corruption is usually caused by improper handling of UTF‑8 encoding, as emojis require four bytes (high code‑point characters).
About Unicode Encoding
Unicode is a universal character encoding standard that assigns a unique code point to every character.
Basic Concepts
Character: a written symbol such as a letter, digit, or punctuation mark.
Code point: the unique number assigned to a character, written as U+XXXX.
Encoding form: the method of converting a code point into a byte sequence (e.g., UTF‑8, UTF‑16, UTF‑32).
Unicode Planes
Unicode divides its code space into 17 planes of 65,536 code points each. Major planes include:
Basic Multilingual Plane (BMP): U+0000–U+FFFF, contains most common characters.
Supplementary Multilingual Plane (SMP): U+10000–U+1FFFF, includes historic scripts, musical symbols, and most emoji.
Supplementary Ideographic Plane (SIP): U+20000–U+2FFFF, mainly rare CJK ideographs.
Supplementary Special-purpose Plane (SSP): U+E0000–U+EFFFF, special‑purpose characters such as language tags and variation selectors.
The BMP was designed to cover the majority of modern text processing needs.
High‑code‑point characters (e.g., 😀 U+1F600, 𐎀 U+10380, 𝄞 U+1D11E) lie outside the BMP.
Common Unicode Encoding Forms
UTF‑8 Variable‑length encoding using 1‑4 bytes per character. Backward compatible with ASCII. Widely used for network transmission and file storage.
UTF‑16 Uses 2 bytes for BMP characters and 4 bytes (surrogate pairs) for characters outside BMP.
UTF‑32 Fixed‑length 4‑byte encoding for every character. Simple but consumes more memory.
Root Cause Analysis
Every endpoint fails to display the emoji, which points to corruption during data transfer rather than a platform‑specific rendering problem.
The socket transmits binary data via an ArrayBuffer. The custom writeUTFBytes function encodes a string into a UTF‑8 byte stream:
<code>/**
 * Writes a UTF-8 string to the byte stream. Similar to the writeUTF() method,
 * but writeUTFBytes() does not prefix the string with a 16-bit length word.
 * The corresponding read method is getUTFBytes.
 * @param value The string to write.
 */
public function writeUTFBytes(value:String):void {
    // utf-8 encode
    value = value + "";
    for (var i:int = 0, sz:int = value.length; i < sz; i++) {
        var c:int = value.charCodeAt(i);
        if (c <= 0x7F) {
            writeByte(c);
        } else if (c <= 0x7FF) {
            _ensureWrite(this._pos_ + 2);
            this._u8d_.set([0xC0 | (c >> 6), 0x80 | (c & 0x3F)], _pos_);
            this._pos_ += 2;
        } else if (c <= 0xFFFF) {
            // charCodeAt never returns a value above 0xFFFF, so surrogate
            // halves of non-BMP characters fall into this 3-byte branch...
            _ensureWrite(this._pos_ + 3);
            this._u8d_.set([0xE0 | (c >> 12), 0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F)], _pos_);
            this._pos_ += 3;
        } else {
            // ...and this 4-byte branch is unreachable.
            _ensureWrite(this._pos_ + 4);
            this._u8d_.set([0xF0 | (c >> 18), 0x80 | ((c >> 12) & 0x3F), 0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F)], _pos_);
            this._pos_ += 4;
        }
    }
}
</code>The function walks the string one value at a time, treats each value as a code point, and encodes it into 1–4 bytes according to UTF‑8 rules.
In JavaScript, however, charCodeAt returns a single UTF‑16 code unit, not a full code point. An emoji outside the BMP is stored as a surrogate pair, so charCodeAt yields each surrogate half separately; both halves are ≤ 0xFFFF, so each is encoded as an invalid 3‑byte sequence and the 4‑byte branch is never reached, corrupting the emoji. The correct method is codePointAt:
<code>let str = "😀";
console.log(str.charCodeAt(0)); // 55357 (high surrogate)
console.log(str.charCodeAt(1)); // 56832 (low surrogate)
console.log(str.codePointAt(0)); // 128512 (full Unicode code point)
</code>Solutions
Two approaches:
Replace charCodeAt with codePointAt to obtain the full code point for emojis.
Use the browser‑provided TextDecoder and TextEncoder APIs for encoding/decoding.
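The first approach can be sketched as a standalone helper. This is a minimal illustration, not the original Byte class: the name encodeUTF8 is hypothetical, and the key changes are reading with codePointAt and skipping the trailing low surrogate after writing a 4‑byte sequence:

```javascript
// Sketch of a UTF-8 writer that iterates by code point instead of
// by UTF-16 code unit. encodeUTF8 is an illustrative name.
function encodeUTF8(value) {
  const bytes = [];
  for (let i = 0; i < value.length; i++) {
    const c = value.codePointAt(i); // full code point, even outside the BMP
    if (c <= 0x7F) {
      bytes.push(c); // 1 byte: ASCII
    } else if (c <= 0x7FF) {
      bytes.push(0xC0 | (c >> 6), 0x80 | (c & 0x3F)); // 2 bytes
    } else if (c <= 0xFFFF) {
      bytes.push(0xE0 | (c >> 12), 0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F)); // 3 bytes
    } else {
      bytes.push(
        0xF0 | (c >> 18),
        0x80 | ((c >> 12) & 0x3F),
        0x80 | ((c >> 6) & 0x3F),
        0x80 | (c & 0x3F)
      ); // 4 bytes
      i++; // skip the low surrogate code unit we just consumed
    }
  }
  return new Uint8Array(bytes);
}

console.log(encodeUTF8("😀")); // Uint8Array(4) [240, 159, 152, 128]
```

The output matches the canonical UTF‑8 encoding of U+1F600 (F0 9F 98 80), so a receiver decoding the stream as UTF‑8 sees the emoji intact.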
TextDecoder converts binary data (Uint8Array or ArrayBuffer) to a string, supporting UTF‑8, UTF‑16, etc. Example:
<code>// Create a TextDecoder instance
const decoder = new TextDecoder('utf-8');
// Example Uint8Array
const uint8Array = new Uint8Array([0xe4, 0xbd, 0xa0, 0xe5, 0xa5, 0xbd]);
// Decode to string
const decodedString = decoder.decode(uint8Array);
console.log(decodedString); // Output: 你好
</code>TextEncoder converts a string to a Uint8Array (UTF‑8 only). Example:
<code>// Create a TextEncoder instance
const encoder = new TextEncoder();
const string = '你好';
const encodedArray = encoder.encode(string);
console.log(encodedArray); // Uint8Array(6) [228, 189, 160, 229, 165, 189]
</code>Both APIs provide a simple, efficient way to handle text‑binary conversions, especially for diverse character encodings.
Conclusion
The article covered Unicode encoding basics and how improper handling leads to emoji garbling. Key takeaways:
Do not use charCodeAt for emojis; use codePointAt instead.
Prefer TextDecoder and TextEncoder for reliable encoding and decoding.
Code Mala Tang
Read source code together, write articles together, and enjoy spicy hot pot together.