
Why Do Emoji Turn into Question Marks? Master Unicode Encoding and Fix Socket Transmission

This article explains why emojis become garbled when transmitted via sockets, explores Unicode encoding fundamentals—including UTF‑8, BMP and high‑code‑point characters—and provides practical solutions using codePointAt, TextEncoder, and TextDecoder to ensure correct emoji handling.

Code Mala Tang

Background

1. Business background : the company transmits string data via sockets. Emoji entered at point A appear as question marks on iOS and as garbled text on PC.

2. Emoji garbling : the corruption is usually caused by improper handling of UTF‑8 encoding, because emoji are high‑code‑point characters that require four bytes.

About Unicode Encoding

Unicode is a universal character encoding standard that assigns a unique code point to every character.

Basic Concepts

Character : a written symbol such as a letter, digit, or punctuation.

Code point : the unique number assigned to a character, written in hexadecimal as U+XXXX.

Encoding form : the method of converting a code point to a byte sequence (e.g., UTF‑8, UTF‑16, UTF‑32).

Unicode Planes

Unicode divides its code space into 17 planes of 65,536 code points each. Major planes include:

Basic Multilingual Plane (BMP) : U+0000‑U+FFFF, contains most common characters.

Supplementary Multilingual Plane (SMP) : U+10000‑U+1FFFF, includes historic scripts and musical symbols.

Supplementary Ideographic Plane (SIP) : U+20000‑U+2FFFF, mainly CJK ideographs.

Supplementary Special-purpose Plane (SSP) : U+E0000‑U+EFFFF, special‑purpose characters such as tag characters and variation selectors.

The BMP was designed to cover the majority of modern text processing needs.

High‑code‑point characters (e.g., 😀 U+1F600, 𐎀 U+10380, 𝄞 U+1D11E) lie outside the BMP.
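Because JavaScript strings use UTF‑16 internally, these high‑code‑point characters are stored as surrogate pairs. As a quick sketch of the conversion formula (the function name `toSurrogatePair` is illustrative, not part of any standard API):

```javascript
// Convert a code point outside the BMP (> 0xFFFF) into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000; // remaining 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600); // 😀
console.log(high.toString(16), low.toString(16)); // d83d de00
```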

Common Unicode Encoding Forms

UTF‑8 : variable‑length encoding using 1‑4 bytes per character. Backward compatible with ASCII; widely used for network transmission and file storage.

UTF‑16 : uses 2 bytes for BMP characters and 4 bytes (a surrogate pair) for characters outside the BMP.

UTF‑32 : fixed‑length 4‑byte encoding for every character. Simple but consumes more memory.
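A small experiment makes the size difference concrete. The sketch below uses the standard TextEncoder API to count UTF‑8 bytes for characters from different planes:

```javascript
// Compare UTF-8 byte lengths for characters from different Unicode planes.
const encoder = new TextEncoder();

console.log(encoder.encode('A').length);  // 1 byte  (ASCII)
console.log(encoder.encode('你').length); // 3 bytes (BMP, CJK)
console.log(encoder.encode('😀').length); // 4 bytes (outside the BMP)

// In UTF-16, the emoji occupies two code units (a surrogate pair),
// which is why its .length in JavaScript is 2.
console.log('😀'.length); // 2
```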

Root Cause Analysis

Every endpoint fails to display the emoji, so the problem lies in how the data is encoded for transfer rather than in any single platform's rendering.

The socket transmits binary data via an ArrayBuffer. The custom writeUTFBytes function converts a UTF‑8 string to a byte stream:

<code>/**
 * Writes a UTF-8 string to the byte stream. Similar to writeUTF(), but
 * writeUTFBytes() does not prefix the string with a 16-bit length word.
 * The corresponding read method is getUTFBytes.
 * @param value The string to write.
 */
public function writeUTFBytes(value:String):void {
    // utf-8 encode
    value = value + "";
    for (var i:int = 0, sz:int = value.length; i < sz; i++) {
        var c:int = value.charCodeAt(i);
        if (c <= 0x7F) {
            writeByte(c);
        } else if (c <= 0x7FF) {
            _ensureWrite(this._pos_ + 2);
            this._u8d_.set([0xC0 | (c >> 6), 0x80 | (c & 0x3F)], _pos_);
            this._pos_ += 2;
        } else if (c <= 0xFFFF) {
            _ensureWrite(this._pos_ + 3);
            this._u8d_.set([0xE0 | (c >> 12), 0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F)], _pos_);
            this._pos_ += 3;
        } else {
            _ensureWrite(this._pos_ + 4);
            this._u8d_.set([0xF0 | (c >> 18), 0x80 | ((c >> 12) & 0x3F), 0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F)], _pos_);
            this._pos_ += 4;
        }
    }
}
</code>

The function processes each character, determines its Unicode code point, and encodes it into 1‑4 bytes according to UTF‑8 rules.

In JavaScript, charCodeAt returns a single 16‑bit UTF‑16 code unit, so for a high‑code‑point emoji it yields only one half of the surrogate pair. The encoder above then emits bytes for each surrogate separately, producing invalid UTF‑8. The correct method is codePointAt :

<code>let str = "😀";
console.log(str.charCodeAt(0)); // 55357 (high surrogate)
console.log(str.charCodeAt(1)); // 56832 (low surrogate)
console.log(str.codePointAt(0)); // 128512 (full Unicode code point)
</code>
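The same pitfall appears when iterating: an index‑based loop walks UTF‑16 code units, while `for...of` iterates by code point. A short sketch:

```javascript
const str = 'hi😀';

// .length counts UTF-16 code units, so the emoji counts as 2.
console.log(str.length); // 4

// for...of iterates by Unicode code point, keeping the emoji intact.
for (const ch of str) {
  console.log(ch, ch.codePointAt(0));
}
// h 104
// i 105
// 😀 128512
```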

Solutions

Two approaches:

Replace charCodeAt with codePointAt to obtain the full code point for emojis.

Use the browser‑provided TextDecoder and TextEncoder APIs for encoding/decoding.
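For the first approach, the encoding loop can be rewritten around codePointAt. The sketch below is a standalone JavaScript version (the original function is ActionScript‑style; `encodeUTF8` is an illustrative name, not the author's exact fix) — the key changes are reading the full code point and skipping the low surrogate after a 4‑byte sequence:

```javascript
// A minimal UTF-8 encoder that handles surrogate pairs correctly.
function encodeUTF8(value) {
  const bytes = [];
  for (let i = 0; i < value.length; i++) {
    const c = value.codePointAt(i); // full code point, not a single code unit
    if (c <= 0x7F) {
      bytes.push(c);
    } else if (c <= 0x7FF) {
      bytes.push(0xC0 | (c >> 6), 0x80 | (c & 0x3F));
    } else if (c <= 0xFFFF) {
      bytes.push(0xE0 | (c >> 12), 0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F));
    } else {
      bytes.push(
        0xF0 | (c >> 18),
        0x80 | ((c >> 12) & 0x3F),
        0x80 | ((c >> 6) & 0x3F),
        0x80 | (c & 0x3F)
      );
      i++; // the code point spanned two code units; skip the low surrogate
    }
  }
  return new Uint8Array(bytes);
}

console.log(encodeUTF8('😀')); // Uint8Array(4) [240, 159, 152, 128]
```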

TextDecoder converts binary data (Uint8Array or ArrayBuffer) to a string, supporting UTF‑8, UTF‑16, etc. Example:

<code>// Create a TextDecoder instance
const decoder = new TextDecoder('utf-8');
// Example Uint8Array
const uint8Array = new Uint8Array([0xe4, 0xbd, 0xa0, 0xe5, 0xa5, 0xbd]);
// Decode to string
const decodedString = decoder.decode(uint8Array);
console.log(decodedString); // Output: 你好
</code>

TextEncoder converts a string to a Uint8Array (UTF‑8 only). Example:

<code>// Create a TextEncoder instance
const encoder = new TextEncoder();
const string = '你好';
const encodedArray = encoder.encode(string);
console.log(encodedArray); // Uint8Array(6) [228, 189, 160, 229, 165, 189]
</code>

Both APIs provide a simple, efficient way to handle text‑binary conversions, especially for diverse character encodings.
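Putting the two APIs together, a round trip confirms that emoji survive the conversion — this is essentially what the socket pipeline should do on each side:

```javascript
// Round trip: encode a string containing emoji to UTF-8 bytes and back.
const encoder = new TextEncoder();
const decoder = new TextDecoder('utf-8');

const original = 'Hello 😀 你好';
const bytes = encoder.encode(original); // Uint8Array of UTF-8 bytes
const restored = decoder.decode(bytes);

console.log(restored === original); // true
```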

Conclusion

The article covered Unicode encoding basics and how improper handling leads to emoji garbling. Key takeaways:

Do not use charCodeAt for emojis; use codePointAt instead.

Prefer TextDecoder and TextEncoder for reliable encoding and decoding.

Tags: emoji, Unicode, UTF-8, Socket, TextDecoder, TextEncoder, codePointAt
Written by

Code Mala Tang

Read source code together, write articles together, and enjoy spicy hot pot together.
