Why encodeURIComponent Throws “URI malformed” and How Unicode Encoding Works
This article explains why encodeURIComponent can raise a URI malformed error, clarifies the concepts of high and low surrogate pairs, and provides a comprehensive overview of character sets, Unicode, and the UTF‑8 and UTF‑16 encodings used in JavaScript.
1. Understanding the “URI malformed” error
In front‑end development, calling encodeURIComponent on a string that contains an unpaired (lone) surrogate throws “Uncaught URIError: URI malformed”. A high surrogate must be immediately followed by a low surrogate, and a low surrogate must be immediately preceded by a high one; strings that contain no surrogates at all encode without error.
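encodeURIComponent('\uD800\uDFFF') // complete surrogate pair: returns "%F0%90%8F%BF"
encodeURIComponent('\uD800')       // lone high surrogate: throws URIError
encodeURIComponent('\uDFFF')       // lone low surrogate: throws URIError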
High and low surrogates are the two 16‑bit code units that UTF‑16 uses together to represent a single Unicode code point above U+FFFF, which does not fit in one 16‑bit unit.
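One way to guard against the error is to look for lone surrogates before encoding. The following is a minimal sketch, not a definitive implementation: `hasLoneSurrogate` and `safeEncodeURIComponent` are hypothetical helper names, and replacing a bad code unit with U+FFFD is an assumed policy (rejecting the input may suit your application better).

```javascript
// Matches a high surrogate that is not followed by a low surrogate,
// or a low surrogate that is not preceded by a high surrogate.
// Without the "u" flag the pattern works on 16-bit code units.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;

// Hypothetical helper: true if the string contains an unpaired surrogate.
function hasLoneSurrogate(str) {
  return str.search(LONE_SURROGATE) !== -1;
}

// Hypothetical helper: swaps each lone surrogate for U+FFFD
// (an assumed policy), then encodes as usual.
function safeEncodeURIComponent(str) {
  return encodeURIComponent(str.replace(LONE_SURROGATE, "\uFFFD"));
}

console.log(hasLoneSurrogate("\uD800\uDFFF"));       // false
console.log(hasLoneSurrogate("\uD800"));             // true
console.log(safeEncodeURIComponent("\uD800"));       // "%EF%BF%BD"
console.log(safeEncodeURIComponent("\uD800\uDFFF")); // "%F0%90%8F%BF"
```

Recent JavaScript engines also provide the built-in string methods `isWellFormed()` and `toWellFormed()`, which perform the same check and replacement; the regular expression above is only a fallback for environments without them.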
2. Character sets and encodings
A character set is a collection of characters with assigned numbers. Character encoding defines how those numbers are stored in bytes.
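In most legacy character sets the assigned number is stored directly, so the character set and its encoding are effectively the same thing. Unicode separates the two: a single set of code points can be stored with several encodings, such as UTF‑8, UTF‑16, and UTF‑32.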
2.1 ASCII
ASCII is a 7‑bit code that defines 128 characters (English letters, digits, common symbols, control codes). Each character is normally stored in one byte (8 bits) whose most significant bit is 0.
2.2 ISO‑8859‑1
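ISO‑8859‑1 (Latin‑1) extends ASCII by using the byte values whose most significant bit is 1 (0x80‑0xFF), which provides another 128 code points: 0xA0‑0xFF hold Western European letters and symbols, while 0x80‑0x9F are reserved for control codes.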
2.3 GB2312 – Simplified Chinese
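GB2312 is laid out as a 94 × 94 grid, giving 8 836 double‑byte code positions (not all of them are assigned). The high byte encodes the “row” and the low byte the “cell”. Each byte is obtained by adding 0xA0 to the row or cell number, which guarantees that its most significant bit is 1 and keeps it distinct from ASCII.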
Example: the character “字” has row 55 (0x37) and cell 54 (0x36); after adding 0xA0 the GB2312 code is 0xD7D6.
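The row/cell arithmetic above can be sketched in a few lines. `gb2312FromRowCell` is a hypothetical helper written for illustration; it only performs the offset calculation and does not check whether a character is actually assigned at that position.

```javascript
// Hypothetical helper: converts a GB2312 row/cell position (each 1..94)
// into the two-byte code, returned as an uppercase hex string.
function gb2312FromRowCell(row, cell) {
  const inRange = (n) => Number.isInteger(n) && n >= 1 && n <= 94;
  if (!inRange(row) || !inRange(cell)) {
    throw new RangeError("row and cell must be integers in 1..94");
  }
  const high = row + 0xA0;  // high byte, most significant bit is always 1
  const low = cell + 0xA0;  // low byte, most significant bit is always 1
  return ((high << 8) | low).toString(16).toUpperCase();
}

console.log(gb2312FromRowCell(55, 54)); // "D7D6"
```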
2.4 GBK – Extension for Traditional Chinese
GBK also uses double‑byte storage, with a high byte in the range 0x81‑0xFE and a low byte in the range 0x40‑0xFE (excluding 0x7F). That gives 126 × 190 = 23 940 code positions. GBK is backward compatible with GB2312: every GB2312 code keeps the same two bytes.
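As a quick check of the byte ranges, the sketch below counts every byte pair that falls inside the GBK double‑byte area. `isGbkDoubleByte` is a hypothetical name used for illustration; it tests only the ranges stated above, not whether a character is assigned to the position.

```javascript
// Hypothetical helper: true if (high, low) lies in the GBK double-byte area.
function isGbkDoubleByte(high, low) {
  return high >= 0x81 && high <= 0xFE &&
         low >= 0x40 && low <= 0xFE && low !== 0x7F;
}

// Count all valid byte pairs by brute force.
let gbkPositions = 0;
for (let high = 0; high <= 0xFF; high++) {
  for (let low = 0; low <= 0xFF; low++) {
    if (isGbkDoubleByte(high, low)) gbkPositions++;
  }
}

console.log(gbkPositions);                // 23940
console.log(isGbkDoubleByte(0xD7, 0xD6)); // true: a GB2312 code is also valid in GBK
```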
2.5 Unicode – Universal Character Set
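Unicode assigns code points from 0x000000 to 0x10FFFF across 17 planes of 65 536 code points each. The first plane (the Basic Multilingual Plane, BMP) contains the “surrogate” range 0xD800‑0xDFFF. These code points are reserved for UTF‑16 surrogate pairs: they are never assigned to characters, and a single one on its own does not represent anything.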
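The plane layout can be illustrated with simple bit arithmetic. This is a sketch for illustration; `planeOf` and `isSurrogateCodePoint` are hypothetical helper names.

```javascript
// Hypothetical helper: the plane number (0..16) is the code point divided by 0x10000.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("code point must be an integer in 0..0x10FFFF");
  }
  return codePoint >> 16;
}

// Hypothetical helper: true for the reserved surrogate range in the BMP.
function isSurrogateCodePoint(codePoint) {
  return codePoint >= 0xD800 && codePoint <= 0xDFFF;
}

console.log(planeOf(0x6C49));              // 0  (BMP)
console.log(planeOf(0x16904));             // 1  (first supplementary plane)
console.log(planeOf(0x10FFFF));            // 16 (last plane)
console.log(isSurrogateCodePoint(0xD800)); // true
```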
2.5.1 UTF‑8
UTF‑8 is a variable‑length encoding that uses 1 to 4 bytes per code point. A byte whose most significant bit is 0 represents an ASCII character on its own. A byte starting with 110, 1110, or 11110 is the first byte of a 2‑, 3‑, or 4‑byte sequence respectively, and every following byte in that sequence starts with 10. Example: the character “汉” (U+6C49) becomes the three‑byte sequence E6 B1 89 (hex).
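The bit layout described above can be written out by hand. The sketch below is for illustration only: `utf8Bytes` and `toHex` are hypothetical helpers, and in real code the built-in `TextEncoder` does this work.

```javascript
// Hypothetical helper: encodes one Unicode scalar value as an array of UTF-8 bytes.
function utf8Bytes(codePoint) {
  const isSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
  if (!Number.isInteger(codePoint) || codePoint < 0 ||
      codePoint > 0x10FFFF || isSurrogate) {
    throw new RangeError("not a Unicode scalar value");
  }
  if (codePoint < 0x80) {
    return [codePoint];                        // 0xxxxxxx
  }
  if (codePoint < 0x800) {
    return [0xC0 | (codePoint >> 6),           // 110xxxxx
            0x80 | (codePoint & 0x3F)];        // 10xxxxxx
  }
  if (codePoint < 0x10000) {
    return [0xE0 | (codePoint >> 12),          // 1110xxxx
            0x80 | ((codePoint >> 6) & 0x3F),  // 10xxxxxx
            0x80 | (codePoint & 0x3F)];        // 10xxxxxx
  }
  return [0xF0 | (codePoint >> 18),            // 11110xxx
          0x80 | ((codePoint >> 12) & 0x3F),   // 10xxxxxx
          0x80 | ((codePoint >> 6) & 0x3F),    // 10xxxxxx
          0x80 | (codePoint & 0x3F)];          // 10xxxxxx
}

// Hypothetical helper: formats bytes as space-separated uppercase hex.
const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(toHex(utf8Bytes(0x6C49)));    // "E6 B1 89"
console.log(encodeURIComponent("\u6C49")); // "%E6%B1%89"
```

Note that the surrogate range is rejected: a lone surrogate has no UTF‑8 form, which is the root of the “URI malformed” error.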
2.5.2 UTF‑16
UTF‑16 uses one 16‑bit unit for code points in the BMP and a surrogate pair (two 16‑bit units) for supplementary planes. The high surrogate range is 0xD800‑0xDBFF and the low surrogate range is 0xDC00‑0xDFFF.
Conversion example for U+16904: subtract 0x10000 to get the 20‑bit value 0x06904, split it into its high 10 bits (0x001A) and low 10 bits (0x0104), then add 0xD800 to the high part and 0xDC00 to the low part to obtain the pair 0xD81A 0xDD04.
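The conversion steps can be sketched directly in code. `toSurrogatePair` and `fromSurrogatePair` are hypothetical helpers written for illustration; in practice `String.fromCodePoint` and `codePointAt` do the same job.

```javascript
// Hypothetical helper: splits a supplementary code point into [high, low] surrogates.
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point must be in 0x10000..0x10FFFF");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: the inverse, recombining a valid pair into one code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

const [high, low] = toSurrogatePair(0x16904);
console.log(high.toString(16), low.toString(16));           // "d81a" "dd04"
console.log(fromSurrogatePair(0xD81A, 0xDD04).toString(16)); // "16904"

// Cross-check against the engine's own UTF-16 encoding.
const s = String.fromCodePoint(0x16904);
console.log(s.charCodeAt(0) === high && s.charCodeAt(1) === low); // true
```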
3. UTF‑16 encoding in JavaScript
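In JavaScript, a character can be written as “\uXXXX”, which denotes a single UTF‑16 code unit. A JavaScript string is simply a sequence of such code units, so a string like “\uD800” is a legal string value, but it is not well‑formed UTF‑16 because the surrogate is unpaired. encodeURIComponent has to convert the string to UTF‑8 before percent‑encoding it, and a lone surrogate has no UTF‑8 representation, so the function throws “URI malformed”. A valid surrogate pair such as “\uD800\uDFFF” encodes normally.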
The same error appears whenever application code or a library runs encodeURIComponent over form data or Ajax payloads that contain a lone surrogate, for example after a string has been cut in the middle of a surrogate pair.
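One common way a lone surrogate ends up in a payload is truncation by code units, since `length` and `slice` count 16‑bit units rather than characters. The sketch below illustrates this and one possible workaround; `truncateByCodePoints` is a hypothetical helper, and the sample text is made up for the example.

```javascript
const text = "ok\u{1F600}";  // "ok" followed by U+1F600, stored as a surrogate pair
console.log(text.length);    // 4 code units, not 3 characters

// Cutting at code unit 3 keeps the high surrogate and drops the low one.
const cut = text.slice(0, 3);
let threw = false;
try {
  encodeURIComponent(cut);
} catch (e) {
  threw = e instanceof URIError;
}
console.log(threw); // true

// Hypothetical helper: Array.from iterates by code point, so pairs stay together.
function truncateByCodePoints(str, maxCodePoints) {
  return Array.from(str).slice(0, maxCodePoints).join("");
}

console.log(encodeURIComponent(truncateByCodePoints(text, 3))); // "ok%F0%9F%98%80"
console.log(encodeURIComponent(truncateByCodePoints(text, 2))); // "ok"
```

This only avoids splitting pairs that were intact to begin with; input that already contains a lone surrogate still needs to be checked or repaired before encoding.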
Reference: “Character Encoding – A Complete Guide”, http://www.cnblogs.com/leesf456/p/5317574.html
