Understanding URL Encoding: escape, encodeURI, encodeURIComponent, Percent‑Encoding, ASCII, Unicode and UTF‑8
This article explains the differences between JavaScript's escape, encodeURI and encodeURIComponent functions, the principles of percent‑encoding, the classification of reserved, unreserved and unsafe characters, and provides an overview of ASCII, Unicode and UTF‑8 character encodings.
The World of URL Encoding Is Fascinating, Take a Look
The article begins with an introduction to the JDC Multi‑Terminal R&D Lab, which focuses on front‑end capabilities such as web, mini‑programs, games and H5 animations.
1. Starting with escape and encodeURI
Assuming you already know how escape works:
It does not encode ASCII letters and digits.
It does not encode the characters *@-_+./ .
All other characters are replaced by escape sequences.
escape('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789')
// "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
escape('*@-_+./')
// "*@-_+./"
Assuming you already know how encodeURI works:
It does not encode ASCII letters and digits.
It does not encode the 20 ASCII punctuation characters -_.!~*'();/?:@&=+$,# .
All other characters are replaced by escape sequences.
encodeURI('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789')
// "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
encodeURI('-_.!~*\'();/?:@&=+$,#')
// "-_.!~*'();/?:@&=+$,#"
The two functions encode the string "凹凸" differently:
escape('凹凸')
// "%u51F9%u51F8"
encodeURI('凹凸')
// "%E5%87%B9%E5%87%B8"
2. Percent‑Encoding
Percent‑encoding (also called URL encoding) represents each encoded byte as a percent sign followed by two hexadecimal digits. The key difference between the two functions is that encodeURI follows the IETF URI standard (RFC 3986), while escape uses a non‑standard scheme.
Common point: ASCII characters that need encoding are represented as %xx .
Standard (encodeURI): Non‑ASCII characters are first converted to UTF‑8 bytes, then each byte is percent‑encoded.
Non‑standard (escape): Characters in the range 0x80‑0xFF are encoded as %xx , and characters above 0xFF are encoded as %uxxxx , where xxxx is the character's UTF‑16 code unit in four hex digits.
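As a rough sketch of the two schemes, the helpers below encode a string once through its UTF‑8 bytes and once in the escape style. Both function names are made up for illustration, and unlike the built‑in functions they encode every character, including letters and digits. TextEncoder is assumed to be available (modern browsers and Node.js).

```javascript
// Hypothetical helper: percent-encode EVERY UTF-8 byte of a string.
// (The built-in encodeURI skips letters, digits and some punctuation.)
function percentEncodeUtf8(str) {
  const bytes = new TextEncoder().encode(str); // string -> UTF-8 bytes
  return Array.from(bytes, (b) =>
    '%' + b.toString(16).toUpperCase().padStart(2, '0')
  ).join('');
}

// Hypothetical helper: escape-style encoding of EVERY UTF-16 code unit.
// Units below 0x100 become %XX, all others become %uXXXX.
function percentEncodeEscapeStyle(str) {
  let out = '';
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const hex = unit.toString(16).toUpperCase();
    out += unit < 0x100
      ? '%' + hex.padStart(2, '0')
      : '%u' + hex.padStart(4, '0');
  }
  return out;
}

console.log(percentEncodeUtf8('凹凸'));        // %E5%87%B9%E5%87%B8
console.log(percentEncodeEscapeStyle('凹凸')); // %u51F9%u51F8
```

For a string made only of non‑ASCII characters, the first helper produces the same output as encodeURI, which is a quick way to see that the standard scheme works byte by byte.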
3. Reserved, Unreserved and Unsafe Characters
According to RFC 3986, a URL may contain unreserved characters (letters, digits, -_.~ ) and reserved characters. Unreserved characters do not need percent‑encoding.
Reserved characters have special meanings (e.g., :/?#[]@!$&'()*+,;= ). Some characters are considered unsafe and should be encoded because they can cause ambiguity or be altered by proxies.
Unsafe Character    Why Unsafe                                                         Example
%                   Used as the escape marker, so it must itself be encoded.           encodeURI('%') // "%25"
Space               Spaces may be introduced or stripped during transmission.          encodeURI(' ') // "%20"
<>"                 Angle brackets and quotes are often used to delimit URLs in text.  encodeURI('<>"') // "%3C%3E%22"
{}|\^~[]'           Some gateways or proxies may tamper with these characters.         encodeURI("{}|\\^~[]'") // "%7B%7D%7C%5C%5E~%5B%5D'"
0x00‑0x1F, 0x7F     Control characters are non‑printable.                              e.g., line feed 0x0A
>0x7F               Characters outside the 7‑bit ASCII range.                          encodeURI('京东') // "%E4%BA%AC%E4%B8%9C"
Thus encodeURI leaves 82 characters unencoded: the 66 unreserved characters plus the 18 reserved characters, minus the two reserved characters [ and ] , which it treats as unsafe.
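The count of 82 can be checked empirically. The following is a minimal sketch, assuming a standard JavaScript engine; unencodedAscii is a hypothetical helper name, not a built‑in.

```javascript
// Hypothetical helper: list the ASCII characters (0x00-0x7F) that the
// given encoder function returns unchanged.
function unencodedAscii(encoder) {
  const kept = [];
  for (let code = 0; code < 128; code++) {
    const ch = String.fromCharCode(code);
    if (encoder(ch) === ch) kept.push(ch);
  }
  return kept;
}

const keptByEncodeURI = unencodedAscii(encodeURI);
console.log(keptByEncodeURI.length); // 82

// RFC 3986 reserved characters that encodeURI nevertheless encodes.
const reserved = ":/?#[]@!$&'()*+,;=";
const encodedReserved = [...reserved].filter((ch) => encodeURI(ch) !== ch);
console.log(encodedReserved.join('')); // []
```

Because the helper takes the encoder as a parameter, the same loop can be pointed at any other encoding function to compare the sets of characters they leave alone.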
4. encodeURI vs. encodeURIComponent
encodeURIComponent assumes the input is a component of a URI (e.g., query string) and therefore encodes characters that separate URI parts. Its unencoded set contains only 71 characters.
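To illustrate why separators must be encoded inside a component, here is a small sketch that builds a query string. The helper buildUrl and the /search path are assumptions made up for this example, not part of any library or of the original article.

```javascript
// Hypothetical helper: append query parameters to a base URL.
// Each key and value is a URI component, so each goes through
// encodeURIComponent before the "=" and "&" separators are added.
function buildUrl(base, params) {
  const query = Object.entries(params)
    .map(([key, value]) =>
      encodeURIComponent(key) + '=' + encodeURIComponent(value)
    )
    .join('&');
  return query ? base + '?' + query : base;
}

const url = buildUrl('https://aotu.io/search', { q: 'a&b=c', lang: '凹凸' });
console.log(url);
// https://aotu.io/search?q=a%26b%3Dc&lang=%E5%87%B9%E5%87%B8

// encodeURI leaves "&" and "=" alone, so the same value would be read
// as two parameters ("q=a" and "b=c") instead of one.
console.log(encodeURI('a&b=c')); // a&b=c
```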
encodeURIComponent('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789,/?:@&=+$#')
// "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789%2C%2F%3F%3A%40%26%3D%2B%24%23"
Example comparison:
encodeURIComponent('https://aotu.io/') // "https%3A%2F%2Faotu.io%2F"
encodeURI('https://aotu.io/') // "https://aotu.io/"
5. Character Encodings
ASCII : A 7‑bit character set. Each character is stored in one byte (8 bits, 256 possible values), but the highest bit is fixed to 0, so only 128 characters are defined.
Unicode : A universal character set that assigns a unique code point to every character in virtually all writing systems.
UTF‑8 : One of the encoding schemes for Unicode. It uses a variable number of bytes (1‑4) per code point. The article shows how the character “凹” (U+51F9) is encoded as three UTF‑8 bytes E5 87 B9 ("%E5%87%B9").
encodeURI('凹') // "%E5%87%B9"
Explanation of the three‑byte UTF‑8 sequence:
The first byte starts with "1110", indicating a three‑byte character.
The following two bytes start with "10", marking continuation bytes.
The decoder knows to combine the three bytes to form a single Unicode symbol.
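The byte layout described above can be reproduced with a few bit operations. This is a minimal sketch rather than a full encoder: utf8Bytes is a hypothetical helper, and it assumes the input is a valid code point (it does not reject surrogates or values above U+10FFFF).

```javascript
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function utf8Bytes(codePoint) {
  if (codePoint < 0x80) return [codePoint]; // 0xxxxxxx
  if (codePoint < 0x800) {
    return [
      0xC0 | (codePoint >> 6),              // 110xxxxx
      0x80 | (codePoint & 0x3F),            // 10xxxxxx
    ];
  }
  if (codePoint < 0x10000) {
    return [
      0xE0 | (codePoint >> 12),             // 1110xxxx
      0x80 | ((codePoint >> 6) & 0x3F),     // 10xxxxxx
      0x80 | (codePoint & 0x3F),            // 10xxxxxx
    ];
  }
  return [
    0xF0 | (codePoint >> 18),               // 11110xxx
    0x80 | ((codePoint >> 12) & 0x3F),      // 10xxxxxx
    0x80 | ((codePoint >> 6) & 0x3F),       // 10xxxxxx
    0x80 | (codePoint & 0x3F),              // 10xxxxxx
  ];
}

const bytes = utf8Bytes('凹'.codePointAt(0)); // U+51F9
console.log(bytes.map((b) => b.toString(2).padStart(8, '0')).join(' '));
// 11100101 10000111 10111001
console.log(
  bytes.map((b) => '%' + b.toString(16).toUpperCase().padStart(2, '0')).join('')
);
// %E5%87%B9
```

The binary output shows the "1110" lead byte followed by two "10" continuation bytes, and the hex output matches what encodeURI returns for the same character.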