
Understanding URL Encoding: escape, encodeURI, encodeURIComponent, Percent‑Encoding, ASCII, Unicode and UTF‑8

This article explains the differences between JavaScript's escape, encodeURI and encodeURIComponent functions, the principles of percent‑encoding, the classification of reserved, unreserved and unsafe characters, and provides an overview of ASCII, Unicode and UTF‑8 character encodings.

JD Tech

Sep 19, 2018

The World of URL Encoding Is Fascinating, Take a Look

The article begins with an introduction to the JDC Multi‑Terminal R&D Lab, which focuses on front‑end capabilities such as web, mini‑programs, games and H5 animations.

1. Starting with escape and encodeURI

As a reminder, this is how escape works:

It does not encode ASCII letters and digits.

It does not encode the characters *@-_+./.

All other characters are replaced by escape sequences.

escape('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789')
// "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"

escape('*@-_+./')
// "*@-_+./"

And this is how encodeURI works:

It does not encode ASCII letters and digits.

It does not encode the 20 ASCII punctuation characters -_.!~*'();/?:@&=+$,#.

All other characters are replaced by escape sequences.

encodeURI('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789')
// "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"

encodeURI('-_.!~*\'();/?:@&=+$,#')
// "-_.!~*'();/?:@&=+$,#"

The two functions encode the string "凹凸" differently:

escape('凹凸')
// "%u51F9%u51F8"

encodeURI('凹凸')
// "%E5%87%B9%E5%87%B8"

2. Percent‑Encoding

Percent-encoding (also called URL encoding) represents a byte as a percent sign followed by two hexadecimal digits. The key difference between the two functions is that encodeURI follows the percent-encoding scheme of the IETF URI standard (RFC 3986), while the %uxxxx form produced by escape is non-standard.

Common point: ASCII characters that need encoding are represented as %xx.

Standard (encodeURI): Non‑ASCII characters are first converted to UTF‑8 bytes, then each byte is percent‑encoded.

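Non-standard (escape): Characters in the range 0x80 to 0xFF are written as %xx, and anything above 0xFF is written as %uxxxx, where xxxx is the UTF-16 code unit in four hex digits. A character outside the Basic Multilingual Plane therefore becomes two %uxxxx sequences, one per surrogate.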
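The two schemes can be reproduced by hand. The following is a minimal sketch, assuming a runtime that provides the standard TextEncoder global (modern browsers and Node.js). The helper names percentEncodeUtf8 and escapeStyle are hypothetical and exist only for illustration; unlike the built-in functions, they encode every character, including ASCII letters.

```javascript
// Hypothetical helper: percent-encode every UTF-8 byte of a string.
// Unlike encodeURI, it has no "leave as is" set and encodes all characters.
function percentEncodeUtf8(str) {
  const bytes = new TextEncoder().encode(str); // UTF-8 bytes
  return Array.from(
    bytes,
    (b) => '%' + b.toString(16).toUpperCase().padStart(2, '0')
  ).join('');
}

// Hypothetical helper: build the %uXXXX form from UTF-16 code units,
// mirroring what escape does for code units above 0xFF.
function escapeStyle(str) {
  let out = '';
  for (let i = 0; i < str.length; i++) {
    out += '%u' + str.charCodeAt(i).toString(16).toUpperCase().padStart(4, '0');
  }
  return out;
}

console.log(percentEncodeUtf8('凹凸')); // "%E5%87%B9%E5%87%B8"
console.log(escapeStyle('凹凸'));       // "%u51F9%u51F8"
```

For "凹凸" the two helpers give the same strings as encodeURI and escape in the earlier example, which makes the byte-based versus code-unit-based difference visible.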

3. Reserved, Unreserved and Unsafe Characters

According to RFC 3986, a URL may contain unreserved characters (letters, digits, -_.~) and reserved characters. Unreserved characters do not need percent‑encoding.

Reserved characters have special meanings (e.g., :/?#[]@!$&'()*+,;=). Some characters are considered unsafe and should be encoded because they can cause ambiguity or be altered by proxies.
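As a rough illustration of these classes, the sketch below sorts a single ASCII character into "unreserved", "reserved" or "other", using the character lists given above. The function name classify is hypothetical, and "other" simply means "neither unreserved nor reserved"; it is not a term taken from the RFC.

```javascript
// Character classes as listed in RFC 3986 (sections 2.2 and 2.3).
const UNRESERVED = /^[A-Za-z0-9\-._~]$/;
const GEN_DELIMS = ':/?#[]@';
const SUB_DELIMS = "!$&'()*+,;=";

// Hypothetical helper: classify exactly one character.
function classify(ch) {
  if (typeof ch !== 'string' || ch.length !== 1) {
    throw new TypeError('expected a single character');
  }
  if (UNRESERVED.test(ch)) return 'unreserved';
  if (GEN_DELIMS.includes(ch) || SUB_DELIMS.includes(ch)) return 'reserved';
  return 'other'; // e.g. space, %, <, >, control characters
}

console.log(classify('a')); // "unreserved"
console.log(classify('~')); // "unreserved"
console.log(classify('?')); // "reserved"
console.log(classify(' ')); // "other"
```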

The unsafe characters, why each is unsafe, and an example:

Unsafe character: %
Why unsafe: It is used as the escape marker, so a literal % must itself be encoded.
Example: encodeURI('%') // "%25"

Unsafe character: space
Why unsafe: Spaces may be introduced or stripped during transmission.
Example: encodeURI(' ') // "%20"

Unsafe characters: <>"
Why unsafe: Angle brackets and quotes are often used to delimit URLs in text.
Example: encodeURI('<>"') // "%3C%3E%22"

Unsafe characters: {}|\^~[]'
Why unsafe: Some gateways or proxies may tamper with these characters.
Example: encodeURI("{}|\\^~[]'") // "%7B%7D%7C%5C%5E~%5B%5D'" (encodeURI leaves ~ and ' as they are)

Unsafe characters: 0x00-0x1F, 0x7F
Why unsafe: Control characters are non-printable.
Example: line feed, 0x0A

Unsafe characters: >0x7F
Why unsafe: They are outside the 7-bit ASCII range.
Example: encodeURI('京东') // "%E4%BA%AC%E4%B8%9C"

Thus encodeURI leaves 82 characters unencoded: the 66 unreserved characters plus the 18 reserved characters, minus the two unsafe reserved characters [ and ].
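The count of 82 can be checked empirically. This is a small sketch that runs encodeURI over every 7-bit ASCII character and keeps those that come back unchanged; the variable name untouched is arbitrary.

```javascript
// Collect every 7-bit ASCII character that encodeURI returns unchanged.
const untouched = [];
for (let code = 0; code < 128; code++) {
  const ch = String.fromCharCode(code);
  if (encodeURI(ch) === ch) untouched.push(ch);
}

console.log(untouched.length);        // 82
console.log(untouched.includes('[')); // false, [ is encoded as %5B
console.log(untouched.includes('~')); // true
```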

4. encodeURI vs. encodeURIComponent

encodeURIComponent assumes the input is a single component of a URI (for example, a query string value) and therefore also encodes the characters that separate URI parts. It leaves only 71 characters unencoded: the ASCII letters and digits plus -_.!~*'().

encodeURIComponent('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789,/?:@&=+$#')
// "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789%2C%2F%3F%3A%40%26%3D%2B%24%23"

Example comparison:

encodeURIComponent('https://aotu.io/') // "https%3A%2F%2Faotu.io%2F"
encodeURI('https://aotu.io/') // "https://aotu.io/"
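In practice this means encodeURIComponent is the one to use for individual keys and values when assembling a query string. The sketch below assumes a hypothetical helper named buildUrl and a made-up /search path on the article's example host; neither comes from the original.

```javascript
// Hypothetical helper: append query parameters to a base URL.
// Each key and value is encoded on its own, so characters such as
// & and = inside a value cannot be mistaken for separators.
function buildUrl(base, params) {
  const query = Object.entries(params)
    .map(([key, value]) => encodeURIComponent(key) + '=' + encodeURIComponent(value))
    .join('&');
  return query ? base + '?' + query : base;
}

const url = buildUrl('https://aotu.io/search', {
  q: 'a&b=c',
  redirect: 'https://aotu.io/',
});
console.log(url);
// "https://aotu.io/search?q=a%26b%3Dc&redirect=https%3A%2F%2Faotu.io%2F"
```

Had encodeURI been used for the values, 'a&b=c' would pass through unchanged and the receiver would see two parameters instead of one.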

5. Character Encodings

ASCII: A 7-bit code. A byte has 8 bits and could hold 256 values, but ASCII fixes the highest bit to 0, so only 128 characters are defined.

Unicode: A universal character set that assigns a unique code point to every character in virtually all writing systems.

UTF-8: One of the encoding schemes for Unicode. It uses a variable number of bytes (1 to 4) per code point. For example, the character "凹" (U+51F9) is encoded as the three UTF-8 bytes E5 87 B9, which is why encodeURI produces "%E5%87%B9":

encodeURI('凹') // "%E5%87%B9"

Explanation of the three-byte UTF-8 sequence:

The first byte starts with "1110", indicating a three‑byte character.

The following two bytes start with "10", marking continuation bytes.

The decoder knows to combine the three bytes to form a single Unicode symbol.
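The bit layout described above can be written out directly. This is a minimal sketch for the three-byte case only, assuming a code point between U+0800 and U+FFFF; the function name utf8ThreeBytes is hypothetical, and surrogate code points are not filtered out here.

```javascript
// Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes.
function utf8ThreeBytes(codePoint) {
  if (codePoint < 0x0800 || codePoint > 0xffff) {
    throw new RangeError('this sketch only covers U+0800..U+FFFF');
  }
  const b1 = 0xe0 | (codePoint >> 12);         // 1110xxxx: top 4 bits
  const b2 = 0x80 | ((codePoint >> 6) & 0x3f); // 10xxxxxx: middle 6 bits
  const b3 = 0x80 | (codePoint & 0x3f);        // 10xxxxxx: low 6 bits
  return [b1, b2, b3];
}

const bytes = utf8ThreeBytes('凹'.codePointAt(0)); // U+51F9
const encoded = bytes.map((b) => '%' + b.toString(16).toUpperCase()).join('');
console.log(encoded); // "%E5%87%B9"
```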

