
Understanding Character Encoding: Bits, Bytes, Unicode, UTF-8, UTF-16, and UTF-32

This article traces the origins of character sets, explains the relationships among encodings such as ASCII, GB2312, GBK, GB18030, Unicode, UTF-8, UTF-16, and UTF-32, and shows how JavaScript handles Unicode and emoji, with practical code examples and a fix for length-limited input fields.

JD Tech

When encountering garbled text or unexpected emoji length in a limited input field, many developers wonder why this happens, what encoding JavaScript uses, and how different character sets relate to each other.

Basic concepts: a bit is the smallest storage unit (0 or 1), a byte consists of 8 bits, and a character is an abstract entity represented by various character sets or code pages.

In computers all information is ultimately a binary sequence. One byte can represent 256 different states, allowing a direct mapping between each 8‑bit value and a symbol.
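The byte-to-symbol mapping can be seen directly in JavaScript. The sketch below uses ASCII's mapping of byte value 65 to the letter "A":

```javascript
// A byte holds 8 bits, so it can take on 2 ** 8 = 256 distinct values.
const states = 2 ** 8; // 256

// In single-byte encodings such as ASCII, each value maps directly to
// one symbol: byte value 65 is the letter "A".
const letter = String.fromCharCode(65); // "A"
const back = 'A'.charCodeAt(0);         // 65

console.log(states, letter, back);
```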

Historical Chinese encodings: ASCII was created in the 1960s for English. China extended ASCII with GB2312, later expanded to GBK (adding 20,000 characters) and finally GB18030 to include minority scripts.

Unified character set: the ISO and the Unicode Consortium collaborated to create a universal character set (UCS). Unicode assigns a unique code point to each symbol; the version current when this was written defined over 109,000 symbols across 17 planes.

Encoding methods:

UCS‑4 uses 32 bits per code point (0–0x7FFFFFFF), but only values up to 0x10FFFF are used.

UTF‑32 is a subset of UCS‑4 limited to 0–0x10FFFF.

UTF‑16 encodes basic‑plane characters with 2 bytes and supplementary‑plane characters with 4 bytes (surrogate pairs).

UCS‑2 is an older 2‑byte fixed‑width encoding; JavaScript originally used UCS‑2 before adopting UTF‑16.

UTF‑8 encodes ASCII characters in a single byte and uses multi‑byte sequences for other symbols, making it bandwidth‑efficient.

Example conversion of a basic‑plane code point:

U+597D => 0x597D
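This identity is easy to verify in JavaScript: for a basic-plane character, the UTF-16 code unit is the code point itself.

```javascript
const ch = '好'; // U+597D
console.log(ch.charCodeAt(0).toString(16));  // "597d"
console.log(ch.codePointAt(0).toString(16)); // "597d" — same value on the basic plane
console.log(ch.length);                      // 1 — a single code unit
```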

Conversion of a supplementary‑plane code point (illustrated with JavaScript calculations):

H = Math.floor((c - 0x10000) / 0x400) + 0xD800
L = (c - 0x10000) % 0x400 + 0xDC00

In JavaScript, strings are sequences of UTF-16 code units (with UCS-2 legacy behavior), so emoji outside the basic plane are represented as surrogate pairs. When a length-limited textarea cuts an emoji between its two surrogates, the leftover lone surrogate renders as garbled text (typically the � replacement character).

To detect emoji (or any supplementary‑plane character) you can use a regular expression:

var patt = /[\ud800-\udbff][\udc00-\udfff]/g;
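For example, this pattern matches any surrogate pair. In modern engines the same check can also be written with a code-point escape and the `u` flag (an alternative, not from the original article):

```javascript
console.log(/[\ud800-\udbff][\udc00-\udfff]/.test('hello')); // false
console.log(/[\ud800-\udbff][\udc00-\udfff]/.test('hi😀'));  // true

// Equivalent modern form using a Unicode code-point range:
console.log(/[\u{10000}-\u{10FFFF}]/u.test('hi😀')); // true
```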

Solution: avoid relying on the native maxlength attribute for textareas; instead, validate the input length in JavaScript, check whether the cut point falls inside an emoji's surrogate pair, and truncate before the emoji if it does.

Summary: Unicode defines the character set, while UTF-8, UTF-16, UTF-32, UCS-2, and UCS-4 are encoding schemes. Understanding their differences helps prevent encoding-related bugs such as emoji truncation.

Tags: javascript, unicode, UTF-8, character encoding, UTF-16, text processing
Written by JD Tech

Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.
