Understanding Character Encoding and Redis SDS Dynamic String Implementation
This article explains how computers store text using binary, introduces ASCII, Unicode and UTF‑8 encoding rules, discusses the limitations of C‑style null‑terminated strings, and describes Redis's Simple Dynamic String (SDS) data structure, its old and new versions, advantages, and related APIs.
Computers store all information as binary bits; a byte consists of 8 bits, representing values from 0 to 255. To map characters to binary, the 1960s ASCII standard defined 128 characters (0‑127) with a fixed one‑byte representation.
Because ASCII cannot cover all world languages, the Unicode standard was created to assign a unique code point to every character in every script, resulting in over a million symbols. Unicode only defines the code points; a variable‑length encoding such as UTF‑8 is needed to store them efficiently.
UTF‑8 encodes a Unicode code point using 1 to 4 bytes. If the first bit of a byte is 0, the byte represents an ASCII character. If the first bits are 110, 1110, or 11110, the byte starts a 2‑, 3‑, or 4‑byte sequence respectively, and each continuation byte begins with 10. For example, the Chinese character 严 (U+4E25) is encoded as the three‑byte sequence E4 B8 A5.
C strings terminate with a null byte (\0), which leads to several problems: embedded null bytes truncate the string, the length must be computed with an O(N) scan (e.g., strlen() ), and frequent resizing can cause buffer overflows and memory fragmentation.
Redis solves these issues with the Simple Dynamic String (SDS) structure. An old‑style SDS contains three fields: len (used length), free (unused space), and buf[] (a flexible array holding the characters plus a terminating \0). This design enables O(1) length retrieval, binary‑safe storage, and reduced reallocations.
Newer SDS versions (sdshdr5, sdshdr8, sdshdr16, sdshdr32, sdshdr64) use variable‑size headers. Each header stores len , alloc (total allocated bytes), flags (type identifier), and buf[] . The __attribute__((packed)) qualifier forces a compact memory layout, shrinking header sizes from 24 bytes to as low as 17 bytes for the largest type.
The SDS API provides macros and functions to select the appropriate header type based on the required length, compute the header size, obtain free capacity, get or set the string length, and expand the buffer. Expansion follows a simple algorithm: if the new length is ≤ 1 MiB, the buffer size is doubled; otherwise, 1 MiB is added, and the header type may change, requiring a reallocation and data copy.
Additional SDS operations include releasing memory (either fully freeing or keeping the allocation for reuse) and various utility functions for concatenation, trimming, and other common string manipulations.
The article concludes with a summary of the discussed concepts and references to Redis design books and character‑encoding notes.
Xueersi Online School Tech Team
The Xueersi Online School Tech Team, dedicated to innovating and promoting internet education technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.