Fundamentals 14 min read

Understanding Character Encoding and Redis SDS Dynamic String Implementation

This article explains how computers store text using binary, introduces ASCII, Unicode and UTF‑8 encoding rules, discusses the limitations of C‑style null‑terminated strings, and describes Redis's Simple Dynamic String (SDS) data structure, its old and new versions, advantages, and related APIs.

Xueersi Online School Tech Team

Jul 9, 2021

Understanding Character Encoding and Redis SDS Dynamic String Implementation

Computers store all information as binary bits; a byte consists of 8 bits, representing values from 0 to 255. To map characters to binary, the 1960s ASCII standard defined 128 characters (0‑127) with a fixed one‑byte representation.

Because ASCII cannot cover all world languages, the Unicode standard was created to assign a unique code point to every character in every script, resulting in over a million symbols. Unicode only defines the code points; a variable‑length encoding such as UTF‑8 is needed to store them efficiently.

UTF‑8 encodes a Unicode code point using 1 to 4 bytes. If the first bit of a byte is 0, the byte represents an ASCII character. If the first bits are 110, 1110, or 11110, the byte starts a 2‑, 3‑, or 4‑byte sequence respectively, and each continuation byte begins with 10. For example, the Chinese character 严 (U+4E25) is encoded as the three‑byte sequence E4 B8 A5.

C strings terminate with a null byte (\0), which leads to several problems: embedded null bytes truncate the string, the length must be computed with an O(N) scan (e.g., strlen()), and frequent resizing can cause buffer overflows and memory fragmentation.

Redis solves these issues with the Simple Dynamic String (SDS) structure. An old‑style SDS contains three fields: len (used length), free (unused space), and buf[] (a flexible array holding the characters plus a terminating \0). This design enables O(1) length retrieval, binary‑safe storage, and reduced reallocations.

Newer SDS versions (sdshdr5, sdshdr8, sdshdr16, sdshdr32, sdshdr64) use variable‑size headers. Each header stores len, alloc (total allocated bytes), flags (type identifier), and buf[]. The __attribute__((packed)) qualifier forces a compact memory layout, shrinking header sizes from 24 bytes to as low as 17 bytes for the largest type.

The SDS API provides macros and functions to select the appropriate header type based on the required length, compute the header size, obtain free capacity, get or set the string length, and expand the buffer. Expansion follows a simple algorithm: if the new length is ≤ 1 MiB, the buffer size is doubled; otherwise, 1 MiB is added, and the header type may change, requiring a reallocation and data copy.

Additional SDS operations include releasing memory (either fully freeing or keeping the allocation for reuse) and various utility functions for concatenation, trimming, and other common string manipulations.

The article concludes with a summary of the discussed concepts and references to Redis design books and character‑encoding notes.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Redis Unicode UTF-8 character encoding SDS C strings dynamic strings

Written by

Xueersi Online School Tech Team

The Xueersi Online School Tech Team, dedicated to innovating and promoting internet education technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.