Why Traditional Block Compression Fails and How Terark’s Searchable Compression Transforms Databases
This article examines the limitations of conventional block compression in modern databases, introduces FM-Index and its drawbacks, and explains how Terark's searchable compression—combining CO-Index for keys and PA‑Zip for values—delivers higher compression ratios, faster random reads, and eliminates double‑caching overhead.
As a database with limited CPU, memory, SSD and disk resources, we aim to store more data and achieve faster access by using compression, though most algorithms rely heavily on context.
More data stored : compression algorithms vary widely.
Faster access : faster compression (write) and decompression (read) algorithms, larger caches.
Almost all compression algorithms depend on context: adjacent data tends to be more redundant, and larger contexts increase the theoretical compression limit.
Block Compression
Traditional block compression in databases
For ordinary block/file‑based compression, traditional stream compressors work well (gzip, bzip2, Snappy, LZ4, Zstd). gzip offers balanced compression ratio, speed and decompression speed; bzip2 provides high compression at the cost of speed; Snappy and LZ4 prioritize speed over ratio; Zstd balances high ratio with good speed.
Historically, B‑tree block sizes were large, allowing these compressors to obtain ample context. With modern SSDs, PCIe SSDs and large memory, block compression shows drawbacks:
Small blocks reduce compression ratio; large blocks hurt performance.
Block compression only saves cheaper, larger storage.
Large memory caches become wasteful (double‑cache problem).
Consequently, many latency‑sensitive applications disable compression.
Principles of block compression
General compressors (Snappy, LZ4, gzip, bzip2, Zstd) are applied per block/page (typically 4 KB–32 KB, TokuDB up to 2 MB–4 MB). Enabling compression slows access because:
Writes pack many records into a larger block, increasing context and compression ratio; smaller blocks lower the ratio.
Reads must decompress the entire block even for tiny data, so larger blocks cause more unnecessary decompression.
To mitigate this, databases use a dedicated cache for decompressed data, leading to double caching (OS cache for compressed data and DB cache for decompressed data) and slower queries when caches miss.
RocksDB block compression
RocksDB’s BlockBasedTable uses block compression: the index points to blocks, each containing many (key, value) pairs (M entries). Larger M reduces index size and increases block size, giving compressors more context and higher compression.
During creation, key‑value pairs fill a buffer; when the buffer reaches the configured block size, it is compressed and written, recording the file offset and first key for indexing. After all data are written, an index is built from these recorded offsets and keys; the file ends with footer metadata.
Search proceeds by locating the block containing the search key, checking the DB cache for the decompressed block, and if missing, reading and decompressing the block from disk/SSD. RocksDB offers various DB cache implementations (LRU, Clock, Counting). Setting the DIRECT_IO option bypasses the OS cache, using only the DB cache and reducing memory usage at some performance cost.
Traditional non‑mainstream compression: FM‑Index
FM‑Index (Full Text Matching Index) belongs to the Succinct Data Structure family, providing compression and direct search/extract on compressed data. It operates offline, based on BWT, supporting two main operations: data = extract(offset, length) and {offset} = search(string).
Drawbacks of FM‑Index include complex implementation, lower compression ratio than stream compressors, slow search and extract (≈7 MB/s on a 2016 i7‑6700K), high memory usage during compression, and a flat‑text model unsuitable for key‑value databases.
A simple conversion to a key‑value model can be done by surrounding each key with a unique delimiter (e.g., ‘#’) and using search(#key#) to locate the value.
Terark’s Searchable Compression
Terark proposes “searchable compression” that operates directly on compressed data, using separate global compressors for keys and values:
Discard traditional block compression in favor of global compression.
Apply different global compressors to keys and values.
Use CO‑Index (Compressed Ordered Index) for searchable key compression.
Use PA‑Zip (Point Accessible Zip) for random‑access value compression.
Key compression: CO‑Index
CO‑Index provides a compact ordered index for keys, implemented via a Nested Succinct Trie. Compared to traditional indexes, it reduces space usage by an order of magnitude while supporting rich search features:
Exact search
Range search
Sequential traversal
Prefix search
Regex search (without full scan)
CO‑Index leverages Succinct Data Structures, which represent objects near the information‑theoretic limit using bitmaps with rank/select operations. Although memory‑efficient, they have higher constant‑time overhead.
Succinct Data Structure overview
Succinct structures encode data using bitmaps; rank/select locate positions. Implementations like SDSL‑Lite exist, but Terark uses its own high‑performance rank‑select.
Succinct Tree example
Traditional binary tree nodes store two pointers (2 ptr). Optimizing pointer size to log2(N) bits reduces space to 2·log2(N) bits per node. For 10⁹ nodes, a conventional optimized tree needs ~7.5 GB, while a succinct representation needs ~312.5 MB (≈2.5 bits per node).
Succinct Trie = Succinct Tree + Trie labels
Using LOUDS encoding, a Succinct Trie can store keys like “hat”, “is”, “it”, “a”. Path compression and nesting further improve compression, resulting in the Nested Succinct Trie used in CO‑Index.
Value compression: PA‑Zip
PA‑Zip compresses all values globally, assigning each a unique ID for direct access. Benefits include higher compression ratio, elimination of double caching, and fast random reads:
Sequential decompression throughput ≈500 MB/s (single‑thread), up to 7 GB/s.
Random decompression ≈300 MB/s (single‑thread), up to 3 GB/s, enabling QPS of 300 k–1 M depending on record size.
Warm‑up simply mmaps the file; performance matches SSD sequential read speed.
Compared to FM‑Index, PA‑Zip offers superior extract speed and compression.
Combining Key and Value
Keys are stored in CO‑Index, values in PA‑Zip. A search returns an internal ID, which directly accesses the corresponding value in PA‑Zip without touching unrelated data, removing the need for a separate DB cache and relying solely on the OS file cache.
Traditional block compression derives from stream compressors and cannot exploit inter‑block redundancy.
FM‑Index provides many features but at the cost of compression ratio and speed; only a tiny subset is needed for key‑value workloads.
Appendix
Compression ratio & performance tests
Datasets: Amazon movie reviews (~9 GB, 8 M records, avg 1 KB), Wikipedia English dump (109 GB → 23 GB), TPC‑H lineitem.
Benchmarks (code open‑source on GitHub) compare compression ratios, random‑read throughput, and latency curves, showing TerarkDB’s superiority over traditional RocksDB block compression.
API interface
TerarkDB = Terark SSTable + RocksDB. RocksDB’s SSTable is plug‑in capable; Terark implements a custom SSTable to expose its compression advantages while preserving RocksDB’s API. Users can switch to TerarkDB by replacing librocksdb.so and setting environment variables, or use the C++ API for fine‑grained control.
Terark also provides a command‑line toolset for compressing data into various formats, with detailed usage in the Terark wiki.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
