How DuckDB Compression Supercharges AliSQL Storage and Cuts MySQL Costs
AliSQL integrates DuckDB as a storage engine to achieve high‑density columnar compression and fast analytical scans. This article details DuckDB’s multi‑layer storage format and adaptive compression‑algorithm selection, compares on‑disk footprints against InnoDB, HBase, ClickHouse, Doris/SelectDB, and OceanBase, and describes the engineering optimizations AliSQL adds for throughput and cost reduction.
Introduction
As business data grows, storage overhead and the need for historical data analysis become bottlenecks. InnoDB’s row‑oriented layout inflates disk usage for wide tables and log‑heavy workloads, while users still require long‑term data for analytical processing (AP) queries.
DuckDB Compression Architecture
AliSQL adopts DuckDB as its storage engine, allowing MySQL to retain ecosystem compatibility while gaining native high‑density storage and efficient column‑scan analytics. DuckDB stores data column‑wise and, during checkpoint, applies adaptive encoding and compression per column, dramatically reducing on‑disk size and improving scan performance.
Storage Format Layers
DuckDB’s table format consists of four hierarchical layers: a Table is split horizontally into Row Groups; each Row Group is split vertically into Columns (Column Data); each Column Data is further split horizontally into Column Segments; and each Column Segment maps into a ~256 KB Data Block on disk, where multiple small segments may share a single block.
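The four layers above can be sketched as nested structures. This is an illustrative model only: the class and field names are hypothetical, not DuckDB's actual identifiers, though the row‑group size (122,880 rows) and ~256 KB block size match DuckDB's documented defaults.

```python
from dataclasses import dataclass, field
from typing import List

ROWS_PER_ROW_GROUP = 122_880   # DuckDB's default row-group size (60 vectors x 2048 rows)
BLOCK_SIZE = 256 * 1024        # ~256 KB on-disk Data Block

@dataclass
class ColumnSegment:           # horizontal slice of one column's data
    block_id: int              # which ~256 KB block holds it (blocks can be shared)
    offset: int                # byte offset within that block
    row_count: int

@dataclass
class ColumnData:              # one column's data within a row group
    name: str
    segments: List[ColumnSegment] = field(default_factory=list)

@dataclass
class RowGroup:                # horizontal partition of the table
    start_row: int
    columns: List[ColumnData] = field(default_factory=list)

@dataclass
class Table:
    name: str
    row_groups: List[RowGroup] = field(default_factory=list)

def partition_rows(total_rows: int) -> List[RowGroup]:
    """Split a table horizontally into row groups of ROWS_PER_ROW_GROUP rows."""
    return [RowGroup(start_row=s)
            for s in range(0, total_rows, ROWS_PER_ROW_GROUP)]

groups = partition_rows(300_000)
print(len(groups))  # 300,000 rows -> 3 row groups
```

Because each layer is independently addressable, DuckDB can choose a different compression scheme per Column Segment while still locating any row by (row group, column, segment).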
Compression Process
Read compression‑related configuration.
Initialize the analysis state set (analyze_states).
Iterate over all Column Segments in a Column Data; for each column (data column and validity column) run each candidate compression function, recording results in analyze_states. Inapplicable algorithms are discarded.
Score all candidate algorithms and select the one with the smallest post‑compression size.
Return the chosen compression algorithm set.
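The selection loop above can be sketched in miniature. This is a simplified stand‑in for DuckDB's analyze/score phases, with only two toy candidates (a hand‑rolled RLE and zlib as a general‑purpose fallback); the function names and the scoring heuristic are assumptions for illustration.

```python
import zlib

def rle_compress(values):
    """Run-length encode; returns None when RLE is inapplicable (no runs at all)."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    if len(runs) == len(values):   # no repetition: this candidate is discarded
        return None
    return runs

CANDIDATES = {
    "rle": rle_compress,
    "zlib": lambda vs: zlib.compress(str(vs).encode("utf-8")),
}

def choose_algorithm(column_values):
    # Steps 2-3: run every applicable candidate, recording results in analyze_states.
    analyze_states = {}
    for name, fn in CANDIDATES.items():
        result = fn(column_values)
        if result is not None:     # inapplicable algorithms are dropped
            analyze_states[name] = result
    # Step 4: score by encoded size and pick the smallest.
    def size_of(encoded):
        return len(encoded) * 2 if isinstance(encoded, list) else len(encoded)
    return min(analyze_states, key=lambda n: size_of(analyze_states[n]))

print(choose_algorithm([7] * 1000))  # long runs favour RLE
```

The key property this preserves from the real process is that selection is data‑driven: every candidate is actually tried on the column's values, and the winner is whichever produced the smallest result.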
Compression Algorithms
Numeric/Boolean types – various RLE, Bit‑packing, ZSTD, etc.
String types – dictionary, FSST, etc.
Floating‑point types – specialized encodings such as ALP and Chimp.
Validity (bitmap) types – Roaring bitmap, etc.
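To make the string case concrete, here is a hedged sketch of dictionary encoding for a low‑cardinality string column. It mirrors the idea only, not DuckDB's actual on‑disk layout: repeated strings are stored once in a dictionary and the column keeps small integer codes.

```python
def dictionary_encode(values):
    """Replace repeated strings with integer codes plus a shared dictionary."""
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)   # assign the next code
        codes.append(dictionary[v])
    return list(dictionary), codes            # dict preserves insertion order

def dictionary_decode(dictionary, codes):
    return [dictionary[c] for c in codes]

column = ["MySQL", "DuckDB", "MySQL", "MySQL", "DuckDB"]
dic, codes = dictionary_encode(column)
assert dictionary_decode(dic, codes) == column
print(dic, codes)  # 2 distinct strings, 5 small integer codes
```

The win grows with repetition: a million‑row status column with five distinct values stores five strings plus a million small codes instead of a million strings, and the codes themselves can then be bit‑packed or run‑length encoded.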
Compression Effectiveness
Columnar storage and adaptive encoding in DuckDB dramatically shrink disk footprints. In a sysbench benchmark (25 tables, 20 M rows each), InnoDB consumes ~127 GB while DuckDB uses only ~43 GB. In TPC‑H SF100, InnoDB occupies ~168 GB versus DuckDB’s ~26 GB, a 6.5× reduction. Real‑world production archives show substantial gains as well: a local‑services platform reduced 133 TB to 66 TB, and a consumer‑finance platform cut 94 TB to 49 TB, both roughly halving storage.
Comparison with Other Systems
Traditional MySQL/InnoDB stores data in B+Tree pages without compression by default, leading to large disk usage for analytical workloads. Systems designed for analytics—HBase (LSM‑tree with block‑level compression), ClickHouse (columnar with per‑column codecs), Doris/SelectDB (columnar segments with dictionary/RLE), and OceanBase (micro‑block encoding + macro‑block compression)—all achieve better compression than InnoDB, but DuckDB still yields the smallest on‑disk size among them.
AliSQL Compression Optimizations
Reuse of compression/encoding strategies: cache the column‑level encoding and compression decisions to avoid repeated CPU‑intensive analysis.
Concurrent path improvements: reduce lock contention in critical compression paths, stabilizing multi‑threaded throughput.
Sampling and parallel pipelines: lower the cost of strategy selection via sampling and accelerate the compression stage with multi‑threaded pipelines.
Hardware‑assisted storage compression: combine software compression with intelligent hardware chips and enterprise‑grade NAND flash to achieve up to 50 % storage cost reduction without impacting compute resources.
Conclusion
DuckDB’s compression advantage stems from its columnar organization plus adaptive algorithm selection, using Column Data as the decision unit and applying specialized encodings such as RLE, Bit‑packing, dictionary, ALP/Chimp, and Roaring. This yields significantly smaller on‑disk footprints and lower I/O for analytical queries, making DuckDB‑backed AliSQL an effective solution for low‑cost, high‑performance historical data archiving and AP analysis within the familiar MySQL interface.
References
https://mp.weixin.qq.com/s/GeCMoC43T7sIMzZS94Yq3A
https://github.com/oceanbase/oceanbase
https://github.com/apache/hbase
https://github.com/ClickHouse/ClickHouse
https://github.com/apache/doris
https://github.com/duckdb/duckdb
https://help.aliyun.com/zh/rds/apsaradb-rds-for-mysql/storage-compression
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.