
How DuckDB Compression Supercharges AliSQL Storage and Cuts MySQL Costs

AliSQL integrates DuckDB as a storage engine to achieve high‑density columnar compression and fast analytical scans. This article details DuckDB's multi‑layer storage format, its adaptive compression algorithm selection, performance benchmarks against InnoDB, HBase, ClickHouse, and OceanBase, and the engineering optimizations AliSQL adds for throughput and cost reduction.

Alibaba Cloud Developer

Introduction

As business data grows, storage overhead and the need for historical data analysis become bottlenecks. InnoDB’s row‑oriented layout inflates disk usage for wide tables and log‑heavy workloads, while users still require long‑term data for analytical processing (AP) queries.

DuckDB Compression Architecture

AliSQL adopts DuckDB as its storage engine, allowing MySQL to retain ecosystem compatibility while gaining native high‑density storage and efficient column‑scan analytics. DuckDB stores data column‑wise and, during checkpoint, applies adaptive encoding and compression per column, dramatically reducing on‑disk size and improving scan performance.

Storage Format Layers

DuckDB’s table format consists of four hierarchical layers: a Table is split horizontally into Row Groups; each Row Group is split vertically into Columns (Column Data); each Column Data is further split horizontally into Column Segments; each Column Segment maps to a ~256 KB Data Block on disk, though blocks may be shared.
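The four layers above can be sketched as a small data model. This is an illustrative sketch, not DuckDB's internal structures; `ROW_GROUP_SIZE` reflects DuckDB's documented default of 122,880 rows per row group, and the ~256 KB block size comes from the description above.

```python
from dataclasses import dataclass, field
from typing import List

ROW_GROUP_SIZE = 122_880   # rows per Row Group (DuckDB's documented default)
BLOCK_SIZE = 256 * 1024    # ~256 KB on-disk Data Block, per the article

@dataclass
class ColumnSegment:
    row_start: int
    row_count: int
    block_id: int          # maps to a ~256 KB Data Block (blocks may be shared)

@dataclass
class ColumnData:
    name: str
    segments: List[ColumnSegment] = field(default_factory=list)

@dataclass
class RowGroup:
    row_start: int
    row_count: int
    columns: List[ColumnData] = field(default_factory=list)

def partition_table(total_rows: int, column_names: List[str]) -> List[RowGroup]:
    """Split a table horizontally into Row Groups, then vertically into columns."""
    groups = []
    for start in range(0, total_rows, ROW_GROUP_SIZE):
        count = min(ROW_GROUP_SIZE, total_rows - start)
        cols = [ColumnData(name=n) for n in column_names]
        groups.append(RowGroup(row_start=start, row_count=count, columns=cols))
    return groups
```

For example, a 300,000‑row table with two columns partitions into three row groups, the last holding the 54,240 remaining rows.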

Compression Process

1. Read the compression‑related configuration.

2. Initialize the analysis state set (analyze_states).

3. Iterate over all Column Segments in a Column Data; for each column (the data column and its validity column), run every candidate compression function and record the results in analyze_states. Inapplicable algorithms are discarded.

4. Score all candidate algorithms and select the one with the smallest post‑compression size.

5. Return the chosen compression algorithm set.
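The selection loop above can be sketched in a few lines. The stand‑in candidates here are general‑purpose codecs from the Python standard library; real DuckDB scores format‑specific encodings (RLE, bit‑packing, FSST, and so on), so treat this only as a shape of the algorithm.

```python
import bz2
import lzma
import zlib

# Illustrative stand-ins for DuckDB's candidate compression functions.
CANDIDATES = {
    "uncompressed": lambda b: b,
    "zlib": lambda b: zlib.compress(b, 9),
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}

def choose_compression(segment: bytes) -> tuple[str, int]:
    """Run every candidate on the segment, record results (analyze_states),
    and return the algorithm with the smallest post-compression size."""
    analyze_states = {}
    for name, fn in CANDIDATES.items():
        try:
            analyze_states[name] = len(fn(segment))
        except Exception:
            continue  # inapplicable algorithms are discarded
    best = min(analyze_states, key=analyze_states.get)
    return best, analyze_states[best]
```

On a highly repetitive segment every real codec beats storing the bytes raw, so "uncompressed" loses the scoring; on incompressible data it would win, which is why it stays in the candidate set.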

Compression Algorithms

Numeric/Boolean types – RLE variants, bit‑packing, ZSTD, etc.

String types – dictionary, FSST, etc.

Floating‑point types – specialized encodings such as ALP and Chimp.

Validity (bitmap) types – Roaring bitmap, etc.
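Two of the encodings named above, RLE and dictionary encoding, are simple enough to illustrate directly. This is a toy sketch of the ideas, not DuckDB's implementation:

```python
from typing import List, Tuple

def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Run-length encoding: store (value, run_length) pairs.
    Effective for columns with long runs of repeated values."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def dict_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Dictionary encoding: store each unique string once plus
    small integer codes. Effective for low-cardinality string columns."""
    dictionary: List[str] = []
    codes: List[int] = []
    index: dict = {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes
```

A status column containing a million rows but only five distinct values shrinks to a five‑entry dictionary plus one small code per row, which is where columnar layouts earn their compression advantage.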

Compression Effectiveness

Columnar storage and adaptive encoding in DuckDB dramatically shrink disk footprints. In a sysbench benchmark (25 tables, 20 M rows each), InnoDB consumes ~127 GB while DuckDB uses only ~43 GB. In TPC‑H SF100, InnoDB occupies ~168 GB versus DuckDB’s ~26 GB, a 6.5× reduction. Real‑world production archives also show similar gains, e.g., a local‑service platform reduced 133 TB to 66 TB, and a consumer‑finance platform cut 94 TB to 49 TB.
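The reduction factors implied by these figures are easy to verify from the quoted sizes:

```python
# Compression ratios implied by the benchmark figures quoted above
# (before/after sizes: GB for the benchmarks, TB for the archives).
benchmarks = {
    "sysbench (25 tables x 20M rows)": (127, 43),
    "TPC-H SF100": (168, 26),
    "local-service platform archive": (133, 66),
    "consumer-finance platform": (94, 49),
}
for name, (before, after) in benchmarks.items():
    print(f"{name}: {before / after:.1f}x reduction")
```

TPC‑H SF100 works out to 168 / 26 ≈ 6.5×, matching the figure above; the production archives land around 2×, which is typical when already‑compact row data is re‑encoded columnwise.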

Comparison with Other Systems

Traditional MySQL/InnoDB stores data in B+Tree pages without compression by default, leading to large disk usage for analytical workloads. Systems designed for analytics—HBase (LSM‑tree with block‑level compression), ClickHouse (columnar with per‑column codecs), Doris/SelectDB (columnar segments with dictionary/RLE), and OceanBase (micro‑block encoding + macro‑block compression)—all achieve better compression than InnoDB, but DuckDB still yields the smallest on‑disk size among them.

AliSQL Compression Optimizations

Reuse of compression/encoding strategies: cache the column‑level encoding and compression decisions to avoid repeated CPU‑intensive analysis.

Concurrent path improvements: reduce lock contention in critical compression paths, stabilizing multi‑threaded throughput.

Sampling and parallel pipelines: lower the cost of strategy selection via sampling and accelerate the compression stage with multi‑threaded pipelines.

Hardware‑assisted storage compression: combine software compression with intelligent hardware chips and enterprise‑grade NAND flash to achieve up to 50 % storage cost reduction without impacting compute resources.
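The first optimization in the list, reusing encoding decisions, amounts to memoizing the per‑column strategy so repeated checkpoints skip the CPU‑intensive candidate analysis. AliSQL's internals are not public, so the names and cache shape below are hypothetical:

```python
# Hypothetical sketch of caching per-column compression decisions.
_strategy_cache: dict = {}

def compression_strategy(table: str, column: str, analyze) -> str:
    """Return the cached strategy for (table, column),
    running the expensive candidate analysis only on a cache miss."""
    key = (table, column)
    if key not in _strategy_cache:
        _strategy_cache[key] = analyze()  # expensive candidate scoring
    return _strategy_cache[key]
```

A real implementation would also invalidate the cache when the column's data distribution shifts, trading a little compression ratio for a large saving in checkpoint CPU time.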

Conclusion

DuckDB’s compression advantage stems from its columnar organization plus adaptive algorithm selection, using Column Data as the decision unit and applying specialized encodings such as RLE, Bit‑packing, dictionary, ALP/Chimp, and Roaring. This yields significantly smaller on‑disk footprints and lower I/O for analytical queries, making DuckDB‑backed AliSQL an effective solution for low‑cost, high‑performance historical data archiving and AP analysis within the familiar MySQL interface.

References

https://mp.weixin.qq.com/s/GeCMoC43T7sIMzZS94Yq3A

https://github.com/oceanbase/oceanbase

https://github.com/apache/hbase

https://github.com/ClickHouse/ClickHouse

https://github.com/apache/doris

https://github.com/duckdb/duckdb

https://help.aliyun.com/zh/rds/apsaradb-rds-for-mysql/storage-compression
