
How Dameng Implements Columnar Storage, Smart Indexes, and Adaptive Compression

This article explains Dameng's columnar storage architecture, the smart index mechanism that leverages zone statistics to reduce I/O, and the adaptive compression algorithms—including dictionary, constant, RLE, and sequence encoding—used to achieve high compression ratios on columnar data.

dbaplus Community

Jun 16, 2016


1. Columnar Storage Data Organization

Dameng supports both row and column storage. Columnar tables, called HUGE tables, are built on an HTS tablespace that functions like a simple file system. When a HUGE table is created, the system creates a schema directory named SCH + 9‑digit ID, then a table directory named TAB + 4‑digit ID. Each column gets a .dta file whose name follows the pattern COL<4‑digit column number>_<10‑digit file number>, with a default size of 64 MB that can grow automatically.

Files are divided into zones (the smallest data‑management unit, similar to a page in row storage) that are 4 KB aligned. An auxiliary table with 15 columns (COLID, SEC_ID, FILE_ID, OFFSET, COUNT, ACOUNT, N_LEN, N_NULL, N_DIST, MAX_VAL, MIN_VAL, SUM_VAL, CPR_FLAG, ENC_FLAG, CHKSUM) stores metadata for each zone, enabling fast location and statistical analysis.
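The naming scheme and the 4 KB zone alignment described above can be sketched in a few lines of Python. This is only an illustration: the function names, the root directory, the example IDs, and the assumption that ".dta" is appended directly to the COL pattern are all hypothetical and are not taken from Dameng itself.

```python
from pathlib import PurePosixPath

ZONE_ALIGNMENT = 4096  # zones are 4 KB aligned


def column_file_path(root, schema_id, table_id, column_no, file_no):
    """Build the path of one column file (hypothetical helper, not a Dameng API)."""
    schema_dir = f"SCH{schema_id:09d}"   # SCH + 9-digit ID
    table_dir = f"TAB{table_id:04d}"     # TAB + 4-digit ID
    # COL<4-digit column number>_<10-digit file number>; the ".dta" suffix is assumed
    file_name = f"COL{column_no:04d}_{file_no:010d}.dta"
    return PurePosixPath(root) / schema_dir / table_dir / file_name


def align_zone_offset(offset):
    """Round a byte offset up to the next 4 KB boundary."""
    return (offset + ZONE_ALIGNMENT - 1) // ZONE_ALIGNMENT * ZONE_ALIGNMENT


# Example values are made up for illustration.
path = column_file_path("/dmdata/HMAIN", schema_id=150994945, table_id=12,
                        column_no=1, file_no=0)
print(path)                     # /dmdata/HMAIN/SCH150994945/TAB0012/COL0001_0000000000.dta
print(align_zone_offset(5000))  # 8192
```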

2. Smart Index Implementation

During scans with filter conditions, Dameng uses per‑zone statistics (minimum, maximum, and so on) to rule out zones that cannot contain matching rows, which avoids unnecessary I/O. Because these statistics can take the place of a traditional B‑Tree index, the mechanism is referred to as a "smart index". The min/max values enable zone skipping, while the other statistics allow aggregate functions such as MAX(), COUNT(), and AVG() to be answered directly from the auxiliary table without scanning the data zones.

Smart index min/max filtering illustration
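A minimal Python sketch of both uses of zone statistics follows. The ZoneStats class mirrors a few auxiliary‑table columns (SEC_ID, COUNT, N_NULL, MIN_VAL, MAX_VAL, SUM_VAL), but the class, the function names, and the assumption that COUNT includes null rows are illustrative guesses, not Dameng's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class ZoneStats:
    sec_id: int
    count: int     # rows in the zone (assumed to include nulls)
    n_null: int    # null rows in the zone
    min_val: int
    max_val: int
    sum_val: int


def zones_to_scan(zones, low, high):
    """Keep only zones whose [min_val, max_val] range overlaps [low, high]."""
    return [z.sec_id for z in zones if z.max_val >= low and z.min_val <= high]


def aggregates_from_stats(zones):
    """Answer MAX, COUNT and AVG from metadata alone, without reading data zones."""
    non_null = sum(z.count - z.n_null for z in zones)
    total = sum(z.sum_val for z in zones)
    return {
        "max": max(z.max_val for z in zones),
        "count": non_null,
        "avg": total / non_null,
    }


zones = [
    ZoneStats(1, 1000, 0, 1, 1000, 500500),
    ZoneStats(2, 1000, 0, 1001, 2000, 1500500),
    ZoneStats(3, 1000, 0, 2001, 3000, 2500500),
]
print(zones_to_scan(zones, 1500, 1600))  # [2]
print(aggregates_from_stats(zones))      # {'max': 3000, 'count': 3000, 'avg': 1500.5}
```

For the predicate "value between 1500 and 1600", only zone 2 has to be read; the other two are skipped based on their min/max values alone.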

3. Adaptive Compression Algorithms

Because column data are stored contiguously, larger compression units and higher data similarity enable far better compression than row storage. Dameng provides four zone‑level encoding strategies:

Dictionary encoding: builds a symbol table for repeated values; column values are stored as IDs referencing the table.

Constant encoding: when a single value dominates a zone, stores that value once and records exceptions as <row‑number+value>.

RLE encoding: suitable for long runs of identical values; stores the value and its run length.

Sequence encoding: applied when values form an arithmetic progression or follow a predictable algebraic relationship.
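The four encodings can be illustrated with toy Python functions operating on a non‑empty list of values for one zone. These are simplified sketches under assumed output layouts; the function names and return shapes are hypothetical, and the sequence encoder only covers the arithmetic‑progression case.

```python
from collections import Counter
from itertools import groupby


def dictionary_encode(values):
    """Return (symbol table, list of IDs referencing the table)."""
    symbols = list(dict.fromkeys(values))  # distinct values in first-seen order
    index = {v: i for i, v in enumerate(symbols)}
    return symbols, [index[v] for v in values]


def constant_encode(values):
    """Return (dominant value, [(row number, exceptional value), ...])."""
    dominant, _ = Counter(values).most_common(1)[0]
    exceptions = [(row, v) for row, v in enumerate(values) if v != dominant]
    return dominant, exceptions


def rle_encode(values):
    """Return [(value, run length), ...] for consecutive identical values."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def sequence_encode(values):
    """Return (start, step, count) for an arithmetic progression, else None."""
    if len(values) < 2:
        return None
    step = values[1] - values[0]
    if all(b - a == step for a, b in zip(values, values[1:])):
        return values[0], step, len(values)
    return None


print(dictionary_encode(["cn", "us", "cn", "cn"]))  # (['cn', 'us'], [0, 1, 0, 0])
print(constant_encode([0, 0, 7, 0, 0]))             # (0, [(2, 7)])
print(rle_encode([5, 5, 5, 9, 9]))                  # [(5, 3), (9, 2)]
print(sequence_encode([10, 12, 14, 16]))            # (10, 2, 4)
```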

Selection strategy:

If the column is an auto‑increment or sequence, use sequence encoding directly.

Collect zone statistics: distinct value count (n_dist), frequency of each distinct value, pointers, run lengths, and maximum integer.

Choose encoding in order of preference: constant → RLE → dictionary, based on the distribution of values.

If the encoded size exceeds the original size, skip encoding and keep the raw data.
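The selection steps above might look roughly like the following Python sketch. The article does not give the actual thresholds or size formulas, so dominant_ratio, min_avg_run, value_size, id_size, and the 4‑byte row‑number and run‑length fields are all placeholder assumptions chosen only to make the control flow concrete.

```python
from collections import Counter
from itertools import groupby


def choose_encoding(values, is_sequence_column=False,
                    dominant_ratio=0.9, min_avg_run=4.0,
                    value_size=8, id_size=2):
    """Pick an encoding for one non-empty zone (all thresholds are assumed)."""
    if is_sequence_column:
        return "sequence"  # auto-increment or sequence column: no analysis needed

    # Collect zone statistics.
    n = len(values)
    freq = Counter(values)
    n_dist = len(freq)
    top_count = freq.most_common(1)[0][1]
    runs = sum(1 for _ in groupby(values))
    raw_size = n * value_size

    # Preference order: constant, then RLE, then dictionary.
    if top_count / n >= dominant_ratio:
        name = "constant"
        encoded_size = value_size + (n - top_count) * (4 + value_size)
    elif n / runs >= min_avg_run:
        name = "rle"
        encoded_size = runs * (value_size + 4)
    else:
        name = "dictionary"
        encoded_size = n_dist * value_size + n * id_size

    # Keep the raw data if encoding would make the zone larger.
    return name if encoded_size <= raw_size else "raw"


print(choose_encoding([0] * 95 + [1, 2, 3, 4, 5]))   # constant
print(choose_encoding([1] * 50 + [2] * 50))          # rle
print(choose_encoding(["a", "b", "c", "d"] * 25))    # dictionary
print(choose_encoding(list(range(100))))             # raw
```

In the last call every value is distinct, so the assumed dictionary size (symbol table plus IDs) exceeds the raw size and the zone is left unencoded.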

After encoding, zones may be further compressed with QUICKLZ or LZ‑1‑9 algorithms, achieving several‑fold to hundreds‑fold size reduction, especially for ordered or highly repetitive data.
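To get a feel for how strongly repetitive or ordered column data responds to general‑purpose compression, the sketch below packs integers as 64‑bit values and compresses them. It uses zlib from the Python standard library purely as a stand‑in; it is not QUICKLZ and not Dameng's compressor, and the helper name and data are made up for illustration.

```python
import struct
import zlib


def compression_ratio(values, level=6):
    """Pack integers as little-endian int64 and return raw size / compressed size."""
    raw = struct.pack(f"<{len(values)}q", *values)
    packed = zlib.compress(raw, level)
    return len(raw) / len(packed)


repetitive = [42] * 100_000          # one value repeated, like a constant column
ordered = list(range(100_000))       # monotonically increasing, like an ID column

print(f"repetitive: {compression_ratio(repetitive):.0f}x")
print(f"ordered:    {compression_ratio(ordered):.1f}x")
```

The exact ratios depend on the compressor and level, so no expected output is shown; the repetitive column should shrink by far more than the ordered one when no prior encoding is applied.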

4. Q&A Highlights

Smart index data types: Not applicable to large objects such as BLOB or CLOB; works for all other basic data types supported by Dameng.

Dictionary size: One dictionary per zone; its size grows with the number of distinct values. If the compression gain does not meet a threshold, dictionary encoding is abandoned.

Directory naming: The “SCH+9‑digit ID” is the schema code, not the table’s tabid.

Effectiveness on extreme values: Smart index remains effective unless the min/max values across zones are too close to one another, which would prevent useful filtering.

Null handling: Nulls are treated as a special value; when many rows in a zone are null, the compression ratio remains high.

