How Dameng Implements Columnar Storage, Smart Indexes, and Adaptive Compression

This article explains Dameng's columnar storage architecture, the smart index mechanism that leverages zone statistics to reduce I/O, and the adaptive compression algorithms—including dictionary, constant, RLE, and sequence encoding—used to achieve high compression ratios on columnar data.

dbaplus Community

1. Columnar Storage Data Organization

Dameng supports both row and column storage. Columnar tables, called HUGE tables, are built on an HTS tablespace that functions like a simple file system. When an HFS table is created, the system first creates a schema directory named SCH + 9-digit ID, then a table directory named TAB + 4-digit ID under it. Each column gets a .dta file whose name follows the pattern COL<4-digit column number>_<10-digit file number>, with a default size of 64 MB that can grow automatically. Files are divided into zones (the smallest data-management unit, similar to a page in row storage) that are 4 KB aligned. An auxiliary table with 15 columns (COLID, SEC_ID, FILE_ID, OFFSET, COUNT, ACOUNT, N_LEN, N_NULL, N_DIST, MAX_VAL, MIN_VAL, SUM_VAL, CPR_FLAG, ENC_FLAG, CHKSUM) stores metadata for each zone, so that zones can be located quickly and statistics can be read without touching the data.
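The naming scheme and the 4 KB alignment can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the zero padding, the placement of the .dta suffix after the pattern, and rounding offsets up to the next boundary are assumptions rather than documented Dameng behavior.

```python
from pathlib import PurePosixPath

ZONE_ALIGN = 4096  # zones are described as 4 KB aligned


def column_file_path(schema_id: int, table_id: int, col_no: int, file_no: int) -> PurePosixPath:
    """Build the relative path of one column file under the HTS root.

    Assumes IDs are zero-padded to the stated widths and that ".dta"
    follows the COL<col>_<file> pattern.
    """
    schema_dir = f"SCH{schema_id:09d}"
    table_dir = f"TAB{table_id:04d}"
    file_name = f"COL{col_no:04d}_{file_no:010d}.dta"
    return PurePosixPath(schema_dir) / table_dir / file_name


def aligned_offset(offset: int) -> int:
    """Round a byte offset up to the next 4 KB boundary (assumed rounding rule)."""
    return (offset + ZONE_ALIGN - 1) // ZONE_ALIGN * ZONE_ALIGN
```

For example, `column_file_path(1, 2, 3, 0)` yields `SCH000000001/TAB0002/COL0003_0000000000.dta` under these assumptions.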

HTS directory structure for HFS tables

2. Smart Index Implementation

During scans with filter conditions, Dameng uses per-zone statistics (minimum, maximum, and so on) to skip zones that cannot contain matching rows, reducing unnecessary I/O. These statistics can replace a traditional B-Tree index and are referred to as a "smart index". The min/max values enable zone skipping, while the other statistics let aggregate functions such as MAX(), COUNT(), and AVG() be answered directly from the auxiliary table without scanning the data zones.
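A minimal sketch of both ideas follows. The field names mirror the auxiliary-table columns COUNT, N_NULL, MIN_VAL, MAX_VAL, and SUM_VAL, but their exact semantics, the `ZoneStats` class, and both functions are assumptions made for illustration, not Dameng's implementation.

```python
from dataclasses import dataclass


@dataclass
class ZoneStats:
    sec_id: int   # zone identifier
    count: int    # rows in the zone (assumed meaning of COUNT)
    n_null: int   # NULL rows in the zone
    min_val: int
    max_val: int
    sum_val: int


def zones_to_scan(zones, lo, hi):
    """Return ids of zones whose [min, max] range overlaps the predicate [lo, hi]."""
    return [z.sec_id for z in zones if z.max_val >= lo and z.min_val <= hi]


def aggregates_from_stats(zones):
    """Answer COUNT, MAX, and AVG from zone metadata alone, without reading data."""
    non_null = sum(z.count - z.n_null for z in zones)
    total = sum(z.sum_val for z in zones)
    return {
        "count": non_null,
        "max": max(z.max_val for z in zones),
        "avg": total / non_null if non_null else None,
    }
```

With three zones covering 1 to 100, 101 to 200, and 201 to 300, a predicate such as `BETWEEN 150 AND 180` only needs the middle zone; the other two are never read.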

Smart index min/max filtering illustration

3. Adaptive Compression Algorithms

Because the values of a column are stored contiguously, compression units are larger and neighboring values are more similar, which enables far better compression than row storage. Dameng provides four zone-level encoding strategies:

Dictionary encoding: builds a symbol table for repeated values; column values are stored as IDs referencing the table.

Constant encoding: when a single value dominates a zone, stores that value once and records each exception as a <row number, value> pair.

RLE encoding: suitable for long runs of identical values; stores each value together with its run length.

Sequence encoding: applied when values form an arithmetic progression or follow a predictable algebraic relationship.
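The four encodings can be illustrated with a small Python sketch. The function names and output layouts are hypothetical and only demonstrate the ideas; Dameng's actual on-disk formats are not described in this article. The sequence encoder handles only the arithmetic-progression case.

```python
from collections import Counter


def dictionary_encode(values):
    """Return (symbol table, per-row ids into that table)."""
    symbols, index, ids = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(symbols)
            symbols.append(v)
        ids.append(index[v])
    return symbols, ids


def constant_encode(values):
    """Return (dominant value, [(row number, exception value), ...]).

    Assumes a non-empty zone.
    """
    dominant, _ = Counter(values).most_common(1)[0]
    exceptions = [(row, v) for row, v in enumerate(values) if v != dominant]
    return dominant, exceptions


def rle_encode(values):
    """Return [(value, run length), ...] for consecutive identical values."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def sequence_encode(values):
    """Return (start, step, count) for an arithmetic progression, else None."""
    if len(values) < 2:
        return None
    step = values[1] - values[0]
    if all(b - a == step for a, b in zip(values, values[1:])):
        return values[0], step, len(values)
    return None
```

For instance, `rle_encode([1, 1, 1, 2, 2, 3])` gives `[(1, 3), (2, 2), (3, 1)]`, and `sequence_encode([10, 12, 14, 16])` gives `(10, 2, 4)`.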

Selection strategy:

If the column is an auto-increment or sequence column, use sequence encoding directly.

Otherwise, collect zone statistics: the number of distinct values (n_dist), the frequency of each distinct value, pointers, run lengths, and the maximum integer.

Choose encoding in order of preference: constant → RLE → dictionary, based on the distribution of values.

If the encoded size exceeds the original size, skip encoding and keep the raw data.
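The selection steps above can be sketched as follows. The thresholds `DOMINANCE` and `MIN_AVG_RUN`, the byte widths, and the size estimates are all hypothetical values chosen for illustration; the article does not state which cut-offs Dameng uses.

```python
from collections import Counter
from itertools import groupby

VALUE_BYTES = 8    # assumed fixed width of one stored value
ID_BYTES = 4       # assumed width of a row number, run length, or dictionary id
DOMINANCE = 0.9    # hypothetical: share of rows one value needs for constant encoding
MIN_AVG_RUN = 4    # hypothetical: average run length needed for RLE


def choose_encoding(values, is_sequence_column=False):
    """Pick an encoding name for one zone, falling back to "raw"."""
    if is_sequence_column:
        return "sequence"
    if not values:
        return "raw"

    # Collect zone statistics.
    n = len(values)
    freq = Counter(values)
    n_dist = len(freq)
    top_count = max(freq.values())
    n_runs = sum(1 for _ in groupby(values))
    raw_size = n * VALUE_BYTES

    # Preference order: constant, then RLE, then dictionary.
    if top_count / n >= DOMINANCE:
        choice = "constant"
        size = VALUE_BYTES + (n - top_count) * (ID_BYTES + VALUE_BYTES)
    elif n / n_runs >= MIN_AVG_RUN:
        choice = "rle"
        size = n_runs * (VALUE_BYTES + ID_BYTES)
    else:
        choice = "dictionary"
        size = n_dist * VALUE_BYTES + n * ID_BYTES

    # Keep the raw data when encoding would not shrink the zone.
    return choice if size <= raw_size else "raw"
```

A zone of 100 unique values falls through to dictionary encoding, whose estimated size (1200 bytes) exceeds the raw 800 bytes, so the sketch returns `"raw"`.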

After encoding, zones may be further compressed with QUICKLZ or LZ‑1‑9 algorithms, achieving several‑fold to hundreds‑fold size reduction, especially for ordered or highly repetitive data.
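The second stage can be illustrated with a general-purpose compressor. QuickLZ is not part of the Python standard library, so the sketch below substitutes `zlib` as a stand-in, and reading "LZ-1-9" as compression levels 1 through 9 is an assumption. The one-byte flag is a hypothetical analogue of the CPR_FLAG column, not Dameng's actual format.

```python
import struct
import zlib


def compress_zone(payload: bytes, level: int = 6) -> bytes:
    """Compress an encoded zone; keep the raw bytes if compression does not help."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    packed = zlib.compress(payload, level)
    if len(packed) >= len(payload):
        return b"\x00" + payload   # flag 0: stored uncompressed
    return b"\x01" + packed        # flag 1: zlib-compressed


def decompress_zone(blob: bytes) -> bytes:
    """Reverse compress_zone using the leading flag byte."""
    flag, body = blob[:1], blob[1:]
    return zlib.decompress(body) if flag == b"\x01" else body


# A highly repetitive zone: 1000 copies of the same 8-byte integer.
repetitive = struct.pack("<1000q", *([42] * 1000))
blob = compress_zone(repetitive, level=9)
```

Highly repetitive zones such as `repetitive` shrink to a small fraction of their 8000 bytes, while very short or high-entropy payloads are stored unchanged.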

Dictionary encoding example
Constant encoding example
RLE encoding example
Sequence encoding example

4. Q&A Highlights

Smart index data types: Not applicable to large objects such as BLOB or CLOB; works for all other basic data types supported by Dameng.

Dictionary size: There is one dictionary per zone, and its size grows with the number of distinct values. If the compression gain does not meet a threshold, dictionary encoding is abandoned for that zone.

Directory naming: The "SCH + 9-digit ID" is the schema code, not the table's tabid.

Effectiveness on extreme values: The smart index remains effective unless the min/max values of different zones are too close to one another, in which case few zones can be filtered out.

Null handling: NULL is treated as a special value; when many rows in a zone are NULL, the compression ratio remains high.
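The point about smart index effectiveness can be made concrete with a small experiment. The sketch below compares the same 10,000 values laid out in sorted order and in an interleaved order where every zone spans almost the whole value range. The zone size of 1,000 rows and both helper functions are hypothetical choices for illustration.

```python
def zone_ranges(values, zone_rows):
    """Split values into fixed-size zones and return each zone's (min, max)."""
    ranges = []
    for start in range(0, len(values), zone_rows):
        chunk = values[start:start + zone_rows]
        ranges.append((min(chunk), max(chunk)))
    return ranges


def zones_skipped(ranges, lo, hi):
    """Count zones whose [min, max] range cannot overlap the predicate [lo, hi]."""
    return sum(1 for zmin, zmax in ranges if zmax < lo or zmin > hi)


ordered = list(range(10_000))
# Interleaved layout: zone k holds k, k + 10, k + 20, ... so it spans nearly 0..9999.
scattered = [v for start in range(10) for v in range(start, 10_000, 10)]

ordered_ranges = zone_ranges(ordered, 1_000)
scattered_ranges = zone_ranges(scattered, 1_000)
```

For a predicate covering 2500 to 2600, `zones_skipped(ordered_ranges, 2500, 2600)` is 9 of 10 zones, while `zones_skipped(scattered_ranges, 2500, 2600)` is 0, because every zone in the interleaved layout has nearly the same min/max.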
