Why Columnar Storage Powers Modern Analytics: Design, Encoding, and Real-World Systems
This article explains the history, data layout, encoding, compression techniques, and real‑world implementations of column‑oriented storage, showing how it reduces I/O and improves cache efficiency for analytical workloads while highlighting its trade‑offs for transactional use.
What is Columnar Storage
Column‑oriented storage was first described in the 1983 Cantor paper. It fell out of favor when hardware limited its benefits, and row‑based storage became dominant for OLTP workloads. With the rise of analytical (OLAP) queries, columnar layouts regained attention because they reduce storage footprint, lower I/O, and enable computation‑friendly data structures.
Row‑vs‑Column Layout
Traditional OLTP databases store rows contiguously, often using B+‑tree or SSTable indexes on primary keys. This layout is optimal for CRUD operations that touch whole rows. In a columnar layout, all values of a single column are stored together as one long array, so each column can be scanned independently of the others.
Benefits for OLAP include:
Only the columns referenced by a query need to be scanned.
Each column contains homogeneous data, which yields higher compression ratios.
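The difference between the two layouts can be sketched in a few lines of Python. The table, column names, and values below are hypothetical and chosen only for illustration; real engines operate on pages and byte buffers rather than Python lists.

```python
# Hypothetical three-column table; names and values are illustrative only.
rows = [
    (1, "alice", 30.0),
    (2, "bob", 12.5),
    (3, "carol", 7.5),
]

# Row layout: each record is stored contiguously.
row_store = list(rows)

# Column layout: each column is one contiguous array.
column_store = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

def sum_amount_row(store):
    # Every full record is visited even though only one field is needed.
    return sum(record[2] for record in store)

def sum_amount_column(store):
    # Only the "amount" array is read; "id" and "name" are never touched.
    return sum(store["amount"])

total_row = sum_amount_row(row_store)
total_col = sum_amount_column(column_store)
```

Both functions return the same total; the column version simply reads less data to get there, which is the first benefit listed above.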
Decomposition Storage Model (DSM) Paging
Disk‑based storage abstracts blocks as pages; aligning database pages with physical sectors improves read/write efficiency. Most OLTP DBMS use the N‑ary Storage Model (NSM), storing whole rows per page with an index of row offsets. NSM wastes I/O when queries need only a few columns. Analytical databases adopt the Decomposition Storage Model (DSM), storing each column on separate pages (column paging). An index at the page tail points to column offsets, enabling column‑wise scans.
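As a rough sketch of the tail index described above, the following Python packs several integer columns into one byte string and appends an index of column offsets at the end. The layout (little-endian int32 values, an (offset, count) pair per column, and a trailing column count) is an assumption made for this example and does not reproduce any particular DBMS page format.

```python
import struct

def build_dsm_page(columns):
    """Pack int32 columns back to back, then append a tail index.

    Assumed layout (illustrative only):
      [col 0 values][col 1 values]...[(offset, count) per column][num_columns]
    """
    body = b""
    index = []
    for values in columns:
        index.append((len(body), len(values)))
        body += struct.pack(f"<{len(values)}i", *values)
    tail = b"".join(struct.pack("<II", off, n) for off, n in index)
    return body + tail + struct.pack("<I", len(columns))

def read_column(page, col):
    # The last 4 bytes hold the column count; the index sits just before it.
    (num_cols,) = struct.unpack_from("<I", page, len(page) - 4)
    index_start = len(page) - 4 - 8 * num_cols
    off, n = struct.unpack_from("<II", page, index_start + 8 * col)
    # Jump straight to the requested column without decoding the others.
    return list(struct.unpack_from(f"<{n}i", page, off))

page = build_dsm_page([[1, 2, 3], [10, 20, 30]])
second = read_column(page, 1)
```

Reading a column costs one lookup in the tail index plus one contiguous read, which is what makes column-wise scans cheap in this model.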
Column Encoding and Compression
I/O is often the bottleneck for both disk‑based and in‑memory databases, so effective compression improves performance. C‑Store’s encoding schemes illustrate four cases based on data order (self‑ordered vs. unordered) and distinct‑value count (few vs. many):
Ordered & few distinct values : Encode runs as triples (value, first_row, count). Example: value 4 appears from row 12 to row 18 → (4,12,7).
Unordered & few distinct values : Build a bitmap per distinct value; sparse bitmaps can be further run‑length encoded.
Ordered & many distinct values : Store deltas between successive values. Example: 1,4,7,7,8,12 → 1,3,3,0,1,4.
Unordered & many distinct values : No efficient encoding; data is stored raw.
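The first three cases can be sketched in Python as follows. The function names are hypothetical and the code illustrates the ideas only; it is not C‑Store's actual on-disk format.

```python
def rle_encode(values):
    """Ordered, few distinct values: emit (value, first_row, count) triples."""
    runs = []
    for row, v in enumerate(values):
        if runs and runs[-1][0] == v:
            value, first, count = runs[-1]
            runs[-1] = (value, first, count + 1)
        else:
            runs.append((v, row, 1))
    return runs

def bitmap_encode(values):
    """Unordered, few distinct values: one list of 0/1 bits per distinct value."""
    bitmaps = {}
    for row, v in enumerate(values):
        bitmaps.setdefault(v, [0] * len(values))[row] = 1
    return bitmaps

def delta_encode(values):
    """Ordered, many distinct values: first value, then successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

# Rows 0..11 hold 1, rows 12..18 hold 4, matching the run-length example above.
runs = rle_encode([1] * 12 + [4] * 7)
bitmaps = bitmap_encode([0, 1, 0, 2])
deltas = delta_encode([1, 4, 7, 7, 8, 12])
```

Here `runs` ends with the triple (4, 12, 7) and `deltas` equals 1, 3, 3, 0, 1, 4, the same results as the examples in the list.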
After encoding, column data is compressed with algorithms such as Snappy, which exploit the similarity of values within a column to achieve high compression ratios.
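A small experiment makes the effect visible. Snappy is not part of the Python standard library, so this sketch substitutes zlib as a stand-in general-purpose compressor; the status values are invented for the example and the exact ratio will differ between algorithms and data sets.

```python
import zlib

# A hypothetical low-cardinality column: 12,000 values drawn from 3 strings.
statuses = ["OK", "OK", "OK", "FAILED", "OK", "RETRY"] * 2000

# Serialize the column contiguously, as a columnar layout would.
raw = "\n".join(statuses).encode("utf-8")

# zlib stands in for Snappy here; both exploit repeated byte sequences.
compressed = zlib.compress(raw)

# Decompression must give back exactly the original column.
restored = zlib.decompress(compressed).decode("utf-8").split("\n")

ratio = len(raw) / len(compressed)
```

Because the column holds only a handful of distinct, repeating values, the compressed form is far smaller than the raw bytes.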
Columnar Storage on Distributed File Systems
Modern big‑data architectures store data on distributed file systems (e.g., GFS, HDFS). Because network latency dominates access cost, these systems favor large sequential reads and avoid random writes; data is instead written in batches of tens of megabytes.
Sequential large‑block reads maximize throughput.
Writes are typically performed as append‑only batches to amortize network overhead.
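The write pattern can be sketched as a small buffering writer. The class name `BatchAppender` and the tiny 64-byte threshold are hypothetical, chosen to keep the demonstration short; a real system would flush at tens of megabytes and write to a distributed file system client rather than an in-memory buffer.

```python
import io

class BatchAppender:
    """Buffer small writes and flush them to the sink as one large append."""

    def __init__(self, sink, batch_bytes):
        self.sink = sink              # any object with a write(bytes) method
        self.batch_bytes = batch_bytes
        self.buffer = bytearray()
        self.flushes = 0

    def append(self, record: bytes):
        self.buffer += record
        # Only touch the sink once enough data has accumulated.
        if len(self.buffer) >= self.batch_bytes:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.write(bytes(self.buffer))
            self.buffer.clear()
            self.flushes += 1

sink = io.BytesIO()
writer = BatchAppender(sink, batch_bytes=64)   # tiny threshold for the demo
for _ in range(10):
    writer.append(b"x" * 16)
writer.flush()                                  # write out the final partial batch
```

Ten 16-byte records reach the sink in three writes instead of ten, which is how batching amortizes the per-request network cost.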
Representative Columnar Systems
C‑Store / Vertica (2005)
C‑Store was designed as a read‑optimized analytical engine (the precursor of modern HTAP systems). It introduces Projections: vertical partitions of a table, each of which may contain multiple columns and optional indexes. A query selects a set of Projections that together cover all the columns it needs and joins them via a Join Index. Redundant Projections also provide fault tolerance (K‑safety), because a query can still succeed as long as at least one covering set remains available.
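Choosing a covering set can be sketched as a greedy set-cover loop. The function name, the projection names, and the greedy strategy are assumptions made for this example; C‑Store's actual optimizer is cost-based, and the join through a Join Index is not modeled here.

```python
def covering_set(query_columns, projections):
    """Greedily pick projections until every queried column is covered.

    `projections` maps a projection name to the set of columns it stores.
    Returns a list of projection names, or None if coverage is impossible.
    """
    needed = set(query_columns)
    chosen = []
    while needed:
        # Pick the projection that covers the most still-missing columns.
        best = max(projections, key=lambda p: len(projections[p] & needed),
                   default=None)
        if best is None or not (projections[best] & needed):
            return None
        chosen.append(best)
        needed -= projections[best]
    return chosen

# Hypothetical projections over a table with columns id, name, amount, date.
all_projections = {
    "p1": {"id", "name"},
    "p2": {"id", "amount"},
    "p3": {"name", "amount", "date"},
}
plan = covering_set({"name", "amount"}, all_projections)

# Simulate losing the node that holds p3: the query falls back to p1 + p2.
surviving = {k: v for k, v in all_projections.items() if k != "p3"}
fallback_plan = covering_set({"name", "amount"}, surviving)
```

With every projection available a single projection answers the query; after one is lost, the redundant projections still cover the same columns, which is the fault-tolerance property described above.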
Apache ORC
ORC (Optimized Row Columnar) was created for Hive and is now widely used in the Hadoop ecosystem. It is a self‑describing file format with three index levels:
File level : Footer stores column statistics (min/max, null distribution, Bloom filters) and overall metadata.
Stripe level : A Stripe corresponds to a range partition of the original table; each Stripe contains column data for that range and its own footer‑level index.
Row‑Group level : Within a column, every 10 000 rows form a Row‑Group with its own row‑level statistics.
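The pruning that these statistics enable can be sketched with min/max values per row group. This is a simplified illustration with hypothetical function names and a group size of 10 rows; it does not read or write the real ORC format.

```python
def build_row_group_stats(values, group_size):
    """Record (start, end, min, max) for each fixed-size run of rows."""
    stats = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        stats.append((start, start + len(chunk), min(chunk), max(chunk)))
    return stats

def scan_equal(values, stats, target):
    """Return matching row numbers, skipping groups whose range excludes target."""
    hits, groups_read = [], 0
    for start, end, lo, hi in stats:
        if target < lo or target > hi:
            continue                  # statistics prove no row here can match
        groups_read += 1
        hits.extend(i for i in range(start, end) if values[i] == target)
    return hits, groups_read

column = list(range(100))             # sorted data makes pruning very effective
stats = build_row_group_stats(column, group_size=10)
hits, groups_read = scan_equal(column, stats, 42)
```

Only one of the ten row groups is actually scanned for the lookup. The same check applies at the stripe and file levels, just with coarser ranges.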
ORC writes data in large Stripes (analogous to DB pages) to amortize network latency. For ACID‑like support, ORC uses immutable Delta files that overlay base data. Periodic minor and major compactions merge Delta files into the base, a pattern common to many columnar systems.
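The overlay-and-compact pattern can be sketched with dictionaries keyed by row id. The function names and the use of None as a delete marker are assumptions for this example and do not reflect the actual delta file layout.

```python
def read_merged(base, deltas):
    """Overlay deltas (oldest first) on base rows keyed by row id."""
    merged = dict(base)
    for delta in deltas:
        for row_id, row in delta.items():
            if row is None:
                merged.pop(row_id, None)   # None marks a delete
            else:
                merged[row_id] = row
    return merged

def minor_compact(deltas):
    """Merge several deltas into one delta; delete markers are kept."""
    combined = {}
    for delta in deltas:
        combined.update(delta)
    return [combined]

def major_compact(base, deltas):
    """Fold all deltas into a new base, leaving no deltas behind."""
    return read_merged(base, deltas), []

base = {1: "a", 2: "b", 3: "c"}
deltas = [{2: "B"}, {3: None, 4: "d"}]     # an update, then a delete and an insert

view = read_merged(base, deltas)
after_minor = minor_compact(deltas)
new_base, remaining = major_compact(base, deltas)
```

Readers see the same logical table before compaction, after a minor compaction, and after a major compaction; only the number of files that must be merged at read time changes.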
Dremel / Apache Parquet (2010)
Dremel is Google’s large‑scale, read‑only query engine that stores data directly on GFS. Its columnar format inspired Apache Parquet. Dremel encodes nested Protobuf structures using two auxiliary columns:
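Repetition Level (R) : Indicates at which repeated field in the value's path the repetition occurred; 0 marks the start of a new record.
Definition Level (D) : Indicates how many optional or repeated fields in the value's path are actually present, so that a NULL entry records how deep the path was defined before it ended.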
These levels allow a state machine to reconstruct the original nested records while skipping irrelevant data.
[Figure: a schema, two example documents, and the resulting columnar representation with R and D columns.]
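How R and D are assigned can be sketched for a single column. The example is modeled on the Name.Language.Code column from the Dremel paper, simplified so that a record is just a list of names and each name is a list of codes (two nested repeated levels, hence a maximum R and D of 2). The function `shred_codes` is hypothetical and hard-codes this one schema; a general implementation walks an arbitrary schema tree.

```python
def shred_codes(records):
    """Flatten records into (value, R, D) triples for a two-level repeated path.

    Each record is a list of names; each name is a list of code strings.
    """
    column = []
    for names in records:
        if not names:
            # Nothing on the path is defined: new record, depth 0.
            column.append((None, 0, 0))
            continue
        for i, codes in enumerate(names):
            # The first name starts a new record (R=0); later names repeat at level 1.
            r_name = 0 if i == 0 else 1
            if not codes:
                # The name exists but has no codes: defined to depth 1 only.
                column.append((None, r_name, 1))
                continue
            for j, code in enumerate(codes):
                # Further codes within the same name repeat at level 2.
                r = r_name if j == 0 else 2
                column.append((code, r, 2))
    return column

r1 = [["en-us", "en"], [], ["en-gb"]]   # three names, the second has no codes
r2 = [[]]                                # one name with no codes
column = shred_codes([r1, r2])
```

An R of 0 always marks the first entry of a new record, so a reader can split the flat column back into records, while the NULL entries with D below 2 preserve the empty names that would otherwise be lost.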
Common Design Patterns in Columnar Systems
Skip irrelevant data : Columnar layouts avoid scanning unused columns; multi‑level indexes (e.g., ORC) prune whole data blocks.
Encoding doubles as indexing : Distinct‑value run‑length encoding, bitmap indexes, and dictionary encodings serve both compression and index purposes.
Assume immutable data : Systems treat base data as append‑only; updates are stored in separate Delta files and merged lazily.
Vertical and horizontal partitioning : Large‑scale deployments split data both by columns (vertical) and by shards/blocks (horizontal) to achieve scalability.
dbaplus Community
