How In-Memory Databases Organize Data and Indexes: Techniques and System Comparisons
This article explains how modern in‑memory database systems organize data: fixed‑ and variable‑length blocks, partitioned versus non‑partitioned architectures, and row versus column storage. It then surveys cache‑aware index structures, including the CSB+-Tree, PB+-Tree, Bw‑Tree, Adaptive Radix Tree, OLFIT, and skiplists, using Hekaton, H‑Store/VoltDB, HyPer, and SAP HANA as examples.
Data Organization in In‑Memory Databases
When managing data entirely in memory, a DBMS still groups records into blocks (pages). Two block types are used: Fixed‑Length Data Blocks for attributes that fit into a small, constant size (e.g., < 8 bytes) and Variable‑Length Data Blocks for longer or variable‑size attributes. Fixed blocks enable fast address calculation and compact pointers, while variable blocks store the actual data with a pointer from the fixed block.
Fixed‑Length vs Variable‑Length Blocks
All fixed‑length attributes are placed in a dedicated block; variable‑length attributes are stored separately, with a pointer in the fixed block. This layout reduces indexing overhead and improves CPU pre‑fetch accuracy.
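A minimal sketch of this layout in C++, with illustrative field names not taken from any particular system: fixed‑length attributes live in constant‑size slots whose addresses can be computed directly, and a variable‑length attribute is reached through an offset into a separate byte heap.

```cpp
// Sketch: fixed-length slots hold small attributes inline plus an offset
// into a variable-length block that stores the longer payloads.
#include <cstdint>
#include <string>
#include <vector>

struct VarBlock {                               // variable-length block: raw byte heap
    std::vector<uint8_t> bytes;
    uint32_t append(const std::string& s) {     // returns the offset of the payload
        uint32_t off = static_cast<uint32_t>(bytes.size());
        uint32_t len = static_cast<uint32_t>(s.size());
        bytes.insert(bytes.end(), reinterpret_cast<uint8_t*>(&len),
                     reinterpret_cast<uint8_t*>(&len) + sizeof(len));
        bytes.insert(bytes.end(), s.begin(), s.end());
        return off;
    }
};

struct FixedSlot {                              // fixed-length slot: constant size
    uint64_t id;                                // small fixed attributes
    int32_t  balance;
    uint32_t name_off;                          // pointer (offset) into the variable block
};

int main() {
    VarBlock var;
    std::vector<FixedSlot> fixed;               // slot address = base + index * sizeof(FixedSlot)
    fixed.push_back({1, 500, var.append("Alice")});
    fixed.push_back({2, 900, var.append("Bob")});
}
```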
Partition vs Non‑Partition Systems
In multi‑core or multi‑CPU environments, systems are classified as Partition Systems (data is split into disjoint partitions, each bound to a core or node, which eliminates the need for concurrency control within a partition) or Non‑Partition Systems (any thread may access any data, which requires concurrency‑friendly structures). Partition systems achieve the best performance when data can be cleanly divided, but they are hard to apply to general workloads, as illustrated by the routing sketch below.
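The following sketch illustrates the partitioned idea with hypothetical names: a hash of the key selects the owning partition, and because each partition is serviced by exactly one worker thread, no latching is needed inside it.

```cpp
// Sketch of partition-style routing: one partition per core, single-owner access.
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

struct Partition {
    std::map<long, long> rows;                   // owned by a single thread -> no locks needed
};

class PartitionedStore {
    std::vector<Partition> parts_;
public:
    explicit PartitionedStore(std::size_t n) : parts_(n) {}
    std::size_t route(long key) const {          // decide which partition owns the key
        return std::hash<long>{}(key) % parts_.size();
    }
    void put(long key, long value) {             // executed by the partition's owner thread
        parts_[route(key)].rows[key] = value;
    }
};

int main() {
    PartitionedStore store(4);                   // e.g. one partition per core
    store.put(42, 7);
}
```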
Row vs Column Storage
Traditional disk‑based DBMSs choose row or column storage based on query patterns. In memory, the I/O cost gap between the two layouts largely disappears, so both row and column layouts are viable. Transactional (TP) workloads usually prefer row storage for full‑record reads, while analytical (AP) workloads benefit from columnar access. SAP HANA combines both: a row‑store front end and a column‑store back end.
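A simple illustration of the two layouts (field names are invented for the example): the row layout keeps each record contiguous, which suits reading whole records, while the column layout keeps each attribute contiguous, so an analytical scan over one attribute touches far fewer cache lines.

```cpp
// Sketch: row layout (array-of-structs) vs. column layout (struct-of-arrays).
#include <cstdint>
#include <vector>

struct RowRecord {                    // row store: one contiguous struct per record
    uint64_t id;
    int32_t  price;
    int32_t  quantity;
};
using RowTable = std::vector<RowRecord>;

struct ColumnTable {                  // column store: one contiguous vector per attribute
    std::vector<uint64_t> id;
    std::vector<int32_t>  price;
    std::vector<int32_t>  quantity;
};

// Analytical scan touches only the 'price' column -> fewer cache lines loaded.
long long sum_prices(const ColumnTable& t) {
    long long sum = 0;
    for (int32_t p : t.price) sum += p;
    return sum;
}

int main() {
    ColumnTable t;
    t.price = {10, 20, 30};
    return sum_prices(t) == 60 ? 0 : 1;
}
```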
Representative In‑Memory Systems
Hekaton
Hekaton is a Non‑Partition system that uses multi‑version concurrency control. Each record version carries a begin and an end timestamp. Tables may have up to eight indexes (hash or range). Record versions are stored non‑contiguously and linked via pointers.
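A rough sketch of such a version record, with assumed field names rather than Hekaton's actual definitions: the begin/end timestamps determine visibility, and one link per index chains versions that may live anywhere in memory.

```cpp
// Sketch of a multi-version record: timestamps plus per-index links.
#include <atomic>
#include <cstdint>
#include <string>

constexpr int kMaxIndexes = 8;          // up to eight indexes per table

struct RecordVersion {
    std::atomic<uint64_t> begin_ts;     // transaction/time that created this version
    std::atomic<uint64_t> end_ts;       // "infinity" until superseded or deleted
    RecordVersion* next[kMaxIndexes];   // one link per index: versions are not contiguous
    std::string payload;                // the actual column values
};

// A version is visible to a reader at time ts if begin_ts <= ts < end_ts.
bool visible(const RecordVersion& v, uint64_t ts) {
    return v.begin_ts.load() <= ts && ts < v.end_ts.load();
}
```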
H‑Store/VoltDB
H‑Store/VoltDB is a Partition System. Each partition is bound to a single‑threaded execution site and processes its transactions serially. A multi‑partition transaction must acquire all involved partitions before it can start. The architecture consists of a Transaction Coordinator (which coordinates cross‑partition execution) and Execution Engines that store data in a single‑version row layout.
HyPer
HyPer is a multi‑version Non‑Partition system designed for HTAP (Hybrid Transactional/Analytical Processing). It uses a Copy‑on‑Write snapshot mechanism: a forked OS process provides a read‑only snapshot for analytical queries while the original process continues OLTP updates. Only pages subsequently changed by the OLTP side are physically copied; unchanged pages remain shared through the virtual‑memory mapping.
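The idea can be sketched with plain POSIX fork(): the child process sees a frozen snapshot of the parent's address space, and the operating system copies a page only when the parent subsequently modifies it. This is a simplified illustration of the mechanism, not HyPer's actual code.

```cpp
// Sketch: fork-based copy-on-write snapshot for an analytical reader.
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
    std::vector<long> table = {1, 2, 3, 4};       // OLTP-maintained in-memory data

    pid_t pid = fork();                            // take a virtual-memory snapshot
    if (pid == 0) {                                // child: read-only analytical query
        long sum = 0;
        for (long v : table) sum += v;             // sees data as of the fork
        std::printf("snapshot sum = %ld\n", sum);
        _exit(0);
    }

    table[0] = 100;                                // parent keeps processing updates;
                                                   // only touched pages get copied
    waitpid(pid, nullptr, 0);
    return 0;
}
```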
SAP HANA
SAP HANA is a Non‑Partition hybrid system. Records flow through three stages: L1‑Delta (a row store for transaction processing), L2‑Delta (a column store with unsorted dictionary encoding), and the Main store (a highly compressed column store with a sorted dictionary). This pipeline provides a multi‑version view of each record.
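A condensed sketch of the dictionary‑encoding difference between the two column stages (structure and names are illustrative): the delta side appends to an unsorted dictionary so inserts stay cheap, while the main store's dictionary is rebuilt in sorted order during the merge, giving order‑preserving, highly compressible value IDs.

```cpp
// Sketch: unsorted (delta) vs. sorted (main) dictionary encoding.
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct UnsortedDict {                               // delta: append-only, cheap inserts
    std::vector<std::string> values;
    std::unordered_map<std::string, uint32_t> ids;
    uint32_t encode(const std::string& v) {
        auto it = ids.find(v);
        if (it != ids.end()) return it->second;
        uint32_t id = static_cast<uint32_t>(values.size());
        values.push_back(v);
        ids.emplace(v, id);
        return id;
    }
};

struct SortedDict {                                 // main: built once during the delta merge
    std::vector<std::string> values;                // kept sorted
    explicit SortedDict(std::vector<std::string> v) {
        std::sort(v.begin(), v.end());
        v.erase(std::unique(v.begin(), v.end()), v.end());
        values = std::move(v);
    }
    uint32_t encode(const std::string& v) const {   // binary search, order-preserving IDs
        return static_cast<uint32_t>(
            std::lower_bound(values.begin(), values.end(), v) - values.begin());
    }
};
```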
Index Techniques in In‑Memory DBMS
Because disk I/O is eliminated, index design focuses on cache‑awareness and multi‑core parallelism. Reducing memory‑stall cycles and improving pre‑fetch efficiency are the primary goals.
CSB+-Tree
CSB+-Tree keeps node sizes as multiples of a cache line and stores all children of a node contiguously in a Children Group, so a parent needs only a single child pointer; this reduces pointer overhead and improves sequential cache reads. During a split, the affected children group is re‑allocated as a new contiguous block, while the tree as a whole does not require contiguous memory.
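A minimal sketch of the node layout (sizes and member names are illustrative): a single pointer addresses the contiguous children group, so child i is found by offset arithmetic rather than through a stored pointer per child.

```cpp
// Sketch: CSB+-Tree-style inner node with one pointer to a contiguous children group.
#include <cstdint>

constexpr int kKeysPerNode = 14;        // chosen so a node fills whole cache lines

struct CSBNode {
    uint16_t  num_keys;
    int32_t   keys[kKeysPerNode];       // keys packed contiguously for sequential reads
    CSBNode*  children;                 // single pointer: child i lives at children[i]
};

// Descend one level: find the separator, then index into the contiguous group.
CSBNode* descend(const CSBNode& n, int32_t key) {
    int i = 0;
    while (i < n.num_keys && key >= n.keys[i]) ++i;
    return &n.children[i];              // children[0..num_keys] are allocated back-to-back
}
```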
PB+-Tree
PB+-Tree is a cache‑line‑aligned B+-Tree whose nodes carry pre‑fetch information so that all of a node's cache lines can be fetched ahead of the key search. Compared with a baseline B+-Tree, reported speedups are up to 1.5× for point searches and 6× for range scans.
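The core mechanism can be sketched as follows, assuming the GCC/Clang __builtin_prefetch intrinsic: every cache line of a node is prefetched before the key search reads it, hiding part of the memory latency behind useful work.

```cpp
// Sketch: prefetch all cache lines of a multi-line node before searching it.
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;

struct PBNode {                                      // node deliberately spans several cache lines
    alignas(kCacheLine) int32_t keys[60];
    uint16_t num_keys;
    PBNode*  children[61];
};

constexpr std::size_t kLinesPerNode =
    (sizeof(PBNode) + kCacheLine - 1) / kCacheLine;

// Issue a prefetch for every cache line of the node before the key search.
void prefetch_node(const PBNode* n) {
    const char* p = reinterpret_cast<const char*>(n);
    for (std::size_t i = 0; i < kLinesPerNode; ++i)
        __builtin_prefetch(p + i * kCacheLine);      // GCC/Clang intrinsic
}
```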
Bw‑Tree
Bw‑Tree, used by Hekaton, relies on atomic Compare‑and‑Swap operations instead of locks. Nodes are referenced via a Mapping Table that stores logical pointers; updates are prepended to a node as delta records, which readers apply during traversal and which are periodically consolidated into a new base node, yielding a latch‑free structure.
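A simplified sketch of the latch‑free update path (names and structure are illustrative): a logical node ID indexes the mapping table, and an update is installed by compare‑and‑swapping a new delta record onto the head of that node's chain; if another thread wins the race, the CAS fails and the update retries.

```cpp
// Sketch: Bw-Tree-style mapping table with CAS-installed delta records.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Delta {                          // describes one change (insert/delete/...)
    int64_t key;
    int64_t value;
    Delta*  next;                       // rest of the delta chain, ending at the base node
};

class MappingTable {
    std::vector<std::atomic<Delta*>> slots_;
public:
    explicit MappingTable(std::size_t n) : slots_(n) {}

    // Latch-free update: prepend a delta; retry if another thread won the race.
    void install(std::size_t node_id, int64_t key, int64_t value) {
        Delta* d = new Delta{key, value, nullptr};
        Delta* head = slots_[node_id].load();
        do {
            d->next = head;             // chain the current head behind our delta
        } while (!slots_[node_id].compare_exchange_weak(head, d));
    }

    Delta* read(std::size_t node_id) const {        // readers traverse the delta chain
        return slots_[node_id].load();
    }
};
```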
Adaptive Radix Tree
Adaptive Radix Tree (ART) uses node types of varying fan‑out (Node4, Node16, Node48, Node256) that match the number of child keys, keeping each node size a multiple of a cache line. Path compression removes single‑child nodes, further improving cache locality.
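Two of the node types can be sketched as follows (simplified, with illustrative member names): Node4 compares a handful of key bytes directly, while Node48 indirects through a 256‑entry child index; the real tree grows and shrinks nodes between such types as the number of children changes.

```cpp
// Sketch of two ART inner-node types and their child lookup.
#include <cstdint>

struct ARTNode { uint8_t type; };                    // common header

struct Node4 : ARTNode {                             // up to 4 children
    uint8_t  keys[4];                                // sorted key bytes
    ARTNode* children[4];
    uint8_t  count;
};

struct Node48 : ARTNode {                            // 17..48 children
    uint8_t  child_index[256];                       // key byte -> slot (255 = empty)
    ARTNode* children[48];
};

ARTNode* find_child_node4(const Node4& n, uint8_t key_byte) {
    for (int i = 0; i < n.count; ++i)
        if (n.keys[i] == key_byte) return n.children[i];
    return nullptr;
}

ARTNode* find_child_node48(const Node48& n, uint8_t key_byte) {
    uint8_t slot = n.child_index[key_byte];
    return slot == 255 ? nullptr : n.children[slot];
}
```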
OLFIT on B+-Tree
OLFIT (Optimistic Latch‑Free Index Traversal) eliminates latch overhead for readers by using per‑node version numbers. Writers still acquire a latch and increment the version; readers validate that the version has not changed during their access and retry otherwise. Because read‑only traversals never write to shared node state, they avoid the cache‑coherence invalidations that latch acquisition would cause across CPUs.
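A minimal sketch of the optimistic read path, assuming a per‑node version word whose low bit doubles as the writer latch: the reader copies what it needs, then re‑checks the version and retries if a writer intervened.

```cpp
// Sketch: OLFIT-style optimistic node read with version validation.
#include <atomic>
#include <cstdint>

struct OlfitNode {
    std::atomic<uint64_t> version{0};   // even = unlatched, odd = latched by a writer
    int32_t keys[32];
    int32_t num_keys{0};
};

// Optimistic read: copy what we need, then validate that no writer intervened.
bool read_first_key(const OlfitNode& n, int32_t* out) {
    for (;;) {
        uint64_t v1 = n.version.load(std::memory_order_acquire);
        if (v1 & 1) continue;                        // a writer holds the latch: retry
        int32_t cnt = n.num_keys;                    // read node contents without latching
        int32_t k   = cnt > 0 ? n.keys[0] : 0;
        uint64_t v2 = n.version.load(std::memory_order_acquire);
        if (v1 == v2) { *out = k; return cnt > 0; }  // unchanged: the read is consistent
        // version changed: the read raced a writer, retry
    }
}
```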
Skiplists
Skiplists, employed by MemSQL, organize data as a hierarchy of ordered linked lists in which each element of a level appears in the level above with probability P. Insertions use atomic Compare‑and‑Swap instead of locks, while searches descend the levels with an expected logarithmic cost similar to binary search.
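Two building blocks can be sketched as follows (a simplification that ignores deletions and the multi‑level search path): the probabilistic choice of a node's height, and a CAS‑based insert into one level's sorted list; a production skiplist repeats the CAS insert per level.

```cpp
// Sketch: skiplist level generation and a lock-free single-level insert.
#include <atomic>
#include <cstdint>
#include <random>

constexpr int    kMaxLevel = 16;
constexpr double kP        = 0.5;                    // promotion probability

struct SkipNode {
    int64_t key;
    int     level;
    std::atomic<SkipNode*> next[kMaxLevel];
};

int random_level(std::mt19937& rng) {                // geometric height distribution
    std::bernoulli_distribution promote(kP);
    int lvl = 1;
    while (lvl < kMaxLevel && promote(rng)) ++lvl;
    return lvl;
}

// Lock-free insert of `node` after the correct predecessor on `level`.
void insert_at_level(std::atomic<SkipNode*>& head, SkipNode* node, int level) {
    for (;;) {
        std::atomic<SkipNode*>* pred = &head;        // find pred/succ with
        SkipNode* succ = pred->load();               // pred->key < node->key <= succ->key
        while (succ && succ->key < node->key) {
            pred = &succ->next[level];
            succ = pred->load();
        }
        node->next[level].store(succ);
        if (pred->compare_exchange_weak(succ, node)) return;
        // CAS failed: another thread changed the list; retry from the head.
    }
}
```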
Conclusion
The article first described data organization in in‑memory databases, covering block layout, partitioning strategies, and row/column storage, and compared four real‑world systems. It then surveyed six index techniques, illustrating their design principles and performance benefits. The next part will examine concurrency control, durability, and query compilation in memory‑optimized DBMSs.