How In-Memory Databases Organize Data and Indexes: Techniques and System Comparisons
This article explains how modern in‑memory database systems organize data: fixed‑ and variable‑length blocks, partitioned versus non‑partitioned architectures, and row versus column storage. It then surveys cache‑aware index structures, including the CSB+-Tree, PB+-Tree, Bw‑Tree, Adaptive Radix Tree, OLFIT, and skiplists, using Hekaton, H‑Store/VoltDB, HyPer, and SAP HANA as examples.
Data Organization in In‑Memory Databases
When managing data entirely in memory, a DBMS still groups records into blocks (pages). Two block types are used: Fixed‑Length Data Blocks for attributes that fit into a small, constant size (e.g., < 8 bytes) and Variable‑Length Data Blocks for longer or variable‑size attributes. Fixed blocks enable fast address calculation and compact pointers, while variable blocks store the actual data with a pointer from the fixed block.
Fixed‑Length vs Variable‑Length Blocks
All fixed‑length attributes are placed in a dedicated block; variable‑length attributes are stored separately, with a pointer in the fixed block. This layout reduces indexing overhead and improves CPU pre‑fetch accuracy.
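A minimal sketch of this layout in C++, with illustrative field names not taken from any particular system: fixed‑length attributes live in constant‑size slots whose addresses can be computed directly, and a variable‑length attribute is reached through an offset into a separate byte heap.

```cpp
// Sketch: fixed-length slots hold small attributes inline plus an offset
// into a variable-length block that stores the longer payloads.
#include <cstdint>
#include <string>
#include <vector>

struct VarBlock {                               // variable-length block: raw byte heap
    std::vector<uint8_t> bytes;
    uint32_t append(const std::string& s) {     // returns the offset of the payload
        uint32_t off = static_cast<uint32_t>(bytes.size());
        uint32_t len = static_cast<uint32_t>(s.size());
        bytes.insert(bytes.end(), reinterpret_cast<uint8_t*>(&len),
                     reinterpret_cast<uint8_t*>(&len) + sizeof(len));
        bytes.insert(bytes.end(), s.begin(), s.end());
        return off;
    }
};

struct FixedSlot {                              // fixed-length slot: constant size
    uint64_t id;                                // small fixed attributes
    int32_t  balance;
    uint32_t name_off;                          // pointer (offset) into the variable block
};

int main() {
    VarBlock var;
    std::vector<FixedSlot> fixed;               // slot address = base + index * sizeof(FixedSlot)
    fixed.push_back({1, 500, var.append("Alice")});
    fixed.push_back({2, 900, var.append("Bob")});
}
```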
Partition vs Non‑Partition Systems
In multi‑core or multi‑CPU environments, systems are classified as Partition Systems (data is split into disjoint partitions, each bound to a core or node, which eliminates the need for concurrency control within a partition) or Non‑Partition Systems (any thread may access any data, which requires concurrency‑friendly structures). Partition systems achieve the best performance when data can be cleanly divided, but they are hard to apply to general workloads, as illustrated by the routing sketch below.
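The following sketch illustrates the partitioned idea with hypothetical names: a hash of the key selects the owning partition, and because each partition is serviced by exactly one worker thread, no latching is needed inside it.

```cpp
// Sketch of partition-style routing: one partition per core, single-owner access.
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

struct Partition {
    std::map<long, long> rows;                   // owned by a single thread -> no locks needed
};

class PartitionedStore {
    std::vector<Partition> parts_;
public:
    explicit PartitionedStore(std::size_t n) : parts_(n) {}
    std::size_t route(long key) const {          // decide which partition owns the key
        return std::hash<long>{}(key) % parts_.size();
    }
    void put(long key, long value) {             // executed by the partition's owner thread
        parts_[route(key)].rows[key] = value;
    }
};

int main() {
    PartitionedStore store(4);                   // e.g. one partition per core
    store.put(42, 7);
}
```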
Row vs Column Storage
Traditional disk‑based DBMSs choose row or column storage based on query patterns. In memory, the I/O cost gap between the two layouts largely disappears, so both row and column layouts are viable. Transactional (TP) workloads usually prefer row storage for full‑record reads, while analytical (AP) workloads benefit from columnar access. SAP HANA combines both: a row‑store front end and a column‑store back end.
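A simple illustration of the two layouts (field names are invented for the example): the row layout keeps each record contiguous, which suits reading whole records, while the column layout keeps each attribute contiguous, so an analytical scan over one attribute touches far fewer cache lines.

```cpp
// Sketch: row layout (array-of-structs) vs. column layout (struct-of-arrays).
#include <cstdint>
#include <vector>

struct RowRecord {                    // row store: one contiguous struct per record
    uint64_t id;
    int32_t  price;
    int32_t  quantity;
};
using RowTable = std::vector<RowRecord>;

struct ColumnTable {                  // column store: one contiguous vector per attribute
    std::vector<uint64_t> id;
    std::vector<int32_t>  price;
    std::vector<int32_t>  quantity;
};

// Analytical scan touches only the 'price' column -> fewer cache lines loaded.
long long sum_prices(const ColumnTable& t) {
    long long sum = 0;
    for (int32_t p : t.price) sum += p;
    return sum;
}

int main() {
    ColumnTable t;
    t.price = {10, 20, 30};
    return sum_prices(t) == 60 ? 0 : 1;
}
```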
Representative In‑Memory Systems
Hekaton
Hekaton is a Non‑Partition system that uses multi‑version concurrency control. Each record version carries a begin and an end timestamp. Tables may have up to eight indexes (hash or range). Record versions are stored non‑contiguously and linked via pointers.
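A rough sketch of such a version record, with assumed field names rather than Hekaton's actual definitions: the begin/end timestamps determine visibility, and one link per index chains versions that may live anywhere in memory.

```cpp
// Sketch of a multi-version record: timestamps plus per-index links.
#include <atomic>
#include <cstdint>
#include <string>

constexpr int kMaxIndexes = 8;          // up to eight indexes per table

struct RecordVersion {
    std::atomic<uint64_t> begin_ts;     // transaction/time that created this version
    std::atomic<uint64_t> end_ts;       // "infinity" until superseded or deleted
    RecordVersion* next[kMaxIndexes];   // one link per index: versions are not contiguous
    std::string payload;                // the actual column values
};

// A version is visible to a reader at time ts if begin_ts <= ts < end_ts.
bool visible(const RecordVersion& v, uint64_t ts) {
    return v.begin_ts.load() <= ts && ts < v.end_ts.load();
}
```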
H‑Store/VoltDB
H‑Store/VoltDB is a Partition System. Each partition is bound to a single‑threaded execution site and processes its transactions serially. A multi‑partition transaction must acquire all involved partitions before it can start. The architecture consists of a Transaction Coordinator (which coordinates cross‑partition execution) and Execution Engines that store data in a single‑version row layout.
HyPer
HyPer is a multi‑version Non‑Partition system designed for HTAP (Hybrid Transactional/Analytical Processing). It uses a Copy‑on‑Write snapshot mechanism: a forked OS process provides a read‑only snapshot for analytical queries while the original process continues OLTP updates. Only pages subsequently changed by the OLTP side are physically copied; unchanged pages remain shared through the virtual‑memory mapping.
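The idea can be sketched with plain POSIX fork(): the child process sees a frozen snapshot of the parent's address space, and the operating system copies a page only when the parent subsequently modifies it. This is a simplified illustration of the mechanism, not HyPer's actual code.

```cpp
// Sketch: fork-based copy-on-write snapshot for an analytical reader.
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
    std::vector<long> table = {1, 2, 3, 4};       // OLTP-maintained in-memory data

    pid_t pid = fork();                            // take a virtual-memory snapshot
    if (pid == 0) {                                // child: read-only analytical query
        long sum = 0;
        for (long v : table) sum += v;             // sees data as of the fork
        std::printf("snapshot sum = %ld\n", sum);
        _exit(0);
    }

    table[0] = 100;                                // parent keeps processing updates;
                                                   // only touched pages get copied
    waitpid(pid, nullptr, 0);
    return 0;
}
```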
SAP HANA
SAP HANA is a Non‑Partition hybrid system. Records flow through three stages: L1‑Delta (a row store for transaction processing), L2‑Delta (a column store with unsorted dictionary encoding), and the Main store (a highly compressed column store with a sorted dictionary). This pipeline provides a multi‑version view of each record.
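A condensed sketch of the dictionary‑encoding difference between the two column stages (structure and names are illustrative): the delta side appends to an unsorted dictionary so inserts stay cheap, while the main store's dictionary is rebuilt in sorted order during the merge, giving order‑preserving, highly compressible value IDs.

```cpp
// Sketch: unsorted (delta) vs. sorted (main) dictionary encoding.
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct UnsortedDict {                               // delta: append-only, cheap inserts
    std::vector<std::string> values;
    std::unordered_map<std::string, uint32_t> ids;
    uint32_t encode(const std::string& v) {
        auto it = ids.find(v);
        if (it != ids.end()) return it->second;
        uint32_t id = static_cast<uint32_t>(values.size());
        values.push_back(v);
        ids.emplace(v, id);
        return id;
    }
};

struct SortedDict {                                 // main: built once during the delta merge
    std::vector<std::string> values;                // kept sorted
    explicit SortedDict(std::vector<std::string> v) {
        std::sort(v.begin(), v.end());
        v.erase(std::unique(v.begin(), v.end()), v.end());
        values = std::move(v);
    }
    uint32_t encode(const std::string& v) const {   // binary search, order-preserving IDs
        return static_cast<uint32_t>(
            std::lower_bound(values.begin(), values.end(), v) - values.begin());
    }
};
```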
Index Techniques in In‑Memory DBMS
Because disk I/O is eliminated, index design focuses on cache‑awareness and multi‑core parallelism. Reducing memory‑stall cycles and improving pre‑fetch efficiency are the primary goals.
CSB+-Tree
CSB+-Tree keeps node sizes as multiples of a cache line and stores all children of a node contiguously in a Children Group, so a parent needs only a single child pointer; this reduces pointer overhead and improves sequential cache reads. During a split, the affected children group is re‑allocated as a new contiguous block, while the tree as a whole does not require contiguous memory.
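A minimal sketch of the node layout (sizes and member names are illustrative): a single pointer addresses the contiguous children group, so child i is found by offset arithmetic rather than through a stored pointer per child.

```cpp
// Sketch: CSB+-Tree-style inner node with one pointer to a contiguous children group.
#include <cstdint>

constexpr int kKeysPerNode = 14;        // chosen so a node fills whole cache lines

struct CSBNode {
    uint16_t  num_keys;
    int32_t   keys[kKeysPerNode];       // keys packed contiguously for sequential reads
    CSBNode*  children;                 // single pointer: child i lives at children[i]
};

// Descend one level: find the separator, then index into the contiguous group.
CSBNode* descend(const CSBNode& n, int32_t key) {
    int i = 0;
    while (i < n.num_keys && key >= n.keys[i]) ++i;
    return &n.children[i];              // children[0..num_keys] are allocated back-to-back
}
```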
PB+-Tree
PB+-Tree is a cache‑line‑aligned B+-Tree whose nodes carry pre‑fetch information so that all of a node's cache lines can be fetched ahead of the key search. Compared with a baseline B+-Tree, reported speedups are up to 1.5× for point searches and 6× for range scans.
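The core mechanism can be sketched as follows, assuming the GCC/Clang __builtin_prefetch intrinsic: every cache line of a node is prefetched before the key search reads it, hiding part of the memory latency behind useful work.

```cpp
// Sketch: prefetch all cache lines of a multi-line node before searching it.
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;

struct PBNode {                                      // node deliberately spans several cache lines
    alignas(kCacheLine) int32_t keys[60];
    uint16_t num_keys;
    PBNode*  children[61];
};

constexpr std::size_t kLinesPerNode =
    (sizeof(PBNode) + kCacheLine - 1) / kCacheLine;

// Issue a prefetch for every cache line of the node before the key search.
void prefetch_node(const PBNode* n) {
    const char* p = reinterpret_cast<const char*>(n);
    for (std::size_t i = 0; i < kLinesPerNode; ++i)
        __builtin_prefetch(p + i * kCacheLine);      // GCC/Clang intrinsic
}
```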
Bw‑Tree
Bw‑Tree, used by Hekaton, relies on atomic Compare‑and‑Swap operations instead of locks. Nodes are referenced via a Mapping Table that stores logical pointers; updates are prepended to a node as delta records, which readers apply during traversal and which are periodically consolidated into a new base node, yielding a latch‑free structure.
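A simplified sketch of the latch‑free update path (names and structure are illustrative): a logical node ID indexes the mapping table, and an update is installed by compare‑and‑swapping a new delta record onto the head of that node's chain; if another thread wins the race, the CAS fails and the update retries.

```cpp
// Sketch: Bw-Tree-style mapping table with CAS-installed delta records.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Delta {                          // describes one change (insert/delete/...)
    int64_t key;
    int64_t value;
    Delta*  next;                       // rest of the delta chain, ending at the base node
};

class MappingTable {
    std::vector<std::atomic<Delta*>> slots_;
public:
    explicit MappingTable(std::size_t n) : slots_(n) {}

    // Latch-free update: prepend a delta; retry if another thread won the race.
    void install(std::size_t node_id, int64_t key, int64_t value) {
        Delta* d = new Delta{key, value, nullptr};
        Delta* head = slots_[node_id].load();
        do {
            d->next = head;             // chain the current head behind our delta
        } while (!slots_[node_id].compare_exchange_weak(head, d));
    }

    Delta* read(std::size_t node_id) const {        // readers traverse the delta chain
        return slots_[node_id].load();
    }
};
```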
Adaptive Radix Tree
Adaptive Radix Tree (ART) uses node types of varying fan‑out (Node4, Node16, Node48, Node256) that match the number of child keys, keeping each node size a multiple of a cache line. Path compression removes single‑child nodes, further improving cache locality.
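Two of the node types can be sketched as follows (simplified, with illustrative member names): Node4 compares a handful of key bytes directly, while Node48 indirects through a 256‑entry child index; the real tree grows and shrinks nodes between such types as the number of children changes.

```cpp
// Sketch of two ART inner-node types and their child lookup.
#include <cstdint>

struct ARTNode { uint8_t type; };                    // common header

struct Node4 : ARTNode {                             // up to 4 children
    uint8_t  keys[4];                                // sorted key bytes
    ARTNode* children[4];
    uint8_t  count;
};

struct Node48 : ARTNode {                            // 17..48 children
    uint8_t  child_index[256];                       // key byte -> slot (255 = empty)
    ARTNode* children[48];
};

ARTNode* find_child_node4(const Node4& n, uint8_t key_byte) {
    for (int i = 0; i < n.count; ++i)
        if (n.keys[i] == key_byte) return n.children[i];
    return nullptr;
}

ARTNode* find_child_node48(const Node48& n, uint8_t key_byte) {
    uint8_t slot = n.child_index[key_byte];
    return slot == 255 ? nullptr : n.children[slot];
}
```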
OLFIT on B+-Tree
OLFIT (Optimistic Latch‑Free Index Traversal) eliminates latch overhead for readers by using per‑node version numbers. Writers still acquire a latch and increment the version; readers validate that the version has not changed during their access and retry otherwise. Because read‑only traversals never write to shared node state, they avoid the cache‑coherence invalidations that latch acquisition would cause across CPUs.
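A minimal sketch of the optimistic read path, assuming a per‑node version word whose low bit doubles as the writer latch: the reader copies what it needs, then re‑checks the version and retries if a writer intervened.

```cpp
// Sketch: OLFIT-style optimistic node read with version validation.
#include <atomic>
#include <cstdint>

struct OlfitNode {
    std::atomic<uint64_t> version{0};   // even = unlatched, odd = latched by a writer
    int32_t keys[32];
    int32_t num_keys{0};
};

// Optimistic read: copy what we need, then validate that no writer intervened.
bool read_first_key(const OlfitNode& n, int32_t* out) {
    for (;;) {
        uint64_t v1 = n.version.load(std::memory_order_acquire);
        if (v1 & 1) continue;                        // a writer holds the latch: retry
        int32_t cnt = n.num_keys;                    // read node contents without latching
        int32_t k   = cnt > 0 ? n.keys[0] : 0;
        uint64_t v2 = n.version.load(std::memory_order_acquire);
        if (v1 == v2) { *out = k; return cnt > 0; }  // unchanged: the read is consistent
        // version changed: the read raced a writer, retry
    }
}
```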
Skiplists
Skiplists, employed by MemSQL, organize data as a hierarchy of ordered linked lists in which each element of a level appears in the level above with probability P. Insertions use atomic Compare‑and‑Swap instead of locks, while searches descend the levels with an expected logarithmic cost similar to binary search.
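Two building blocks can be sketched as follows (a simplification that ignores deletions and the multi‑level search path): the probabilistic choice of a node's height, and a CAS‑based insert into one level's sorted list; a production skiplist repeats the CAS insert per level.

```cpp
// Sketch: skiplist level generation and a lock-free single-level insert.
#include <atomic>
#include <cstdint>
#include <random>

constexpr int    kMaxLevel = 16;
constexpr double kP        = 0.5;                    // promotion probability

struct SkipNode {
    int64_t key;
    int     level;
    std::atomic<SkipNode*> next[kMaxLevel];
};

int random_level(std::mt19937& rng) {                // geometric height distribution
    std::bernoulli_distribution promote(kP);
    int lvl = 1;
    while (lvl < kMaxLevel && promote(rng)) ++lvl;
    return lvl;
}

// Lock-free insert of `node` after the correct predecessor on `level`.
void insert_at_level(std::atomic<SkipNode*>& head, SkipNode* node, int level) {
    for (;;) {
        std::atomic<SkipNode*>* pred = &head;        // find pred/succ with
        SkipNode* succ = pred->load();               // pred->key < node->key <= succ->key
        while (succ && succ->key < node->key) {
            pred = &succ->next[level];
            succ = pred->load();
        }
        node->next[level].store(succ);
        if (pred->compare_exchange_weak(succ, node)) return;
        // CAS failed: another thread changed the list; retry from the head.
    }
}
```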
Conclusion
The article first described data organization in in‑memory databases, covering block layout, partitioning strategies, and row/column storage, and compared four real‑world systems. It then surveyed six index techniques, illustrating their design principles and performance benefits. The next part will examine concurrency control, durability, and query compilation in memory‑optimized DBMSs.