How PolarDB’s In-Memory Column Index Turns MySQL into a High‑Performance HTAP Engine
This article explores PolarDB MySQL’s In‑Memory Column Index (IMCI) technology, detailing its hybrid row‑column storage architecture, optimizer enhancements, parallel execution engine, and performance gains that enable real‑time analytical queries alongside OLTP workloads, and compares its benchmarks against MySQL and ClickHouse.
Introduction
Analytical databases have drawn strong interest from both investors and the tech community, driven by the growing demand for data‑driven growth and the evolution of cloud‑native technologies. PolarDB MySQL, originally designed for OLTP, now addresses real‑time analytical workloads with the In‑Memory Column Index (IMCI), which speeds up some complex queries by a factor of several hundred.
1 MySQL‑centric HTAP Solutions
1 Separate OLTP and OLAP Systems
Using two independent systems for OLTP and OLAP with data synchronization offers flexibility but introduces maintenance overhead, consistency challenges, and latency that hampers real‑time analysis.
2 Divergent Design with Multi‑Replica
NewSQL databases such as TiDB adopt a divergent design: one replica stores row data for OLTP, another replica stores columnar data (e.g., TiFlash) for OLAP, enabling a single system to serve both workloads.
3 Integrated Row‑Column Hybrid Storage
Commercial databases (Oracle, SQL Server, DB2) use hybrid storage that combines row and column formats, leveraging columnar I/O efficiency, compression, and CPU cache friendliness while retaining row‑based indexing for transactional workloads.
2 Evolution of PolarDB MySQL AP Capabilities
1 Limitations of MySQL in AP Scenarios
MySQL’s Volcano iterator model incurs deep function calls and poor CPU pipeline utilization.
Execution is largely serial; parallelism is limited to a few simple queries.
Row‑based storage leads to excessive I/O and memory traffic for analytical scans.
2 PolarDB Parallel Query Breakthrough
PolarDB’s Parallel Query framework automatically launches parallel execution when data volume exceeds a threshold, distributing data across threads and aggregating results, dramatically reducing query latency.
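The threshold-then-fan-out pattern described above can be sketched in a few lines of Python. This is only an illustration of the shape of the idea, not PolarDB's implementation: the names `PARALLEL_THRESHOLD`, `WORKERS`, `partial_sum`, and `query_sum` are hypothetical, and Python threads will not show a real CPU speedup here.

```python
from concurrent.futures import ThreadPoolExecutor

PARALLEL_THRESHOLD = 10_000  # hypothetical row-count cutoff for going parallel
WORKERS = 4                  # hypothetical degree of parallelism


def partial_sum(chunk):
    """Worker step: aggregate one slice of the data."""
    return sum(chunk)


def query_sum(values):
    """Sum values, switching to the parallel path only for large inputs."""
    if len(values) < PARALLEL_THRESHOLD:
        return sum(values)  # small input: stay on the serial path

    size = -(-len(values) // WORKERS)  # ceiling division: rows per worker
    chunks = [values[i:i + size] for i in range(0, len(values), size)]

    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        partials = list(pool.map(partial_sum, chunks))

    return sum(partials)  # leader step: merge the partial results
```

Both paths return the same answer; only the way the work is split differs, which is why the switch can be made automatically based on data volume.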
3 Why Column‑Store Is Needed
Columnar storage reads only required columns, achieves high compression (often >10×), and enables block‑level filtering, reducing I/O.
Columnar layout improves CPU cache usage and allows SIMD vectorization, boosting per‑core throughput.
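To make the I/O argument concrete, here is a small sketch that counts how many values each layout has to touch to compute one aggregate. The table contents and the function names `total_row_store` and `total_column_store` are invented for illustration; real engines work on pages and compressed blocks rather than Python lists.

```python
# The same four-row table in two layouts (sample data is made up).
rows = [
    {"id": 1, "region": "east", "amount": 10.0},
    {"id": 2, "region": "west", "amount": 20.0},
    {"id": 3, "region": "east", "amount": 30.0},
    {"id": 4, "region": "west", "amount": 40.0},
]
columns = {name: [r[name] for r in rows] for name in rows[0]}


def total_row_store(rows):
    """SUM(amount) over row layout: every column of every row is fetched."""
    total, touched = 0.0, 0
    for r in rows:
        touched += len(r)      # the whole row comes along with the one field we want
        total += r["amount"]
    return total, touched


def total_column_store(columns):
    """SUM(amount) over column layout: only the 'amount' column is read."""
    col = columns["amount"]
    return sum(col), len(col)
```

With three columns the row layout touches three times as many values for the same result; on wide analytical tables the gap grows with the number of unused columns.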
3 PolarDB In‑Memory Column Index
IMCI adds columnar storage and in‑memory computation to PolarDB, allowing a single database instance to handle both TP and AP workloads while preserving OLTP performance.
Key Technical Innovations
Support for columnar indexes on InnoDB tables; indexes are compressed and can reside in memory or on shared storage.
Rewritten column‑oriented execution engine that processes data in 4K batches, uses SIMD, and supports parallel operators.
Cost‑based optimizer that chooses between row‑store, column‑store, and parallel‑query plans.
RO nodes can be dedicated for analytical queries, isolating AP resources from TP workloads.
Hybrid Row‑Column Optimizer
The optimizer evaluates both row‑store and column‑store costs, applying a whitelist and cost model to decide execution mode, with fallback to row‑store when necessary.
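The decision flow above (whitelist check first, then a cost comparison, with row-store as the fallback) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the operator names in the whitelist, the `Plan` fields, the toy cost formulas, and the `COLUMN_STARTUP_COST` constant are not taken from PolarDB.

```python
from dataclasses import dataclass

# Hypothetical set of operators the column engine is assumed to support.
COLUMN_ENGINE_WHITELIST = {"scan", "filter", "join", "agg", "sort"}
COLUMN_STARTUP_COST = 1000  # hypothetical fixed overhead of the column path


@dataclass
class Plan:
    operators: list      # operator names used by the query
    rows_scanned: int    # estimated rows the query must read
    columns_needed: int  # columns the query actually references
    total_columns: int   # columns in the table


def row_cost(plan):
    """Toy cost: row-store reads every column of every scanned row."""
    return plan.rows_scanned * plan.total_columns


def column_cost(plan):
    """Toy cost: column-store reads only needed columns, plus startup overhead."""
    return plan.rows_scanned * plan.columns_needed + COLUMN_STARTUP_COST


def choose_engine(plan):
    """Return 'column' only if the plan is supported and estimated cheaper."""
    if not set(plan.operators) <= COLUMN_ENGINE_WHITELIST:
        return "row"  # unsupported operator: fall back to row-store
    return "column" if column_cost(plan) < row_cost(plan) else "row"
```

Under these toy formulas a point lookup stays on the row-store because the startup overhead dominates, while a large scan over a few columns moves to the column-store.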
Column‑Oriented Execution Engine
IMCI adopts a batch‑oriented Volcano model, where each operator processes a batch of rows, enabling parallelism and SIMD acceleration for operators such as Scan, Join, and Agg.
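A batch-oriented Volcano pipeline can be sketched with Python generators: each operator pulls a whole batch from its child instead of a single row, so the per-call overhead is paid once per batch. The batch size of 4096 rows is an assumption about what "4K" means here, and the operator names `scan`, `filter_gt`, and `agg_sum` are hypothetical.

```python
BATCH_SIZE = 4096  # assumed batch size in rows


def scan(column, batch_size=BATCH_SIZE):
    """Scan operator: yield the column one batch at a time."""
    for start in range(0, len(column), batch_size):
        yield column[start:start + batch_size]


def filter_gt(batches, threshold):
    """Filter operator: keep values greater than threshold, batch by batch."""
    for batch in batches:
        selected = [v for v in batch if v > threshold]
        if selected:  # skip batches where nothing qualifies
            yield selected


def agg_sum(batches):
    """Aggregation operator: consume all batches and return their total."""
    total = 0
    for batch in batches:
        total += sum(batch)
    return total


# Usage: SELECT SUM(x) WHERE x > 9000 over a 10,000-row column.
data = list(range(10_000))
result = agg_sum(filter_gt(scan(data), 9000))
```

In a real engine the inner loop over a batch is where SIMD instructions apply, and independent batches are a natural unit for distributing work across threads.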
Column Index as a Secondary Index
Columnar indexes are implemented as secondary indexes in InnoDB, reusing transaction, redo‑log, and replication mechanisms, and allowing DDL to add or drop columnar attributes on tables or columns.
Data Organization
Data is stored in unordered, append‑only RowGroups composed of column‑wise DataPacks.
Active RowGroups accept writes; once full they are frozen, compressed, and written to disk with statistics for pruning.
Updates are handled via delete‑marks and append‑only writes, preserving transactional consistency.
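The write path described in the list above can be modeled with a toy structure. This sketch assumes a primary-key-to-row-id map (`locator`) to find the row version to mark as deleted, which is a hypothetical detail, and it ignores compression, disk writes, and transactional visibility entirely. The class name `ColumnIndex` and the tiny `rows_per_group` default are also invented.

```python
class ColumnIndex:
    """Toy append-only column index made of row groups and per-column packs."""

    def __init__(self, column_names, rows_per_group=4):
        self.column_names = column_names
        self.rows_per_group = rows_per_group
        self.groups = []      # each group: {"frozen": bool, "packs": {col: [values]}}
        self.deleted = set()  # row ids carrying a delete mark
        self.locator = {}     # hypothetical map: primary key -> live row id
        self._new_group()

    def _new_group(self):
        packs = {name: [] for name in self.column_names}
        self.groups.append({"frozen": False, "packs": packs})

    def insert(self, pk, row):
        active = self.groups[-1]
        first = self.column_names[0]
        if len(active["packs"][first]) == self.rows_per_group:
            active["frozen"] = True  # a real system would compress and flush here
            self._new_group()
            active = self.groups[-1]
        for name in self.column_names:
            active["packs"][name].append(row[name])
        offset = len(active["packs"][first]) - 1
        self.locator[pk] = (len(self.groups) - 1) * self.rows_per_group + offset

    def delete(self, pk):
        self.deleted.add(self.locator.pop(pk))  # mark only, data stays in place

    def update(self, pk, row):
        self.delete(pk)       # delete-mark the old version
        self.insert(pk, row)  # append the new version

    def scan(self, name):
        """Yield live values of one column, skipping delete-marked rows."""
        for g, group in enumerate(self.groups):
            for off, value in enumerate(group["packs"][name]):
                if g * self.rows_per_group + off not in self.deleted:
                    yield value
```

Because an update never rewrites existing data, frozen row groups stay immutable, which is what makes them safe to compress and to describe with fixed statistics.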
Rough Index Using Statistics
Each DataPack records min/max, sum, null count, and row count; the optimizer uses these statistics to prune irrelevant packs and even answer some aggregates without full scans.
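The pruning logic can be illustrated with a short sketch. The statistics kept per pack follow the list above (min/max, sum, null count, row count); the function names `build_pack`, `count_greater_than`, and `total_sum` and the dictionary layout are hypothetical.

```python
def build_pack(values):
    """Build a pack with the per-pack statistics described above."""
    non_null = [v for v in values if v is not None]
    return {
        "values": values,
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "sum": sum(non_null),
        "null_count": len(values) - len(non_null),
        "row_count": len(values),
    }


def count_greater_than(packs, threshold):
    """COUNT(*) WHERE v > threshold; returns (matches, packs actually scanned)."""
    matched, scanned = 0, 0
    for pack in packs:
        if pack["max"] is None or pack["max"] <= threshold:
            continue  # no value can match: prune the pack
        if pack["min"] > threshold:
            # every non-null value matches: answer from statistics alone
            matched += pack["row_count"] - pack["null_count"]
            continue
        scanned += 1  # statistics are inconclusive: scan the values
        matched += sum(1 for v in pack["values"] if v is not None and v > threshold)
    return matched, scanned


def total_sum(packs):
    """SUM(v) answered purely from per-pack statistics, with no scan."""
    return sum(pack["sum"] for pack in packs)


packs = [build_pack([1, 2, 3]), build_pack([10, None, 12]), build_pack([4, 20, 6])]
```

For a threshold of 5, the first pack is pruned, the second is counted from its statistics, and only the third has to be scanned.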
Resource Isolation for TP and AP
RW instance with hybrid storage for light AP queries.
Separate AP‑only RO node with dedicated CPU/memory for heavy analytical workloads.
Standalone standby node with independent shared storage for full CPU, memory, and I/O isolation.
4 Performance Evaluation
TPC‑H benchmarks (100 GB, 22 queries) show IMCI running tens to hundreds of times faster than native MySQL serial execution, with some queries up to 400× faster, and delivering performance comparable to ClickHouse.
Future Work
Automated index recommendation based on query patterns.
Standalone columnar tables and OSS storage to further reduce costs.
Mixed row‑column execution where parts of a plan run on row‑store and parts on column‑store.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
