
How PolarDB’s In-Memory Column Index Turns MySQL into a High‑Performance HTAP Engine

This article explores the In‑Memory Column Index (IMCI) in PolarDB MySQL: its hybrid row‑column storage architecture, optimizer enhancements, and parallel execution engine, which together enable real‑time analytical queries alongside OLTP workloads. It also compares IMCI’s benchmark results with those of MySQL and ClickHouse.

Alibaba Cloud Developer

Oct 25, 2021


Introduction

Analytical databases are attracting strong interest from both capital markets and the tech community, driven by growing demand for data‑driven growth and by the evolution of cloud‑native technologies. PolarDB MySQL, originally designed for OLTP, now addresses real‑time analytical workloads with the In‑Memory Column Index (IMCI), which speeds up complex queries by up to several hundred times.

1 MySQL‑centric HTAP Solutions

1 Separate OLTP and OLAP Systems

Using two independent systems for OLTP and OLAP with data synchronization offers flexibility but introduces maintenance overhead, consistency challenges, and latency that hampers real‑time analysis.

2 Divergent Design with Multi‑Replica

NewSQL databases such as TiDB adopt a divergent design: one replica stores row data for OLTP, another replica stores columnar data (e.g., TiFlash) for OLAP, enabling a single system to serve both workloads.

3 Integrated Row‑Column Hybrid Storage

Commercial databases (Oracle, SQL Server, DB2) use hybrid storage that combines row and column formats, leveraging columnar I/O efficiency, compression, and CPU cache friendliness while retaining row‑based indexing for transactional workloads.

2 Evolution of PolarDB MySQL AP Capabilities

1 Limitations of MySQL in AP Scenarios

MySQL’s Volcano iterator model incurs deep function calls and poor CPU pipeline utilization.

Execution is largely serial; parallelism is limited to a few simple queries.

Row‑based storage leads to excessive I/O and memory traffic for analytical scans.

2 PolarDB Parallel Query Breakthrough

PolarDB’s Parallel Query framework automatically launches parallel execution when data volume exceeds a threshold, distributing data across threads and aggregating results, dramatically reducing query latency.

3 Why Column‑Store Is Needed

Columnar storage reads only required columns, achieves high compression (often >10×), and enables block‑level filtering, reducing I/O.

Columnar layout improves CPU cache usage and allows SIMD vectorization, boosting per‑core throughput.
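A toy comparison makes the difference between the two layouts concrete. The table, column names, and helper functions below are invented for illustration; a real engine would also compress the column arrays and process them with SIMD instructions.

```python
from array import array

# Row layout: each record keeps all of its fields together.
rows = [
    (1, "alice", 30.0),
    (2, "bob", 12.5),
    (3, "carol", 7.5),
]

# Column layout: one contiguous array per column.
columns = {
    "id": array("q", [r[0] for r in rows]),
    "name": [r[1] for r in rows],
    "amount": array("d", [r[2] for r in rows]),
}


def total_amount_row_store(rows):
    """Walks every full record just to read one field from each."""
    return sum(record[2] for record in rows)


def total_amount_column_store(columns):
    """Reads only the 'amount' array; the other columns are never touched."""
    return sum(columns["amount"])
```

Both functions return the same total, but the column version reads a single densely packed array of doubles, which is the access pattern that favors caches and vectorized instructions.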

3 PolarDB In‑Memory Column Index

IMCI adds columnar storage and in‑memory computation to PolarDB, allowing a single database instance to handle both TP and AP workloads while preserving OLTP performance.

Key Technical Innovations

Support for columnar indexes on InnoDB tables; indexes are compressed and can reside in memory or on shared storage.

Rewritten column‑oriented execution engine that processes data in 4K batches, uses SIMD, and supports parallel operators.

Cost‑based optimizer that chooses between row‑store, column‑store, and parallel‑query plans.

RO nodes can be dedicated for analytical queries, isolating AP resources from TP workloads.

Hybrid Row‑Column Optimizer

The optimizer evaluates both row‑store and column‑store costs, applying a whitelist and cost model to decide execution mode, with fallback to row‑store when necessary.
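As a rough sketch of that decision, assuming the whitelist is a set of operators the column engine supports and that the two costs have already been estimated: the operator names, the `PlanEstimate` fields, and `choose_engine` are all hypothetical, and the real cost model is not described in this article.

```python
from dataclasses import dataclass

# Hypothetical whitelist of operators the column engine can execute.
COLUMN_ENGINE_WHITELIST = {"scan", "filter", "join", "agg"}


@dataclass
class PlanEstimate:
    operators: set          # operators the query plan needs
    row_cost: float         # estimated cost on the row store
    column_cost: float      # estimated cost on the column index
    has_column_index: bool  # whether a column index covers the query


def choose_engine(plan: PlanEstimate) -> str:
    """Return 'column' or 'row' for a plan, defaulting to the row store."""
    if not plan.has_column_index:
        return "row"  # nothing to run the column plan against
    if not plan.operators <= COLUMN_ENGINE_WHITELIST:
        return "row"  # unsupported operator: fall back to the row store
    # Both engines are eligible, so pick the cheaper estimate.
    return "column" if plan.column_cost < plan.row_cost else "row"
```

A point lookup with a low row-store cost stays on the row store, while a large scan-and-aggregate plan with a lower column cost is routed to the column engine.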

Column‑Oriented Execution Engine

IMCI adopts a batch‑oriented Volcano model, where each operator processes a batch of rows, enabling parallelism and SIMD acceleration for operators such as Scan, Join, and Agg.
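The batch-at-a-time idea can be sketched with Python generators: each operator pulls a batch from its child, handles the whole batch in one call, and passes a batch upward. The operator names are hypothetical, and `BATCH_SIZE` is set to 4 only to keep the example readable, where the article describes 4K batches.

```python
BATCH_SIZE = 4  # tiny for illustration only


def scan(column, batch_size=BATCH_SIZE):
    """Leaf operator: yield the column one batch at a time."""
    for start in range(0, len(column), batch_size):
        yield column[start:start + batch_size]


def filter_gt(batches, threshold):
    """Filter operator: one call handles a whole batch, not a single row."""
    for batch in batches:
        selected = [v for v in batch if v > threshold]
        if selected:
            yield selected


def agg_sum(batches):
    """Aggregate operator: consume batches and return a single total."""
    total = 0
    for batch in batches:
        total += sum(batch)
    return total
```

For example, `agg_sum(filter_gt(scan(list(range(10))), 5))` evaluates the pipeline over three batches. Compared with a row-at-a-time iterator, the per-call overhead is paid once per batch, and the inner loops over a batch are the part a native engine can vectorize.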

Column Index as a Secondary Index

Columnar indexes are implemented as secondary indexes in InnoDB, reusing transaction, redo‑log, and replication mechanisms, and allowing DDL to add or drop columnar attributes on tables or columns.

Data Organization

Data is stored in unordered, append‑only RowGroups composed of column‑wise DataPacks.

Active RowGroups accept writes; once full they are frozen, compressed, and written to disk with statistics for pruning.

Updates are handled via delete‑marks and append‑only writes, preserving transactional consistency.
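The write path described in this list can be modeled with a small in-memory structure. This is a simplified sketch: the class names, the `locator` map from primary key to row position, and the tiny capacity are assumptions made for illustration, and compression, statistics, and disk writes are omitted.

```python
class RowGroup:
    """Append-only group of rows stored as one list (DataPack) per column."""

    def __init__(self, columns, capacity):
        self.capacity = capacity
        self.packs = {name: [] for name in columns}
        self.deleted = set()  # delete marks, stored as row offsets
        self.frozen = False
        self.size = 0

    def append(self, row):
        """Append one row and return its offset; freeze the group when full."""
        for name, pack in self.packs.items():
            pack.append(row[name])
        self.size += 1
        if self.size == self.capacity:
            self.frozen = True  # a real engine would compress and flush here
        return self.size - 1


class ColumnIndex:
    def __init__(self, columns, capacity=2):
        self.columns = columns
        self.capacity = capacity
        self.groups = [RowGroup(columns, capacity)]
        self.locator = {}  # hypothetical map: primary key -> (group, offset)

    def insert(self, pk, row):
        if self.groups[-1].frozen:
            self.groups.append(RowGroup(self.columns, self.capacity))
        offset = self.groups[-1].append(row)
        self.locator[pk] = (len(self.groups) - 1, offset)

    def delete(self, pk):
        group_no, offset = self.locator.pop(pk)
        self.groups[group_no].deleted.add(offset)  # mark, never rewrite

    def update(self, pk, row):
        self.delete(pk)       # delete-mark the old version
        self.insert(pk, row)  # append the new version

    def scan(self, column):
        """Yield live values of one column, skipping delete-marked rows."""
        for group in self.groups:
            for offset, value in enumerate(group.packs[column]):
                if offset not in group.deleted:
                    yield value
```

Because an update never modifies a frozen RowGroup in place, frozen data can stay compressed and immutable, and only the small set of delete marks changes.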

Rough Index Using Statistics

Each DataPack records min/max, sum, null count, and row count; the optimizer uses these statistics to prune irrelevant packs and even answer some aggregates without full scans.
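A minimal sketch of this pruning, for a query of the form SUM(v) WHERE v >= lower: the `PackStats` fields mirror the statistics listed above, while the function names and the in-memory representation of a pack are assumptions for illustration. The sketch assumes every pack holds at least one non-null value.

```python
from dataclasses import dataclass


@dataclass
class PackStats:
    min_value: int
    max_value: int
    total: int       # sum of the non-null values
    null_count: int
    row_count: int


def build_stats(values):
    """Compute per-pack statistics (assumes at least one non-null value)."""
    present = [v for v in values if v is not None]
    return PackStats(min(present), max(present), sum(present),
                     len(values) - len(present), len(values))


def sum_where_ge(packs, lower):
    """Return (sum of values >= lower, number of packs actually scanned)."""
    total, scanned = 0, 0
    for values, stats in packs:
        if stats.max_value < lower:
            continue  # pruned: no row in this pack can match
        if stats.min_value >= lower:
            total += stats.total  # every non-null row matches: use the stats
            continue
        scanned += 1  # range straddles the bound: scan the pack
        total += sum(v for v in values if v is not None and v >= lower)
    return total, scanned


data = [[1, 2, 3], [10, None, 12], [4, 20, 6]]
packs = [(values, build_stats(values)) for values in data]
```

With `lower=10`, the first pack is skipped outright, the second is answered from its stored sum, and only the third is scanned.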

Resource Isolation for TP and AP

RW instance with hybrid storage for light AP queries.

Separate AP‑only RO node with dedicated CPU/memory for heavy analytical workloads.

Standalone standby node with independent shared storage for full CPU, memory, and I/O isolation.

4 Performance Evaluation

On the TPC‑H benchmark (100 GB, 22 queries), IMCI runs tens to hundreds of times faster than native MySQL serial execution, with some queries up to 400× faster, and delivers performance comparable to ClickHouse.

Future Work

Automated index recommendation based on query patterns.

Standalone columnar tables and OSS storage to further reduce costs.

Mixed row‑column execution where parts of a plan run on row‑store and parts on column‑store.

