HiStore: A High‑Compression Columnar Database for Massive OLAP Workloads

HiStore is a columnar database developed by Alibaba's middleware team for massive OLAP workloads. It offers high compression ratios and low storage and maintenance costs, supports ad‑hoc multi‑dimensional queries, knowledge‑grid optimization, and efficient data loading, and provides features such as approximate queries and high‑availability clustering.

Architecture Digest

Jul 3, 2016

Overview

HiStore is a column‑oriented database created by Alibaba's middleware team. It targets massive‑data scenarios, offering high compression (more than 10:1 on average, up to 40:1), low storage and maintenance costs, and fast OLAP query performance on commodity x86 servers.

Key Technical Features

Supports ad‑hoc multi‑dimensional aggregation queries on billions of rows with second‑level response times.

Fully compatible with the MySQL protocol and SQL syntax, so existing MySQL tools can be carried over without changes.

Iterative development has added bulk data loading, concurrent queries, and data‑block replication; HiStore is reported to outperform peers such as InfiniDB and Infobright.

Low storage cost through column‑wise high‑ratio compression.

Minimal migration effort; no additional dependencies.

Applicable Scenarios

Log and event management systems.

Data warehouses and data marts requiring low‑cost storage and high‑performance ingestion.

Real‑time statistical analysis for decision making.

Large‑scale analytics for mobile app data, marketing, and advertising.

IoT sensor data collection and processing.

Enterprise OLAP applications such as reporting, BI, and decision support.

Comparable Products

Infobright, InfiniDB, Pivotal Greenplum, Amazon Redshift, Teradata, HP Vertica, SAP HANA, IBM Netezza, Huawei GaussDB, DM7, etc.

Architecture

Engine

HiStore uses a knowledge‑grid‑based, SMP‑optimized columnar engine designed for analytical workloads. Data is stored in fixed‑size data blocks, referred to as data nodes (DN), and organized via a Knowledge Grid (KG) that drives query planning, compression, and execution.

Column‑Based Storage vs. Row‑Based Storage

Traditional row‑oriented engines store entire records together, leading to high I/O for ad‑hoc queries. HiStore stores each column separately, allowing queries to read only the needed columns, dramatically reducing I/O and improving response time for large‑scale analytics.
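
The I/O difference can be illustrated with a toy in‑memory model. This is only a sketch, not HiStore's storage format: the table, its column names, and the "values touched" counter are hypothetical stand‑ins for records and disk reads.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The table contents and column names are hypothetical.

rows = [
    {"id": 1, "name": "alice", "dept": "ops", "salary": 2400},
    {"id": 2, "name": "bob", "dept": "dev", "salary": 3100},
    {"id": 3, "name": "carol", "dept": "dev", "salary": 2900},
]

# Column layout: one list per column, values kept in row order.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def avg_salary_row_store(table):
    """Row layout: every field of every record is touched."""
    values_touched = 0
    total = 0
    for record in table:
        values_touched += len(record)  # the whole record is read
        total += record["salary"]
    return total / len(table), values_touched


def avg_salary_column_store(cols):
    """Column layout: only the 'salary' column is touched."""
    salary = cols["salary"]
    return sum(salary) / len(salary), len(salary)


row_avg, row_touched = avg_salary_row_store(rows)
col_avg, col_touched = avg_salary_column_store(columns)
print(row_avg, row_touched)
print(col_avg, col_touched)
```

Both layouts return the same average, but the row layout touches every field of every record while the column layout touches one value per record. The gap widens as the table gains columns that the query does not need.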

Knowledge Grid (KG)

The KG consists of Metadata Nodes (MD) and Knowledge Nodes (KN). MD stores aggregate statistics (MIN, MAX, SUM, COUNT, null flags) for each data block, enabling many queries to be answered without reading raw data. KN holds column type, range bitmaps, and other statistics that guide block selection and compression.
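
As a rough illustration of what a Metadata Node might hold, the sketch below computes per‑block statistics for one column and answers whole‑column aggregates from them. The class name, field names, and the tiny block size are assumptions made for the example; the article does not describe HiStore's actual structures.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlockMetadata:
    """Hypothetical per-block statistics, modeled on the MD description."""
    minimum: Optional[int]
    maximum: Optional[int]
    total: int
    count: int        # number of non-null values
    has_nulls: bool


def summarize_block(block: List[Optional[int]]) -> BlockMetadata:
    present = [v for v in block if v is not None]
    return BlockMetadata(
        minimum=min(present) if present else None,
        maximum=max(present) if present else None,
        total=sum(present),
        count=len(present),
        has_nulls=len(present) < len(block),
    )


def build_metadata(column: List[Optional[int]], block_size: int) -> List[BlockMetadata]:
    """Split a column into fixed-size blocks and summarize each one."""
    return [
        summarize_block(column[i:i + block_size])
        for i in range(0, len(column), block_size)
    ]


salaries = [1800, 2200, None, 2400, 2600, 3100, 4000, 5200]
metadata = build_metadata(salaries, block_size=4)

# Whole-column aggregates computed from metadata only, without the raw values.
column_max = max(md.maximum for md in metadata if md.maximum is not None)
column_count = sum(md.count for md in metadata)
column_sum = sum(md.total for md in metadata)
print(column_max, column_count, column_sum)
```

Because MAX, COUNT, and SUM over the column are combinations of the per‑block values, none of the raw data has to be decompressed to answer them.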

Compute Engine

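The optimizer uses KG information to build a rough set of relevant data nodes, avoiding unrelated blocks. If a query can be satisfied from MD alone (e.g., COUNT, MAX), the engine returns results without accessing physical data. For a filtered count such as the one below, only blocks whose MIN/MAX range straddles the predicate value need to be read; all other blocks are either counted in full from MD or skipped.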

SELECT count(*) FROM employees WHERE salary < 2500
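
One way such a query could be evaluated against block statistics is sketched below. It is a simplified model under stated assumptions: the `Block` type and function names are hypothetical, NULL values are ignored, and the `values` list stands in for a compressed block that would have to be read from disk.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    minimum: int
    maximum: int
    count: int
    values: List[int]  # stands in for the compressed block on disk


def make_block(values: List[int]) -> Block:
    return Block(min(values), max(values), len(values), values)


def count_less_than(blocks: List[Block], threshold: int) -> Tuple[int, int]:
    """Count values below threshold, reading raw data only when necessary."""
    matched = 0
    blocks_read = 0
    for block in blocks:
        if block.maximum < threshold:
            # Fully relevant: every value qualifies, use the stored count.
            matched += block.count
        elif block.minimum >= threshold:
            # Irrelevant: no value can qualify, skip the block.
            continue
        else:
            # Suspect: the range straddles the threshold, so scan the block.
            blocks_read += 1
            matched += sum(1 for v in block.values if v < threshold)
    return matched, blocks_read


blocks = [
    make_block([1200, 1800, 2400]),  # max < 2500: fully relevant
    make_block([2000, 2600, 3000]),  # straddles 2500: suspect
    make_block([2500, 4000, 5200]),  # min >= 2500: irrelevant
]
result, blocks_read = count_less_than(blocks, 2500)
print(result, blocks_read)
```

In this example only one of the three blocks is actually scanned; the other two are resolved from their statistics.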

Execution Engine

Parses execution plans, manages I/O thread pools, and handles memory allocation.

Supports transaction logging, SMP‑based concurrent queries, and physical file management.

High‑Efficiency Compression

Compression is column‑type aware: PPM (prediction by partial matching) for strings, predictive range coding for numeric types, and custom algorithms for specific patterns (e.g., IP addresses, URLs). Fixed‑size data blocks (≈128 KB) maximize compression throughput and reduce storage.
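
The benefit of a type‑aware, predictive scheme for numeric columns can be sketched with standard‑library tools. The example below is not HiStore's algorithm: a simple "previous value" delta predictor and `zlib` are used as stand‑ins for the predictive range coder, whose details the article does not give. The timestamp column is made‑up sample data.

```python
import struct
import zlib
from typing import List


def pack(values: List[int]) -> bytes:
    """Serialize integers as little-endian signed 64-bit values."""
    return struct.pack(f"<{len(values)}q", *values)


def compress_raw(values: List[int]) -> bytes:
    """Generic compression with no knowledge of the column type."""
    return zlib.compress(pack(values))


def compress_delta(values: List[int]) -> bytes:
    """Predict each value as the previous one and compress the residuals."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    return zlib.compress(pack(deltas))


def decompress_delta(blob: bytes) -> List[int]:
    data = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(data) // 8}q", data)
    values, current = [], 0
    for delta in deltas:
        current += delta  # undo the prediction step
        values.append(current)
    return values


# Hypothetical column: event timestamps arriving every 5 seconds.
timestamps = [1467504000 + 5 * i for i in range(10000)]

raw_size = len(compress_raw(timestamps))
delta_size = len(compress_delta(timestamps))
print(raw_size, delta_size)
```

Because the residuals of a regular sequence are nearly constant, the delta‑encoded column compresses far better than the raw bytes, which is the general idea behind choosing an encoder per column type.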

Data Import and Pre‑Processing

HiStore provides an external import client that preprocesses heterogeneous sources (HBase, relational databases, etc.) outside the engine, performing compression and KG construction before loading. This allows ingestion rates of up to 2 TB/hour.

Approximate Query

For workloads tolerant of minor inaccuracies (e.g., top‑N queries), KG statistics enable approximate query processing that skips irrelevant data blocks, further accelerating response.
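
A minimal sketch of how block statistics alone could bound an answer is shown below. It assumes that each block's MIN and MAX are actual values in the block and that there are no NULLs; the `BlockStats` type, the function name, and the midpoint estimate are illustrative choices, not HiStore's documented behavior.

```python
from typing import List, NamedTuple, Tuple


class BlockStats(NamedTuple):
    minimum: int
    maximum: int
    count: int


def approximate_count_less_than(
    stats: List[BlockStats], threshold: int
) -> Tuple[int, int]:
    """Return (lower, upper) bounds on COUNT(value < threshold) from stats only."""
    lower = 0
    upper = 0
    for s in stats:
        if s.maximum < threshold:
            # Every value in the block qualifies.
            lower += s.count
            upper += s.count
        elif s.minimum < threshold:
            # Suspect block: the minimum qualifies, the maximum does not.
            lower += 1
            upper += s.count - 1
        # Otherwise no value qualifies and the block contributes nothing.
    return lower, upper


stats = [
    BlockStats(1200, 2400, 1000),  # fully relevant
    BlockStats(2000, 3000, 1000),  # suspect
    BlockStats(2500, 5200, 1000),  # irrelevant
]
lower, upper = approximate_count_less_than(stats, 2500)
estimate = (lower + upper) // 2  # one simple point estimate
print(lower, upper, estimate)
```

No data block is read at all here. The width of the interval depends on how many blocks are suspect, so data that is clustered on the filtered column yields tighter answers.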

Future Roadmap

Hybrid engine combining a row engine for hot data and a column engine for cold data.

Automatic data integrity verification.

Support for external import from HDFS/ODPS.

Online KG management and rebuilding tools.

Conclusion

HiStore delivers a low‑cost, low‑maintenance OLAP storage engine with high compression, knowledge‑grid‑driven query optimization, and scalability for petabyte‑scale analytics, positioning it as a competitive columnar solution in the big‑data ecosystem.
