HiStore: A High‑Compression Columnar Database for Massive OLAP Workloads
HiStore is a columnar database developed by Alibaba's middleware team for massive OLAP workloads. It combines high compression ratios with low storage and maintenance costs, and supports ad‑hoc multi‑dimensional queries, knowledge‑grid optimization, efficient data loading, approximate queries, and high‑availability clustering.
Overview
HiStore is a column‑oriented database created by Alibaba's middleware team. It targets massive‑data scenarios, offering high compression (more than 10:1 on average, up to 40:1), low storage and maintenance costs, and fast OLAP query performance on commodity x86 servers.
Key Technical Features
Supports ad‑hoc multi‑dimensional aggregation queries on billions of rows with second‑level response times.
Fully compatible with MySQL protocol and SQL syntax, allowing seamless migration of existing MySQL tools.
Iterative development has added bulk data loading, concurrent queries, and data block replication; HiStore is reported to outperform peers such as InfiniDB and Infobright.
Low storage cost through column‑wise high‑ratio compression.
Minimal migration effort; no additional dependencies.
Applicable Scenarios
Log and event management systems.
Data warehouses and data marts requiring low‑cost storage and high‑performance ingestion.
Real‑time statistical analysis for decision making.
Large‑scale analytics for mobile app data, marketing, and advertising.
IoT sensor data collection and processing.
Enterprise OLAP applications such as reporting, BI, and decision support.
Comparable Products
Infobright, InfiniDB, Pivotal Greenplum, Amazon Redshift, Teradata, HP Vertica, SAP HANA, IBM Netezza, Huawei GaussDB, DM7, etc.
Architecture
Engine
HiStore uses a columnar engine built around a knowledge grid and optimized for SMP hardware, designed for analytical workloads. Data is stored in fixed‑size data blocks, referred to as data nodes (DN), and organized via a Knowledge Grid (KG) that drives query planning, compression, and execution.
Column‑Based Storage vs. Row‑Based Storage
Traditional row‑oriented engines store entire records together, leading to high I/O for ad‑hoc queries. HiStore stores each column separately, allowing queries to read only the needed columns, dramatically reducing I/O and improving response time for large‑scale analytics.
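The I/O difference can be illustrated with a minimal sketch. The table, column names, and helper functions below are hypothetical and are not taken from HiStore; the sketch only counts how many stored values each layout has to touch to compute an average over one column.

```python
# Hypothetical three-column table, first as whole records (row layout).
rows = [
    (1, "alice", 2400),
    (2, "bob", 3100),
    (3, "carol", 1900),
]

# The same data in a column layout: one contiguous list per column.
columns = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "salary": [r[2] for r in rows],
}


def avg_salary_row_store(records):
    """Scan whole records; every field is touched to read one attribute."""
    values_touched = 0
    total = 0
    for record in records:
        values_touched += len(record)
        total += record[2]
    return total / len(records), values_touched


def avg_salary_column_store(cols):
    """Read only the salary column; other columns are never touched."""
    values = cols["salary"]
    return sum(values) / len(values), len(values)


row_avg, row_touched = avg_salary_row_store(rows)
col_avg, col_touched = avg_salary_column_store(columns)
print(row_touched, col_touched)  # 9 3
```

Both layouts return the same average, but the row layout touches nine values and the column layout touches three. On a real table with dozens of columns and billions of rows, the gap grows with the number of columns the query does not need.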
Knowledge Grid (KG)
The KG consists of Metadata Nodes (MD) and Knowledge Nodes (KN). MD stores aggregate statistics (MIN, MAX, SUM, COUNT, null flags) for each data block, enabling many queries to be answered without reading raw data. KN holds column type, range bitmaps, and other statistics that guide block selection and compression.
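As a rough illustration of the per‑block statistics described above, the following sketch builds one metadata record per block of a column. The class name, field names, and block size are assumptions made for this example; HiStore's actual MD layout is not documented here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlockMetadata:
    """Hypothetical stand-in for one Metadata Node (MD)."""
    minimum: Optional[int]
    maximum: Optional[int]
    total: int       # SUM over non-null values
    count: int       # number of non-null values
    null_count: int  # number of NULLs in the block


def split_into_blocks(column: List[Optional[int]], rows_per_block: int):
    """Cut a column into consecutive fixed-size blocks."""
    return [column[i:i + rows_per_block]
            for i in range(0, len(column), rows_per_block)]


def build_metadata(block: List[Optional[int]]) -> BlockMetadata:
    """Compute the aggregate statistics for a single block."""
    values = [v for v in block if v is not None]
    return BlockMetadata(
        minimum=min(values) if values else None,
        maximum=max(values) if values else None,
        total=sum(values),
        count=len(values),
        null_count=len(block) - len(values),
    )


salaries = [1800, 2200, None, 2400, 2600, 3000, 3500, None]
metadata = [build_metadata(b) for b in split_into_blocks(salaries, 4)]

# MAX(salary) and SUM(salary) answered from metadata alone.
column_max = max(m.maximum for m in metadata if m.maximum is not None)
column_sum = sum(m.total for m in metadata)
print(column_max, column_sum)  # 3500 15500
```

Once the metadata exists, whole‑column aggregates such as MAX and SUM only need one small record per block rather than the block contents.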
Compute Engine
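The optimizer uses KG information to build a rough set of candidate data nodes and skips blocks that cannot contain matching rows. If a query can be answered from MD alone (e.g., COUNT, MAX), the engine returns the result without reading physical data. For the query below, a block whose maximum salary is under 2500 can be counted from MD, a block whose minimum salary is 2500 or more can be skipped, and only blocks that straddle 2500 need to be read: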
SELECT count(*) FROM employees WHERE salary < 2500
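The block selection for this query can be sketched as follows. This is a simplified model, assuming each block carries only MIN and MAX and contains no NULLs; the three‑way split into skipped, fully matching, and partially matching blocks is this sketch's own naming, not HiStore's documented terminology.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    values: List[int]  # raw data, read only when metadata cannot decide
    minimum: int
    maximum: int


def make_block(values: List[int]) -> Block:
    return Block(values, min(values), max(values))


def count_less_than(blocks: List[Block], threshold: int):
    """Return (row count, number of blocks whose raw data was read)."""
    count = 0
    blocks_read = 0
    for block in blocks:
        if block.minimum >= threshold:
            continue  # no row can match: skip the block entirely
        if block.maximum < threshold:
            count += len(block.values)  # every row matches: metadata suffices
        else:
            blocks_read += 1  # block straddles the threshold: scan it
            count += sum(1 for v in block.values if v < threshold)
    return count, blocks_read


blocks = [
    make_block([1800, 2200, 2400]),  # all rows below 2500
    make_block([2300, 2600, 3000]),  # straddles 2500
    make_block([3500, 4000, 4200]),  # all rows at or above 2500
]
result = count_less_than(blocks, 2500)
print(result)  # (4, 1)
```

Only one of the three blocks is scanned. The more closely the data is clustered on the filtered column, the fewer blocks straddle the threshold and the more of the query is answered from metadata.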
Execution Engine
Parses execution plans, manages I/O thread pools, and handles memory allocation.
Supports transaction logging, SMP‑based concurrent queries, and physical file management.
High‑Efficiency Compression
Compression is column‑type aware: PPM for strings, predictive range coding for numeric types, and custom algorithms for specific patterns (e.g., IP addresses, URLs). Because each column holds values of a single type, similar values sit next to each other and compress well. Fixed‑size data blocks (≈128 KB) maximize compression throughput and reduce storage.
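The block‑wise scheme can be sketched with a general‑purpose compressor. Note that this example substitutes zlib from the Python standard library for the PPM and range‑coding algorithms named above, and assumes values are encoded as 8‑byte little‑endian integers; it illustrates independent compression of fixed‑size blocks, not HiStore's actual codecs.

```python
import struct
import zlib

BLOCK_BYTES = 128 * 1024  # fixed block size of about 128 KB


def compress_column(values):
    """Encode an integer column and compress it block by block."""
    raw = struct.pack(f"<{len(values)}q", *values)  # 8 bytes per value
    blocks = [raw[i:i + BLOCK_BYTES]
              for i in range(0, len(raw), BLOCK_BYTES)]
    # Each block is compressed on its own, so it can be read independently.
    return [zlib.compress(block) for block in blocks]


def decompress_column(compressed_blocks):
    """Inverse of compress_column."""
    raw = b"".join(zlib.decompress(block) for block in compressed_blocks)
    return list(struct.unpack(f"<{len(raw) // 8}q", raw))


# Hypothetical low-cardinality numeric column: 40,000 values, 50 distinct.
values = [1000 + (i % 50) for i in range(40_000)]
compressed = compress_column(values)

raw_size = len(values) * 8
compressed_size = sum(len(block) for block in compressed)
print(len(compressed), raw_size, compressed_size < raw_size)
```

Because blocks are compressed independently, a query that needs one block decompresses only that block, which fits the block‑skipping behavior of the Knowledge Grid.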
Data Import and Pre‑Processing
HiStore provides an external import client that preprocesses heterogeneous sources (HBase, databases, etc.) outside the engine. The client performs compression and KG construction before loading, which allows ingestion rates of up to 2 TB/hour.
Approximate Query
For workloads tolerant of minor inaccuracies (e.g., top‑N queries), KG statistics enable approximate query processing that skips irrelevant data blocks, further accelerating response.
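One way such an approximation could work is to answer from block statistics alone and report a range instead of an exact value. The sketch below is an assumption about the general idea, not HiStore's documented algorithm; the function name and the (min, max, row count) tuple format are hypothetical.

```python
def approximate_count_less_than(block_stats, threshold):
    """Bound COUNT(*) WHERE value < threshold without reading any block.

    block_stats is a list of (minimum, maximum, row_count) tuples.
    Returns (lower_bound, upper_bound) on the exact count.
    """
    lower = 0
    upper = 0
    for minimum, maximum, row_count in block_stats:
        if minimum >= threshold:
            continue  # no row can match
        if maximum < threshold:
            lower += row_count  # every row matches
            upper += row_count
        else:
            upper += row_count  # unknown share of the block matches
    return lower, upper


block_stats = [
    (1800, 2400, 3),  # entirely below 2500
    (2300, 3000, 3),  # straddles 2500
    (3500, 4200, 3),  # entirely at or above 2500
]
bounds = approximate_count_less_than(block_stats, 2500)
print(bounds)  # (3, 6)
```

The exact answer is guaranteed to lie within the returned range. When few blocks straddle the threshold the range is tight, and no raw data has to be decompressed at all.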
Future Roadmap
Hybrid engine combining row‑engine for hot data and column‑engine for cold data.
Automatic data integrity verification.
Support for external import from HDFS/ODPS.
Online KG management and rebuilding tools.
Conclusion
HiStore delivers a low‑cost, low‑maintenance OLAP storage engine with high compression, knowledge‑grid‑driven query optimization, and scalability for petabyte‑scale analytics, positioning it as a competitive columnar solution in the big‑data ecosystem.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.