
How ByteHouse Achieves Hundred‑Fold OLAP Performance Gains

ByteHouse, a cloud‑native data warehouse built on ClickHouse, redesigns storage‑compute separation, introduces a new MPP architecture, rule‑based and cost‑based optimizers, exchange runtime filters, and parallelism techniques, delivering 10‑200× faster query performance on TPC‑DS, TPC‑H and SSB benchmarks and boosting point‑lookup QPS to 32,000.

Past Memory Big Data

Apr 19, 2024

ByteHouse is a cloud‑native data warehouse developed by ByteDance, extending the open‑source ClickHouse engine with a redesigned architecture that separates storage and compute, supports multi‑tenant management, and improves scalability, stability, and resource utilization.

Core Architectural Enhancements

Storage‑compute separation that minimizes performance loss while allowing independent scaling of storage and compute layers.

New generation MPP architecture combining shared‑nothing compute with shared‑everything storage, avoiding re‑sharding issues while retaining parallel processing.

ANSI‑SQL 2011 compliance with 100% pass rate on the TPC‑DS test suite.

Support for Python UDF/UDAF (Java UDF/UDAF under development) and a self‑developed cost‑based optimizer.

Complex Query Optimizations

Rule‑Based Optimizer (RBO) implements column pruning, partition pruning, expression simplification, sub‑query decorrelation, predicate push‑down, redundant operator elimination, outer‑join to inner‑join conversion, operator push‑down to storage, and distributed operator splitting. Compared with the community ClickHouse, ByteHouse fully supports sub‑query decorrelation, enabling all TPC‑DS queries to run.
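To make one of the rules listed above concrete, here is a minimal sketch of predicate push-down on a toy plan tree. The `Scan`, `Join`, and `Filter` classes and the `push_down` function are hypothetical names chosen for illustration, not ByteHouse internals, and the rewrite shown is only valid for inner joins.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class Scan:
    table: str
    columns: FrozenSet[str]


@dataclass(frozen=True)
class Join:  # toy inner join; the join condition is omitted for brevity
    left: "Plan"
    right: "Plan"


@dataclass(frozen=True)
class Filter:
    child: "Plan"
    column: str     # the single column the predicate reads
    predicate: str  # textual form only; this sketch never evaluates it


Plan = Union[Scan, Join, Filter]


def output_columns(plan):
    """Columns produced by a plan node."""
    if isinstance(plan, Scan):
        return plan.columns
    if isinstance(plan, Filter):
        return output_columns(plan.child)
    return output_columns(plan.left) | output_columns(plan.right)


def push_down(plan):
    """Move each Filter as close to the Scan that owns its column as possible."""
    if isinstance(plan, Scan):
        return plan
    if isinstance(plan, Join):
        return Join(push_down(plan.left), push_down(plan.right))
    child = push_down(plan.child)
    if isinstance(child, Join):
        # A filter on one side's column can run before the (inner) join.
        if plan.column in output_columns(child.left):
            pushed = push_down(Filter(child.left, plan.column, plan.predicate))
            return Join(pushed, child.right)
        if plan.column in output_columns(child.right):
            pushed = push_down(Filter(child.right, plan.column, plan.predicate))
            return Join(child.left, pushed)
    return Filter(child, plan.column, plan.predicate)
```

With a filter on `o_date` sitting above a join of `orders` and `items`, `push_down` returns a plan in which the filter wraps only the `orders` scan, so fewer rows reach the join.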

For non‑equijoin cases, ByteHouse evaluates the non‑equijoin predicate directly inside the join operator, achieving roughly a 2× speedup over the traditional outer‑join‑then‑filter approach.

RBO also optimizes multiple COUNT(DISTINCT) calculations by replicating data to increase parallelism.
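The article does not spell out how the data is replicated. One common way to do it, assumed here, is to expand each row into one tagged copy per distinct column, so that deduplication can be spread over (group, column, value) rather than over the group key alone. The function name and row layout below are illustrative only.

```python
from collections import defaultdict


def multi_count_distinct(rows, group_key, distinct_columns):
    """Compute several COUNT(DISTINCT col) values per group in one pass."""
    # Step 1 (expand): replicate each row once per distinct column and tag
    # the copy with the column it represents.
    expanded = [
        (row[group_key], col, row[col])
        for row in rows
        for col in distinct_columns
    ]
    # Step 2 (deduplicate): in a distributed engine this tuple would be the
    # shuffle key, which gives more partitions than the group key alone.
    unique = set(expanded)
    # Step 3 (count): each surviving tuple is one distinct value.
    counts = defaultdict(lambda: dict.fromkeys(distinct_columns, 0))
    for group, col, _value in unique:
        counts[group][col] += 1
    return dict(counts)
```

The trade-off is extra data volume (one copy per distinct column) in exchange for finer-grained parallel work.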

Cost‑Based Optimizer (CBO) uses a cascade search framework to generate physical plans while searching for the optimal solution. Join order enumeration is accelerated via join‑graph partitioning, and cost estimation relies on statistics.

For join‑reordering problems involving up to ten tables, ByteHouse enumerates optimal plans within seconds; larger joins fall back to heuristic methods, and mixed outer‑semi‑anti‑join reordering is supported.
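The split between exhaustive enumeration for small joins and a heuristic for large ones can be sketched as follows. This is a toy under stated assumptions: the cost model (sum of intermediate result sizes), the `selectivity` map keyed by alphabetically sorted table pairs, and both function names are invented for illustration and say nothing about ByteHouse's actual search or statistics.

```python
from itertools import combinations


def join_cardinality(tables, rows, selectivity):
    """Estimated row count of joining `tables`; missing pairs act as cross joins."""
    card = 1.0
    for t in tables:
        card *= rows[t]
    for pair in combinations(sorted(tables), 2):
        card *= selectivity.get(pair, 1.0)
    return card


def greedy_join_order(rows, selectivity):
    """Heuristic fallback: start small, always add the cheapest next table."""
    remaining = set(rows)
    first = min(sorted(remaining), key=lambda t: rows[t])
    remaining.remove(first)
    joined, plan, cost = {first}, first, 0.0
    while remaining:
        nxt = min(
            sorted(remaining),
            key=lambda t: join_cardinality(joined | {t}, rows, selectivity),
        )
        remaining.remove(nxt)
        joined.add(nxt)
        cost += join_cardinality(joined, rows, selectivity)
        plan = (plan, nxt)
    return cost, plan


def best_join_order(rows, selectivity, dp_limit=10):
    """Dynamic programming over table subsets, with a greedy fallback."""
    names = sorted(rows)
    if len(names) > dp_limit:
        return greedy_join_order(rows, selectivity)
    # best[subset] = (cost of cheapest plan, plan as nested tuples)
    best = {frozenset([t]): (0.0, t) for t in names}
    for size in range(2, len(names) + 1):
        for subset in combinations(names, size):
            whole = frozenset(subset)
            card = join_cardinality(whole, rows, selectivity)
            candidates = []
            for k in range(1, size // 2 + 1):
                for left in combinations(subset, k):
                    lset = frozenset(left)
                    rset = whole - lset
                    cost = best[lset][0] + best[rset][0] + card
                    candidates.append((cost, (best[lset][1], best[rset][1])))
            best[whole] = min(candidates, key=lambda c: c[0])
    return best[frozenset(names)]
```

The dynamic program examines every way of splitting every subset, which grows exponentially with the number of tables; that growth is the usual reason optimizers cap exhaustive search at around ten tables.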

CTE handling includes inline, shared, and partial‑inline cost calculations to find optimal plans, and a magic‑set placement technique pushes filters into aggregation to reduce hotspot computation.

Distributed Plan Generation merges the traditional two‑stage planning (single‑node plan then distributed plan) into a single phase that expands all distributed plans first and then selects the global‑cost optimal solution, reducing shuffle overhead by leveraging table metadata and data distribution.

The generated physical plan is split into multiple plan segments. Data transfer between segments uses a newly introduced exchange module with two layers: a data‑transfer layer (in‑process transfer, plus cross‑process transfer over BRPC streams with ordered delivery, status‑code propagation, compression, and connection‑pool reuse) and an operator layer (broadcast, repartition, gather, round‑robin, etc.). Runtime filters are built dynamically while the hash join constructs its build side and are then applied to the probe side, pruning irrelevant rows early.
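The runtime-filter idea can be sketched in a few lines: the join keys seen while building the hash table are reused to discard probe-side rows before they are joined (or, in a distributed plan, before they are exchanged). The function below is an illustrative single-process toy; a plain `set` stands in for the Bloom filter or min/max range a real engine would more likely ship across nodes.

```python
def hash_join_with_runtime_filter(build_rows, probe_rows, key):
    """Inner hash join that prunes the probe side with a runtime filter."""
    # Build phase: hash the (usually smaller) build side on the join key.
    hash_table = {}
    for row in build_rows:
        hash_table.setdefault(row[key], []).append(row)

    # The key set doubles as the runtime filter (assumption: exact set, not Bloom).
    runtime_filter = set(hash_table)

    # Probe-side scan: rows whose key cannot match are dropped early.
    surviving = [row for row in probe_rows if row[key] in runtime_filter]

    # Probe phase: only surviving rows reach the join itself.
    joined = [
        {**build_row, **probe_row}
        for probe_row in surviving
        for build_row in hash_table[probe_row[key]]
    ]
    return joined, len(surviving)
```

The second return value shows how many probe rows survived the filter, which is the quantity that determines how much shuffle and join work was saved.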

Wide‑Table Query Optimizations

ByteHouse introduces a global dictionary that encodes variable‑length strings into fixed‑length integers, allowing both aggregation and exchange operators to compute directly on encoded values across nodes.
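A minimal sketch of the idea, assuming a single dictionary shared by all nodes: strings are mapped to integer codes once, partial aggregates and the exchange between nodes work on the codes, and decoding happens only when the final result is produced. `GlobalDictionary` and `count_by_city` are hypothetical names for illustration.

```python
class GlobalDictionary:
    """Maps variable-length strings to dense integer codes and back."""

    def __init__(self):
        self._code_of = {}
        self._value_of = []

    def encode(self, value):
        code = self._code_of.get(value)
        if code is None:
            code = len(self._value_of)
            self._code_of[value] = code
            self._value_of.append(value)
        return code

    def decode(self, code):
        return self._value_of[code]


def count_by_city(cities_per_node, dictionary):
    """Two-stage COUNT(*) GROUP BY city computed on encoded values."""
    # Stage 1: each node aggregates locally on integer codes.
    # (In a real system the column would already be stored encoded.)
    partials = []
    for cities in cities_per_node:
        local = {}
        for city in cities:
            code = dictionary.encode(city)
            local[code] = local.get(code, 0) + 1
        partials.append(local)

    # Stage 2: the exchange and final merge still see only integers,
    # which is valid because every node shares the same dictionary.
    merged = {}
    for local in partials:
        for code, count in local.items():
            merged[code] = merged.get(code, 0) + count

    # Decode once, at the very end.
    return {dictionary.decode(code): count for code, count in merged.items()}
```

Because the codes are global rather than per-node, two nodes that see the same string produce the same integer, so partial results can be merged without decoding.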

Zero‑copy techniques reduce deep‑copy overhead, improving memory‑bandwidth utilization.

Optimizations to the uncompress cache mitigate lock contention when multiple threads access the cache concurrently.
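The article does not say how contention on the uncompress cache is reduced. A widely used approach, assumed here purely for illustration, is to split the cache into buckets that each have their own lock, so threads touching different buckets never wait on one another. `BucketedCache` is a hypothetical class, not a ByteHouse component.

```python
import threading


class BucketedCache:
    """Cache split into independently locked buckets to reduce contention."""

    def __init__(self, bucket_count=16):
        # Each bucket owns its data and its lock; there is no global lock.
        self._buckets = [({}, threading.Lock()) for _ in range(bucket_count)]

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def get_or_load(self, key, loader):
        data, lock = self._bucket(key)
        # Only threads that map to the same bucket serialize here.
        # Simplification: the loader runs while the bucket lock is held.
        with lock:
            if key not in data:
                data[key] = loader(key)
            return data[key]
```

With a single lock, every lookup from every thread queues on the same mutex; with N buckets and well-spread keys, the expected number of threads competing for any one lock drops roughly by a factor of N.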

Performance Benchmarks

Using 100 GB SSB, TPC‑H, and TPC‑DS datasets, ByteHouse is compared against a leading open‑source OLAP engine.

TPC‑DS 100 GB: Across the 99 queries, ByteHouse is more than 6× faster overall; on the subset that both engines complete, the gap in total time reaches 15.7×, with some queries (Q53, Q63, Q82) showing ~200× speedup.

TPC‑H 100 GB: Across the 22 queries, ByteHouse outperforms the competitor by over 100× on queries such as Q3, Q5, and Q7.

SSB 100 GB (wide‑table): ByteHouse achieves 3.6× higher throughput on wide‑table queries and significantly outperforms the competitor on multi‑table star‑schema joins.

High‑Concurrency Point‑Lookup Optimizations

ByteHouse addresses bottlenecks in index computation, point‑lookup read amplification, long execution chains, and lock contention.

Short‑circuit Execution Plan : For point‑lookup queries, ByteHouse generates a simplified plan that pushes LIMIT down, removes redundant predicates, and keeps only essential operators, merging the two‑stage plan into a single segment sent to the node that holds the target data.

Unique‑Table Point‑Lookup Index : An in‑memory KV structure maps predicate values to row identifiers, enabling direct row access without full table scans.
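As a rough sketch of such an index, a hash map from unique-key value to row identifier turns a point lookup into a single probe followed by a direct row fetch. The class and function names below are hypothetical, and a Python list stands in for the stored table.

```python
class UniqueKeyIndex:
    """In-memory map from a unique-key value to the row's identifier."""

    def __init__(self):
        self._row_of = {}

    def upsert(self, key, row_id):
        # A unique table keeps one live row per key, so later writes replace earlier ones.
        self._row_of[key] = row_id

    def lookup(self, key):
        return self._row_of.get(key)


def point_lookup(table_rows, index, key):
    """Fetch the row for `key` without scanning the table."""
    row_id = index.lookup(key)
    if row_id is None:
        return None
    return table_rows[row_id]
```

For example, after `index.upsert("u42", 1)`, calling `point_lookup(rows, index, "u42")` reads exactly one row, regardless of table size.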

Efficient Read Path : Lightweight partition pruning combined with unique‑table filtering determines the exact marks to read; a Min‑Max index on marks allows LIMIT‑driven early termination. Column‑store data benefits from a bucket cache that reduces lock contention, while row‑store data uses a dedicated cache layer.
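The interaction between a per-mark Min-Max index and LIMIT can be sketched as follows. The layout of `marks` as `(min_value, max_value, values)` tuples is an assumption made for this toy; real marks reference ranges of rows on disk rather than holding values inline.

```python
def scan_with_mark_pruning(marks, lo, hi, limit):
    """Return up to `limit` values in [lo, hi] and the number of marks read."""
    if limit <= 0:
        return [], 0
    result = []
    marks_read = 0
    for min_value, max_value, values in marks:
        # Min-Max check: skip the mark if its range cannot overlap [lo, hi].
        if max_value < lo or min_value > hi:
            continue
        marks_read += 1
        for value in values:
            if lo <= value <= hi:
                result.append(value)
                # LIMIT-driven early termination: stop as soon as enough rows are found.
                if len(result) == limit:
                    return result, marks_read
    return result, marks_read
```

In the common point-lookup case (`LIMIT 1` on a narrow range), this reads a single mark even when many later marks would also have matched.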

Prepared statements further reduce parser and plan‑building overhead.
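The saving comes from parsing and planning a statement once and reusing the plan for every later execution with different parameters. The sketch below models that with a cache keyed by the SQL template; `PreparedStatementCache` and `toy_parse_and_plan` are hypothetical stand-ins, and the "plan" is just a Python callable.

```python
class PreparedStatementCache:
    """Parses and plans each SQL template once, then reuses the plan."""

    def __init__(self, parse_and_plan):
        self._parse_and_plan = parse_and_plan
        self._plans = {}
        self.parse_count = 0  # how many times the expensive step actually ran

    def execute(self, sql_template, params):
        plan = self._plans.get(sql_template)
        if plan is None:
            # Expensive path: taken only on the first execution of a template.
            self.parse_count += 1
            plan = self._parse_and_plan(sql_template)
            self._plans[sql_template] = plan
        # Cheap path: bind parameters and run the cached plan.
        return plan(*params)


def toy_parse_and_plan(sql_template):
    """Stand-in for a parser and planner; ignores the SQL and looks up a fixed table."""
    users = {1: "alice", 2: "bob"}
    return lambda user_id: users.get(user_id)
```

Executing `"SELECT name FROM users WHERE id = ?"` with `(1,)` and then `(2,)` leaves `parse_count` at 1, which mirrors the removal of AST parsing and plan building from the per-query path.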

QPS Improvements

After lock and pipeline optimizations: 5,500–8,000 QPS.

With the simplified execution plan: 18,000 QPS.

Removing AST parsing and plan‑build time: 32,000 QPS.

Test environment: 32‑core CPU, 128 GB RAM, 1 TB SSD, processing 100 million rows.

Conclusion

Through a series of architectural redesigns and optimizer enhancements (RBO, CBO, the exchange module, runtime filters, the global dictionary, zero‑copy, and point‑lookup indexing), ByteHouse delivers order‑of‑magnitude performance gains for real‑time data warehousing, complex queries, wide‑table analytics, and high‑concurrency point‑lookups, enabling faster, more resource‑efficient analytics across diverse workloads.


