From ClickHouse to ByteHouse: Real‑Time Data Analytics Optimization Practices

This article describes ByteDance's large‑scale ClickHouse deployment through two real‑time analytics use cases: recommendation metrics and ad delivery. For each, it explains the performance bottlenecks encountered and the engineering solutions applied, including asynchronous index building, a multi‑threaded Kafka Engine, and an enhanced Buffer Engine, which together raised write throughput and protected data integrity.

Past Memory Big Data

Feb 1, 2023

Real‑time Recommendation Metrics

Requirements: simultaneous aggregate and detail queries, support for hundreds of dimensions, fast ID filtering, and machine‑learning metrics such as AUC.

Technology selection

ClickHouse provides low‑latency aggregation and, with skip indexes, acceptable point‑query performance.

ByteDance‑custom ClickHouse adds Map type support for dynamic dimensions.

Built‑in BitSet Bloom filter enables efficient ID filtering.

UDF extensions allow implementation of required ML metric calculations.
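The article does not show the UDFs themselves. As a rough illustration of the kind of metric such an extension would compute, here is a minimal Python sketch of AUC, defined as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The function name `auc` and its inputs are assumptions for illustration, not ByteHouse's API, and the pairwise loop is chosen for clarity rather than speed.

```python
def auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Two positives (0.9, 0.4) against two negatives (0.8, 0.1).
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))
```

A production implementation would more likely sort once and use rank sums, which avoids the quadratic pair comparison.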

Architecture

The recommendation system writes data to Kafka topics. ClickHouse’s built‑in Kafka engine consumes the topics, adapts the schema, and stores the data in tables, which an interactive BI platform then queries. As a fallback, data can be imported from Hive, and a 1 % offline sample is retained for validation.

Write‑throughput bottleneck

Building the auxiliary skip indexes is expensive, and because it happens as part of Part creation, every write waits for its indexes and write throughput is limited.

Solution: after columns and data files are written, the Part is placed into an asynchronous index‑building queue; a background thread builds the skip‑index files.

Result: write throughput increased by approximately 20 %.
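The idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not ByteHouse code: `Part`, `AsyncIndexBuilder`, and `build_skip_index` are hypothetical names, and a min/max pair stands in for a real skip index.

```python
import queue
import threading


class Part:
    """Hypothetical stand-in for a data part."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows
        self.skip_index = None  # filled in later by the background thread


def build_skip_index(part):
    # Toy min/max index; real skip indexes are far richer.
    return (min(part.rows), max(part.rows))


class AsyncIndexBuilder:
    def __init__(self):
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        # Background thread: drain the queue and build one index at a time.
        while True:
            part = self._queue.get()
            try:
                part.skip_index = build_skip_index(part)
            finally:
                self._queue.task_done()

    def write_part(self, name, rows):
        # The part is returned as soon as its data is "written";
        # index construction is deferred to the queue.
        part = Part(name, list(rows))
        self._queue.put(part)
        return part

    def wait_for_indexes(self):
        self._queue.join()


builder = AsyncIndexBuilder()
part = builder.write_part("part_1", [5, 3, 9])
builder.wait_for_indexes()
print(part.name, part.skip_index)
```

The writer never waits on `build_skip_index`, which is the property the solution relies on. The trade-off is that a freshly written part may be queried before its index exists, so the reader has to tolerate a missing index.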

Kafka consumption capacity

The community Kafka engine table consumes with a single thread, which leaves the node's resources under‑utilised.

Solution: redesign the Kafka engine to host multiple consumer threads, each with its own consumer handling parsing and insertion, effectively parallelising INSERT operations.

Result: write performance scales close to linearly with the number of threads.
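A minimal Python sketch of this layout follows, with in-memory lists standing in for Kafka partitions. The names `parse` and `consume_in_parallel`, the message format, and the round-robin partition assignment are all assumptions made for illustration.

```python
import threading


def parse(message):
    # Hypothetical message format: "user_id,clicks".
    user_id, clicks = message.split(",")
    return int(user_id), int(clicks)


def consume_in_parallel(partitions, num_threads):
    """Give each thread its own partitions; each parses and inserts its own batch."""
    table = []
    table_lock = threading.Lock()

    def worker(worker_id):
        # Each worker acts as an independent consumer for its partitions.
        batch = []
        for index in range(worker_id, len(partitions), num_threads):
            batch.extend(parse(m) for m in partitions[index])
        with table_lock:  # one insert per worker batch
            table.extend(batch)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table


rows = consume_in_parallel([["1,2", "2,3"], ["3,4"], ["4,5"]], num_threads=2)
print(sorted(rows))
```

Because parsing and batching happen inside each worker, only the final insert is shared. This sketch shows the structure only; it does not demonstrate the near-linear scaling reported above.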

Data integrity in primary‑replica mode

When both replicas write simultaneously, node failures can cause performance degradation or incorrect query results.

Solution: integrate ReplicatedMergeTree’s ZooKeeper‑based leader election into the Kafka engine so that only the elected replica consumes data; the other replica remains standby.
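The election behaviour can be modelled with a small Python sketch. `Coordinator` is an in-memory stand-in for the ZooKeeper election path, and `Replica` is a hypothetical class; neither reflects the actual engine code. Replication of parts to the standby is not modelled.

```python
class Coordinator:
    """In-memory stand-in for a ZooKeeper election path (hypothetical)."""

    def __init__(self):
        self._candidates = []  # registration order decides leadership

    def register(self, replica):
        self._candidates.append(replica)

    def unregister(self, replica):
        self._candidates.remove(replica)

    def leader(self):
        return self._candidates[0] if self._candidates else None


class Replica:
    def __init__(self, name, coordinator):
        self.name = name
        self.coordinator = coordinator
        self.rows = []
        coordinator.register(self)

    def poll(self, messages):
        # Only the elected replica consumes; the standby skips the batch.
        if self.coordinator.leader() is self:
            self.rows.extend(messages)
            return True
        return False

    def fail(self):
        # A failed replica drops out, so the next candidate becomes leader.
        self.coordinator.unregister(self)


zk = Coordinator()
a, b = Replica("a", zk), Replica("b", zk)
print(a.poll([1, 2]), b.poll([1, 2]))  # only one replica consumes
a.fail()
print(b.poll([3]))  # the standby takes over after the failure
```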

Real‑time Advertising Delivery Data

After migrating a Druid‑based ad‑delivery pipeline to ClickHouse, two issues were observed:

Buffer Engine could not be used together with ReplicatedMergeTree, leading to inconsistent query results across replicas.

ClickHouse lacks transaction support, so a crash during a batch write could cause data loss or duplicate consumption.

Buffer Engine integration

Three tables—Kafka, Buffer, and MergeTree—are combined. The Buffer table is embedded inside the Kafka engine and can be toggled on/off.

Buffer processes Blocks in a pipeline fashion. When ReplicatedMergeTree is used, only one replica holds data in the Buffer table. Queries that reach the non‑consuming replica invoke special logic to read from the other replica’s Buffer table, ensuring consistency.
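The query-routing rule can be illustrated with a toy Python model. `ReplicaNode` and `pair` are hypothetical names, and replication is reduced to copying flushed rows to the peer; the article does not describe the actual mechanism in this detail.

```python
class ReplicaNode:
    """Hypothetical replica holding flushed parts plus an in-memory buffer."""

    def __init__(self, name, is_consumer):
        self.name = name
        self.is_consumer = is_consumer
        self.parts = []   # rows already flushed and replicated
        self.buffer = []  # rows still held in memory
        self.peer = None

    def consume(self, rows):
        if not self.is_consumer:
            raise RuntimeError("only the consuming replica buffers rows")
        self.buffer.extend(rows)

    def flush(self):
        # Flushed rows become a part, which replication copies to the peer.
        flushed, self.buffer = self.buffer, []
        self.parts.extend(flushed)
        if self.peer is not None:
            self.peer.parts.extend(flushed)

    def query(self):
        # The non-consuming replica has no buffered rows of its own,
        # so it reads the consuming peer's buffer instead.
        source = self if self.is_consumer else self.peer
        return sorted(self.parts + source.buffer)


def pair(consumer_name, standby_name):
    consumer = ReplicaNode(consumer_name, is_consumer=True)
    standby = ReplicaNode(standby_name, is_consumer=False)
    consumer.peer, standby.peer = standby, consumer
    return consumer, standby


consumer, standby = pair("r1", "r2")
consumer.consume([1, 2])
consumer.flush()
consumer.consume([3])  # still only in the consumer's buffer
print(consumer.query() == standby.query())
```

Without the routing in `query`, the standby would return only the flushed rows and the two replicas would disagree until the next flush.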

Atomic offset‑part writes

The Kafka offset and the Part data for a batch are bound together and written in a single transaction. If the write fails, the transaction rolls back both the offset and the Part, and the batch is consumed again.

This guarantees that each batch of inserts is atomic and stabilises consumption.
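A minimal Python sketch of this commit-or-rollback behaviour is shown below. `Store`, `commit_batch`, and `consume` are hypothetical names, the crash is injected through a flag, and a snapshot copy stands in for a real transaction log.

```python
import copy


class Store:
    """Hypothetical store that commits parts and the Kafka offset together."""

    def __init__(self):
        self.parts = []
        self.offset = 0

    def commit_batch(self, rows, new_offset, fail=False):
        snapshot = (copy.deepcopy(self.parts), self.offset)
        try:
            self.parts.append(list(rows))
            if fail:
                raise RuntimeError("simulated crash mid-write")
            self.offset = new_offset
        except RuntimeError:
            # Roll back the part and the offset together.
            self.parts, self.offset = snapshot
            return False
        return True


def consume(store, log, batch_size, fail_once_at=None):
    failed = False
    while store.offset < len(log):
        start = store.offset
        batch = log[start:start + batch_size]
        inject = start == fail_once_at and not failed
        if not store.commit_batch(batch, start + len(batch), fail=inject):
            failed = True  # retry the same batch from the unchanged offset


store = Store()
consume(store, list(range(5)), batch_size=2, fail_once_at=2)
print(store.parts, store.offset)
```

Because the offset only advances when the part is kept, the retried batch is neither lost nor stored twice, which is the guarantee described above.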

Conclusion

Four optimisations extend ClickHouse’s real‑time analytics capabilities in ByteHouse: asynchronous skip‑index construction, multi‑threaded Kafka consumption, ZooKeeper‑based leader election in the Kafka engine, and Buffer Engine enhancements with atomic offset‑part writes. Together they support large‑scale, fast‑growing data workloads.


