How ByteDance Scales ClickHouse to 15,000 Nodes and 600 PB: Big Data Lessons

In an InfoQ interview, ByteDance’s data platform leader explains how the company operates over 15,000 ClickHouse nodes storing more than 600 PB of data, the architectural choices behind its analytics workloads, cloud‑native transformations, and the practical lessons learned from large‑scale deployments.

Volcano Engine Developer Services

Nov 10, 2021

ByteDance Data Application Product

ByteDance, the largest ClickHouse user in China, runs over 15,000 ClickHouse nodes managing more than 600 PB of data, and its biggest single cluster exceeds 2,400 nodes. Most of the company’s growth‑driven analytics rely on ClickHouse as the core query engine.

Guo Dongdong, head of data platform applications, discusses the business scenarios where ClickHouse is used, why it was chosen over other technologies, and the pitfalls encountered during its adoption.

Q: How did your experience building big‑data platforms at Qihoo 360 differ from ByteDance?

Guo Dongdong: At 360 the focus was on introducing Hadoop, HBase and other ecosystem tools to replace legacy databases. At ByteDance the ecosystem is mature, so the emphasis shifted to systematic platform construction and scaling the analysis engine to meet massive data growth.

Q: What is the relationship between your data‑application product and the Volcano Engine data middle platform?

Guo Dongdong: The data middle platform standardizes and processes data, while the data‑application layer builds business‑facing capabilities such as BI, A/B testing, behavior analysis, and visualization on top of that foundation.

Q: How do you iterate on data‑application products?

Guo Dongdong: We follow an agile development cycle of two to three weeks, delivering small, fast increments that turn user requirements into product features, supported by a comprehensive demand‑management and R&D governance system.

Q: Can you describe the tech stack behind a typical data‑application product?

Guo Dongdong: Using the A/B testing platform as an example, the architecture includes metric construction, data‑sharding, and a core query engine. The stack combines containerized deployment with languages such as Python and Go.
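
One way to picture how data sharding and metric construction fit together is the sketch below. It assumes, purely for illustration, that events are sharded by a hash of the user ID and that the metric is "distinct users per variant"; the interview does not state the actual sharding key or metric definitions, and the names `shard_for` and `per_variant_users` are hypothetical.

```python
import hashlib


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user ID to a shard deterministically (assumed sharding key)."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def per_variant_users(events, num_shards):
    """Count distinct users per experiment variant across shards.

    `events` is an iterable of (user_id, variant) pairs.
    """
    # Route every event to the shard that owns its user.
    shards = [dict() for _ in range(num_shards)]
    for user_id, variant in events:
        shard = shards[shard_for(user_id, num_shards)]
        shard.setdefault(variant, set()).add(user_id)

    # Each shard deduplicates locally. Because a user lives on exactly
    # one shard, the per-shard distinct counts can simply be summed.
    totals = {}
    for shard in shards:
        for variant, users in shard.items():
            totals[variant] = totals.get(variant, 0) + len(users)
    return totals


events = [("u1", "A"), ("u1", "A"), ("u2", "B"), ("u3", "A")]
totals = per_variant_users(events, num_shards=4)
```

The design point this sketch tries to show is that co-locating all of a user's events on one shard lets a distinct-user metric be merged by addition, with no cross-shard deduplication step.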

Q: Why has ClickHouse gained popularity only recently?

Guo Dongdong: Open‑source technologies need time to mature and be adopted. Large‑scale usage by companies like ByteDance accelerates this process, and ClickHouse’s strong analytical capabilities address the need for fine‑grained data analysis.

Q: Which business scenarios at ByteDance rely on ClickHouse?

Guo Dongdong: ClickHouse underpins many downstream capabilities, including BI, A/B testing analytics, behavior analysis, and advertising effectiveness measurement.

Q: What factors led you to select ClickHouse over alternatives?

Guo Dongdong: After evaluating Presto, Kylin, Druid and others, ClickHouse stood out for its superior query performance in fixed‑panel scenarios, flexibility, simple yet efficient execution engine, and vectorized execution.
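
To make the idea of vectorized execution concrete, here is a minimal sketch in plain Python that contrasts row-at-a-time evaluation with column-at-a-time (batched) evaluation. It only illustrates the processing model: a real engine such as ClickHouse runs each operator over column blocks in native code with SIMD, which this sketch does not attempt. The column names and the batch size are made up for the example.

```python
from array import array

# Hypothetical columnar table: each column is a contiguous typed array.
durations_ms = array("q", [120, 450, 80, 900, 300])
is_click = array("b", [1, 0, 1, 1, 0])

# The same data laid out as rows, for the row-at-a-time variant.
rows = list(zip(durations_ms, is_click))


def sum_clicked_rowwise(rows):
    """Row-at-a-time: evaluate the filter and the aggregate once per row."""
    total = 0
    for duration, clicked in rows:
        if clicked:
            total += duration
    return total


def sum_clicked_batched(duration_col, click_col, batch_size=2):
    """Column-at-a-time: run each operator over a whole batch of values."""
    total = 0
    for start in range(0, len(duration_col), batch_size):
        d = duration_col[start:start + batch_size]
        c = click_col[start:start + batch_size]
        # Filter operator: produce the selected positions for this batch.
        selected = [i for i, flag in enumerate(c) if flag]
        # Aggregate operator: consume only the selected positions.
        total += sum(d[i] for i in selected)
    return total


rowwise_total = sum_clicked_rowwise(rows)
batched_total = sum_clicked_batched(durations_ms, is_click)
```

Both functions compute the same sum of `durations_ms` over clicked rows. The batched form amortizes per-operator overhead across many values and only touches the columns the query needs, which is the property that makes this model a good fit for analytical scans.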

Q: How has the ClickHouse solution evolved internally?

Guo Dongdong: We have scaled individual clusters to thousands of machines, optimized complex query paths, and carried out a cloud‑native transformation that separates storage from compute, adds container‑based scaling, and strengthens transaction support and real‑time writes.

Q: What operational challenges did you encounter?

Guo Dongdong: Early ClickHouse deployments lacked robust ops tooling, so we built a comprehensive management system for node health, failover, and data ingestion, as well as improvements for real‑time capabilities.

Q: What trends have you observed in big‑data analytics over the past few years?

Guo Dongdong: Open‑source solutions like ClickHouse and cloud‑native data warehouses (e.g., Snowflake) have driven rapid innovation, with increasing emphasis on AI‑enabled analytics, lake‑house architectures, and the separation of storage and compute.

Q: How do Lambda and Kappa architectures coexist at ByteDance?

Guo Dongdong: Both are used: Lambda for large‑scale, complex offline‑plus‑realtime scenarios such as anti‑fraud, and Kappa for real‑time data lake workloads where a unified pipeline suffices, resorting to offline recomputation only for major data corrections.
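
The difference between the two architectures can be sketched in a few lines of Python. This is a toy model under stated assumptions, not ByteDance's pipeline: the metric (events per user), the event shape, and the names `lambda_query` and `KappaPipeline` are all hypothetical, and the `replay` method stands in for the recomputation step mentioned above.

```python
from collections import Counter


def batch_view(historical_events):
    """Offline layer: recompute counts from the full historical log."""
    return Counter(e["user"] for e in historical_events)


def speed_view(recent_events):
    """Real-time layer: counts for events the batch job has not seen yet."""
    return Counter(e["user"] for e in recent_events)


def lambda_query(historical_events, recent_events):
    """Lambda: two code paths, merged at query time."""
    return batch_view(historical_events) + speed_view(recent_events)


class KappaPipeline:
    """Kappa: one streaming code path handles every event."""

    def __init__(self):
        self.counts = Counter()

    def process(self, event):
        self.counts[event["user"]] += 1

    def replay(self, log):
        """Correction: rebuild state by running the log through the same logic."""
        self.counts = Counter()
        for event in log:
            self.process(event)


historical = [{"user": "a"}, {"user": "b"}]
recent = [{"user": "a"}]

lambda_result = lambda_query(historical, recent)

kappa = KappaPipeline()
for event in historical + recent:
    kappa.process(event)
kappa_result = Counter(kappa.counts)

# A corrected log (one bad event dropped) is handled by replaying it.
kappa.replay([{"user": "a"}, {"user": "a"}])
corrected_result = Counter(kappa.counts)
```

In the Lambda sketch the batch and speed layers are separate functions whose results must be reconciled, while the Kappa sketch keeps a single processing function and treats corrections as a replay of the log.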
