How to Build a Billion-Scale Real-Time Data Warehouse with ClickHouse
This article explains how a large‑scale advertising platform replaced its slow offline data‑warehouse with a ClickHouse‑based real‑time warehouse, covering data source integration, performance comparison, materialized views, projections, schema management, and cost‑effective hot‑cold storage strategies.
Background and Problem Statement
Our advertising platform processes hundreds of billions of daily events. The existing offline warehouse (Athena/Spark/Hadoop) stored data in S3 as CSV, TXT, Parquet, ORC and required >20 seconds for TB‑scale queries, making it unsuitable for real‑time analytics and costly in server resources.
Requirements for a Real‑Time Warehouse
We needed a solution that could:
Fuse multi‑source data (real‑time streams, offline files, auxiliary databases) quickly.
Support low‑latency queries for BI, reporting, and alerting.
Control server costs.
After evaluating options, we identified seven key criteria; points 3 and 4 (fast ingestion and low‑latency analysis) were the hardest.
Choosing ClickHouse over StarRocks
Performance tests on ~50 TB (≈14 billion rows) showed ClickHouse had better query stability and speed. Additional advantages:
Native support for data ingestion from Kafka, S3, HTTP, JDBC.
Federated queries to MySQL/MongoDB.
Real‑time aggregation via materialized views .
Projection‑based pre‑aggregation for faster reads.
Built‑in data‑level permissions.
Data Ingestion Architecture
We ingest data from four sources:
Real‑time Kafka streams → ClickHouse tables.
Offline files → S3 tables or Kafka‑converted streams.
MySQL via JDBC engine.
MongoDB via MongoDB engine.
Key DDL examples:
CREATE TABLE user_table (
id Int32,
user_name String,
height Float32,
password Nullable(String)
) ENGINE JDBC('jdbc:mysql://localhost:3306/?user=root&password=root', 'test', 'test'); SELECT * FROM user_table; INSERT INTO user_table(id, user_name) VALUES (1, 'alex'); CREATE TABLE testdb.test_collection (
id UInt64,
name String
) ENGINE = MongoDB('127.0.0.1:27017', 'testdb', 'test_collection', 'user', 'pwd'); SELECT COUNT() FROM test_collection; CREATE TABLE s3_table (name String, value UInt32)
ENGINE = S3('https://test-datasets.s3.amazonaws.com/my-test-bucket/test-data.csv.gz', 'CSV', 'gzip')
SETTINGS input_format_with_names_use_header = 0; INSERT INTO s3_table VALUES ('a',1),('b',2),('c',3); SELECT * FROM s3_table LIMIT 2; CREATE TABLE kafka_source_test (
level String,
type String,
name String,
time DateTime64
) ENGINE = Kafka SETTINGS kafka_broker_list = 'localhost:9092',
kafka_topic_list = 'test_topic',
kafka_group_name = 'test_group',
kafka_format = 'JSONEachRow',
kafka_num_consumers = 4; SELECT * FROM kafka_source_test LIMIT 5;Real‑Time Analysis Techniques
Two main query patterns:
Report‑driven analysis: ingest‑and‑query simultaneously using materialized views or ClickHouse projection for pre‑aggregated results.
Ad‑hoc analysis: leverage ClickHouse’s vectorized engine to scan TB‑scale tables in seconds.
Materialized Views
We create a base table for raw Kafka data, then a SummingMergeTree table to store daily type counts, and a materialized view that aggregates incoming rows:
CREATE TABLE statistics_type_day (
type String,
day Date,
type_count UInt32
) ENGINE = SummingMergeTree() ORDER BY (type, day); CREATE MATERIALIZED VIEW IF NOT EXISTS statistics_type_day_mv TO statistics_type_day AS
SELECT type, toDate(time) AS day, count() AS type_count FROM kafka_source_test GROUP BY type, day; SELECT * FROM statistics_type_day WHERE day='2022-03-20';Projections
Projections store aggregated data alongside the raw table, enabling the optimizer to answer suitable queries without touching the base table.
CREATE TABLE test_projection_table (
level String,
type String,
name String,
city String,
time DateTime64,
PROJECTION proj_by_level (
SELECT name, count() GROUP BY level
),
PROJECTION proj_by_type (
SELECT name, count() GROUP BY type
)
) ENGINE = MergeTree() ORDER BY (name, level, type); ALTER TABLE test_projection_table ADD PROJECTION proj_by_type_level (
SELECT count(), name GROUP BY type, level
);Queries that match the projection’s SELECT and GROUP BY columns are automatically routed to the projection (visible via EXPLAIN showing ReadFromStorage (MergeTree with projection)).
Metadata Change Detection
ClickHouse cannot auto‑detect schema changes. We built a monitoring service that watches INSERT statements, alerts developers on new columns, and prompts manual schema updates.
Federated Queries
External MySQL or MongoDB tables are accessed directly via ClickHouse’s JDBC or MongoDB engines, allowing seamless joins across heterogeneous data sources.
Cost‑Effective Hot‑Cold Separation
Hot data resides in ClickHouse for low‑latency access, while cold data is offloaded to S3. This strategy keeps storage costs manageable while preserving query performance for recent data.
Conclusion
By leveraging ClickHouse’s materialized views, projections, and native connectors, we built a real‑time data warehouse that meets the platform’s latency, scalability, and cost requirements without the complexity of traditional multi‑layer ODS/DWD/DWS architectures.
Future work includes improving automatic metadata evolution, distributed transaction support, and deeper vectorized computation optimizations.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
