Achieving Billion‑Row Second‑Level Queries with ClickHouse Real‑Time Engine
JD’s Algorithmic Intelligence team built a ClickHouse‑based real‑time analytics engine that ingests Kafka and offline data, uses MergeTree tables with strategic partitioning and sorting, and employs batch writes, materialized views, and monitoring to achieve second‑level queries over billions of rows.
Introduction
JD’s Algorithmic Intelligence department built a real‑time analytics engine on ClickHouse to provide up‑to‑date online data and early warnings. The engine aggregates resource‑position data with second‑level query latency while handling tens of billions of rows.
ClickHouse Overview
ClickHouse is a column‑oriented DBMS originally developed by Yandex for online traffic analysis. Columnar storage enables high compression and fast scans, allowing sub‑second queries on tables with hundreds of trillions of rows. Key capabilities include full DBMS operations, dynamic real‑time aggregation, and rich SQL support with many table engines.
Typical Workloads
The engine is suited for read‑heavy scenarios where data is written in bulk and rarely updated or deleted. Common use cases are BI for advertising, app traffic, IoT, and click‑stream monitoring.
MergeTree Engine
MergeTree (and its variants) provides primary‑key indexing and partitioning. A typical partition definition uses a single date column, e.g. toYYYYMMDD(event_ts), which limits the amount of data scanned. The ORDER BY clause defines the sorting key, e.g. ORDER BY (button_id, product_id), ensuring rows with the same key are stored adjacently for fast aggregation.
Physical storage hierarchy:
Primary index files (sparse index kept in memory)
Compressed column files
Column mark files (offsets for each column block)
Data Write Process
Each incoming batch creates a new partition directory; identical partitions are merged asynchronously. The write generates a primary index, column mark files, and compressed column files.
Query Execution
Queries use WHERE clauses that prune partitions, then the primary index locates row ranges, the mark files identify the needed column blocks, and only those blocks are decompressed.
If a query cannot use the index, ClickHouse scans all partitions, which can overload the cluster.
Table Design and Partitioning
Design tables based on query patterns. Choose a partition field (commonly a time field) and a sorting key that matches frequent aggregation dimensions.
CREATE TABLE resource_click (
event_ts DateTime,
button_id String,
product_id UInt32,
extra_info String
) ENGINE = ReplicatedMergeTree('/clickhouse/ck.test/tables/{layer}-{shard}/resource_click', '{replica}')
PARTITION BY toYYYYMMDD(event_ts)
ORDER BY (button_id, product_id);Proper partitioning and sorting enable sub‑second aggregation even with billions of rows.
Batch Insertion with JDBC
Batch writes reduce cluster load. Example Maven dependency and Java code:
<dependency>
<groupId>ru.yandex.clickhouse</groupId>
<artifactId>clickhouse-jdbc</artifactId>
<version>0.2.4</version>
</dependency> Class.forName(Ck.DRIVER);
Connection conn = DriverManager.getConnection(Ck.URL, Ck.USERNAME, Ck.PASSWORD);
conn.setAutoCommit(false);
PreparedStatement stmt = conn.prepareStatement(INSERT_SQL);
for (/* each batch */) {
stmt.set...(index, value);
stmt.addBatch();
}
stmt.executeBatch();
conn.commit();Query Optimization and Materialized Views
When data grows to hundreds of billions of rows, simple partitioning may be insufficient. Optimizations include:
Identifying slow SQL (e.g., heavy distinct‑count queries) and rewriting them.
Creating materialized views that pre‑aggregate data, trading storage for query speed.
Example materialized view that aggregates PV and UV per hour:
CREATE MATERIALIZED VIEW app_btn_event_mv ON CLUSTER test_cluster
ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/ck.test/tables/{layer}-{shard}/app_btn_event_mv', '{replica}')
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (button_id, toStartOfHour(event_time))
TTL event_time + toIntervalDay(3)
SETTINGS index_granularity = 8192 AS
SELECT toStartOfHour(event_time) AS hour,
button_id,
countState(uid) AS pv,
uniqState(uid) AS uv
FROM raw_btn_events
GROUP BY button_id, hour;Querying the view is fast because aggregation is already computed.
Monitoring and Alerts
A monitoring system tracks data ingestion rates, write errors (e.g., NullPointerException, ArrayIndexOutOfBoundsException), and cluster health metrics (CPU, memory, disk). Alerts are configured for Kafka backlog, operator failures, and ClickHouse resource thresholds via Grafana dashboards.
Conclusion
Understanding ClickHouse’s columnar storage, MergeTree mechanics, and query behavior enables the design of tables and SQL that achieve sub‑second aggregation on tens of billions of rows. Ongoing tuning, materialized views, and robust monitoring are essential for maintaining performance at the scale of hundreds of billions to a trillion rows.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
