Doris vs ClickHouse: Which Database Delivers Faster Writes and Queries?
This article presents a systematic performance comparison between Doris and ClickHouse, covering data ingestion speed, SQL syntax differences, hardware impact, and detailed query benchmarks across multiple scenarios, ultimately revealing that each system excels in different use cases.
1. Introduction
A community request sparked a formal benchmark comparing Doris and ClickHouse on the same dataset, focusing not only on query speed but also on SQL syntax differences between the two database engines.
2. Test Environment Setup
The same source data (≈24.6 million rows, 13 columns) was generated via Spark and written to:
A Doris Duplicate table named logs_from_spark01 A ClickHouse MergeTree table named logs_from_spark_cluster Both tables were distributed across three worker nodes with identical column definitions to ensure a fair comparison.
CREATE TABLE `logs_from_spark01` (
`client_ip` varchar(200),
`nature` varchar(20),
`province` varchar(30),
`city` varchar(30),
`operator` varchar(20),
`domain` varchar(200),
`time` varchar(200),
`target_ip` text,
`rcode` int(11),
`query_type` int(11),
`authority_record` varchar(500),
`add_msg` varchar(200),
`dns_ip` varchar(200)
) ENGINE=OLAP DUPLICATE KEY(`client_ip`,`nature`,`province`,`city`) COMMENT 'dns logs' DISTRIBUTED BY HASH(`client_ip`) BUCKETS AUTO; CREATE TABLE cluster_db.logs_from_spark_cluster ON CLUSTER template_3shard_2replica (
`client_ip` String,
`nature` String,
`province` String,
`city` String,
`operator` String,
`domain` String,
`time` String,
`target_ip` String,
`rcode` Int32,
`query_type` Int32,
`authority_record` String,
`add_msg` String,
`dns_ip` String
) ENGINE = Distributed('template_3shard_2replica','cluster_db','logs_from_spark_cluster_local',cityHash64(client_ip));
CREATE TABLE cluster_db.logs_from_spark_cluster_local ON CLUSTER template_3shard_2replica (
`client_ip` String,
`nature` String,
`province` String,
`city` String,
`operator` String,
`domain` String,
`time` String,
`target_ip` String,
`rcode` Int32,
`query_type` Int32,
`authority_record` String,
`add_msg` String,
`dns_ip` String
) ENGINE = MergeTree ORDER BY (client_ip, nature, province, city);3. Hardware Considerations
Both clusters consist of three nodes, but the Doris machines have significantly better CPU and memory resources (the article likens Doris to a Ferrari and ClickHouse to a BMW 3‑Series). Therefore, raw speed numbers must be interpreted in light of this hardware disparity.
4. Development Environment
The ingestion pipeline uses Spark Structured Streaming to read from Kafka and write to the two databases. Doris employs its native stream load API, while ClickHouse uses a custom JDBC sink. The full Spark code is omitted but referenced in earlier posts.
5. Data Write Performance
Inserting the same dataset took about 1.5 hours with Doris stream load and more than 3 hours with ClickHouse’s JDBC approach, giving Doris a clear advantage in write throughput.
6. Development Issues
A common pitfall was a jar conflict: the IDE pulled hadoop‑hdfs 2.7 while the cluster ran Hadoop 3.1. Excluding the older Hadoop jar resolved the errors.
7. Query Performance Benchmarks
7.1 Count distinct domains where target_ip is empty
ClickHouse can run a simple query, but Doris requires a nested query because its lower() function does not aggregate correctly in a single step.
SELECT lower(domain) AS domain, count(domain) AS cnt
FROM logs_from_spark_cluster
WHERE target_ip = '""'
GROUP BY domain;Result (ClickHouse):
Result (Doris – nested query):
Doris needed the extra nesting, making the query less convenient.
7.2 Top‑100 client IPs by access count
Both databases can share the same SQL:
SELECT t1.client_ip, t2.nature, t2.province, t2.city, t1.cnt
FROM (
SELECT client_ip, count(client_ip) AS cnt
FROM logs_from_spark01
GROUP BY client_ip
ORDER BY cnt DESC
LIMIT 100
) t1
JOIN logs_from_spark01 t2 ON t1.client_ip = t2.client_ip
GROUP BY client_ip, nature, province, city, cnt;Result screenshots are provided for both Doris and ClickHouse.
7.3 Count distinct on low‑cardinality field ( client_ip )
Both engines return the result in under one second, making them virtually indistinguishable for this workload.
7.4 Count distinct on high‑cardinality field ( domain )
Both systems slow down noticeably; after accounting for hardware differences, performance is roughly comparable.
7.5 Complex query – longest consecutive minutes per client IP
ClickHouse SQL (≈10 s):
SELECT client_ip, max(row_num2) AS max
FROM (
SELECT client_ip, row_num2, date_min
FROM (
SELECT client_ip, sub_date,
row_number() OVER (PARTITION BY client_ip, sub_date) AS row_num2,
date_min
FROM (
SELECT client_ip, date_min,
row_number() OVER (PARTITION BY client_ip ORDER BY date_min) AS row_num,
subtractMinutes(date_min, row_num) AS sub_date
FROM (
SELECT client_ip,
toStartOfMinute(parseDateTimeBestEffort(time)) AS date_min,
row_number() OVER (PARTITION BY client_ip, date_min ORDER BY date_min) AS row_num
FROM logs_from_spark_cluster
WHERE isIPv4String(client_ip)=1 AND length(time)=14 AND time LIKE '20220730%'
) A
WHERE A.row_num = 1
) B
) C
) D
GROUP BY client_ip
ORDER BY max DESC
LIMIT 100;Doris SQL (≈17 s) is longer and more nested:
SELECT client_ip, max(row_num2) AS max
FROM (
SELECT client_ip, row_num2, date_min
FROM (
SELECT client_ip, sub_date,
row_number() OVER (PARTITION BY client_ip, sub_date) AS row_num2,
date_min
FROM (
SELECT client_ip, date_min,
row_number() OVER (PARTITION BY client_ip ORDER BY date_min) AS row_num,
minutes_sub(to_date(date_min), row_num) AS sub_date
FROM (
SELECT client_ip,
minute_floor(time) AS date_min,
row_number() OVER (PARTITION BY client_ip, date_min ORDER BY date_min) AS row_num
FROM logs_from_spark01
WHERE length(time)=14 AND time LIKE '20220730%'
) A
WHERE A.row_num = 1
) B
) C
) D
GROUP BY client_ip
ORDER BY max DESC
LIMIT 100;Result images for both systems are included.
7.6 Simple query – list all distinct values of location and operator columns
The same SQL works on both platforms:
SELECT nature, province, city, operator
FROM logs_from_spark01
GROUP BY nature, province, city, operator;ClickHouse finishes faster than Doris for this case.
8. Overall Conclusions
Even though the two clusters run on different hardware, the benchmark shows:
Doris’s stream load ingestion is dramatically faster than ClickHouse’s JDBC approach.
For low‑cardinality count‑distinct queries, both databases perform similarly.
For high‑cardinality count‑distinct and some complex analytical queries, ClickHouse can be noticeably faster despite weaker hardware.
SQL convenience varies: ClickHouse often allows more concise statements, while Doris sometimes requires nested queries.
The key takeaway is that performance is workload‑dependent; one should choose the engine that aligns with the specific query patterns and hardware resources rather than assuming a universal winner.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
