Databases 15 min read

Doris vs ClickHouse: Which Database Delivers Faster Writes and Queries?

This article presents a systematic performance comparison between Doris and ClickHouse, covering data ingestion speed, SQL syntax differences, hardware impact, and detailed query benchmarks across multiple scenarios, ultimately revealing that each system excels in different use cases.

dbaplus Community

Oct 18, 2023

Doris vs ClickHouse: Which Database Delivers Faster Writes and Queries?

1. Introduction

A community request sparked a formal benchmark comparing Doris and ClickHouse on the same dataset, focusing not only on query speed but also on SQL syntax differences between the two database engines.

2. Test Environment Setup

The same source data (≈24.6 million rows, 13 columns) was generated via Spark and written to:

A Doris Duplicate table named logs_from_spark01 A ClickHouse MergeTree table named logs_from_spark_cluster Both tables were distributed across three worker nodes with identical column definitions to ensure a fair comparison.

CREATE TABLE `logs_from_spark01` (
  `client_ip` varchar(200),
  `nature` varchar(20),
  `province` varchar(30),
  `city` varchar(30),
  `operator` varchar(20),
  `domain` varchar(200),
  `time` varchar(200),
  `target_ip` text,
  `rcode` int(11),
  `query_type` int(11),
  `authority_record` varchar(500),
  `add_msg` varchar(200),
  `dns_ip` varchar(200)
) ENGINE=OLAP DUPLICATE KEY(`client_ip`,`nature`,`province`,`city`) COMMENT 'dns logs' DISTRIBUTED BY HASH(`client_ip`) BUCKETS AUTO;

CREATE TABLE cluster_db.logs_from_spark_cluster ON CLUSTER template_3shard_2replica (
  `client_ip` String,
  `nature` String,
  `province` String,
  `city` String,
  `operator` String,
  `domain` String,
  `time` String,
  `target_ip` String,
  `rcode` Int32,
  `query_type` Int32,
  `authority_record` String,
  `add_msg` String,
  `dns_ip` String
) ENGINE = Distributed('template_3shard_2replica','cluster_db','logs_from_spark_cluster_local',cityHash64(client_ip));

CREATE TABLE cluster_db.logs_from_spark_cluster_local ON CLUSTER template_3shard_2replica (
  `client_ip` String,
  `nature` String,
  `province` String,
  `city` String,
  `operator` String,
  `domain` String,
  `time` String,
  `target_ip` String,
  `rcode` Int32,
  `query_type` Int32,
  `authority_record` String,
  `add_msg` String,
  `dns_ip` String
) ENGINE = MergeTree ORDER BY (client_ip, nature, province, city);

3. Hardware Considerations

Both clusters consist of three nodes, but the Doris machines have significantly better CPU and memory resources (the article likens Doris to a Ferrari and ClickHouse to a BMW 3‑Series). Therefore, raw speed numbers must be interpreted in light of this hardware disparity.

4. Development Environment

The ingestion pipeline uses Spark Structured Streaming to read from Kafka and write to the two databases. Doris employs its native stream load API, while ClickHouse uses a custom JDBC sink. The full Spark code is omitted but referenced in earlier posts.

5. Data Write Performance

Inserting the same dataset took about 1.5 hours with Doris stream load and more than 3 hours with ClickHouse’s JDBC approach, giving Doris a clear advantage in write throughput.

6. Development Issues

A common pitfall was a jar conflict: the IDE pulled hadoop‑hdfs 2.7 while the cluster ran Hadoop 3.1. Excluding the older Hadoop jar resolved the errors.

7. Query Performance Benchmarks

7.1 Count distinct domains where target_ip is empty

ClickHouse can run a simple query, but Doris requires a nested query because its lower() function does not aggregate correctly in a single step.

SELECT lower(domain) AS domain, count(domain) AS cnt
FROM logs_from_spark_cluster
WHERE target_ip = '""'
GROUP BY domain;

Result (ClickHouse):

Result (Doris – nested query):

Doris needed the extra nesting, making the query less convenient.

7.2 Top‑100 client IPs by access count

Both databases can share the same SQL:

SELECT t1.client_ip, t2.nature, t2.province, t2.city, t1.cnt
FROM (
  SELECT client_ip, count(client_ip) AS cnt
  FROM logs_from_spark01
  GROUP BY client_ip
  ORDER BY cnt DESC
  LIMIT 100
) t1
JOIN logs_from_spark01 t2 ON t1.client_ip = t2.client_ip
GROUP BY client_ip, nature, province, city, cnt;

Result screenshots are provided for both Doris and ClickHouse.

7.3 Count distinct on low‑cardinality field ( client_ip )

Both engines return the result in under one second, making them virtually indistinguishable for this workload.

7.4 Count distinct on high‑cardinality field ( domain )

Both systems slow down noticeably; after accounting for hardware differences, performance is roughly comparable.

7.5 Complex query – longest consecutive minutes per client IP

ClickHouse SQL (≈10 s):

SELECT client_ip, max(row_num2) AS max
FROM (
  SELECT client_ip, row_num2, date_min
  FROM (
    SELECT client_ip, sub_date,
           row_number() OVER (PARTITION BY client_ip, sub_date) AS row_num2,
           date_min
    FROM (
      SELECT client_ip, date_min,
             row_number() OVER (PARTITION BY client_ip ORDER BY date_min) AS row_num,
             subtractMinutes(date_min, row_num) AS sub_date
      FROM (
        SELECT client_ip,
               toStartOfMinute(parseDateTimeBestEffort(time)) AS date_min,
               row_number() OVER (PARTITION BY client_ip, date_min ORDER BY date_min) AS row_num
        FROM logs_from_spark_cluster
        WHERE isIPv4String(client_ip)=1 AND length(time)=14 AND time LIKE '20220730%'
      ) A
      WHERE A.row_num = 1
    ) B
  ) C
) D
GROUP BY client_ip
ORDER BY max DESC
LIMIT 100;

Doris SQL (≈17 s) is longer and more nested:

SELECT client_ip, max(row_num2) AS max
FROM (
  SELECT client_ip, row_num2, date_min
  FROM (
    SELECT client_ip, sub_date,
           row_number() OVER (PARTITION BY client_ip, sub_date) AS row_num2,
           date_min
    FROM (
      SELECT client_ip, date_min,
             row_number() OVER (PARTITION BY client_ip ORDER BY date_min) AS row_num,
             minutes_sub(to_date(date_min), row_num) AS sub_date
      FROM (
        SELECT client_ip,
               minute_floor(time) AS date_min,
               row_number() OVER (PARTITION BY client_ip, date_min ORDER BY date_min) AS row_num
        FROM logs_from_spark01
        WHERE length(time)=14 AND time LIKE '20220730%'
      ) A
      WHERE A.row_num = 1
    ) B
  ) C
) D
GROUP BY client_ip
ORDER BY max DESC
LIMIT 100;

Result images for both systems are included.

7.6 Simple query – list all distinct values of location and operator columns

The same SQL works on both platforms:

SELECT nature, province, city, operator
FROM logs_from_spark01
GROUP BY nature, province, city, operator;

ClickHouse finishes faster than Doris for this case.

8. Overall Conclusions

Even though the two clusters run on different hardware, the benchmark shows:

Doris’s stream load ingestion is dramatically faster than ClickHouse’s JDBC approach.

For low‑cardinality count‑distinct queries, both databases perform similarly.

For high‑cardinality count‑distinct and some complex analytical queries, ClickHouse can be noticeably faster despite weaker hardware.

SQL convenience varies: ClickHouse often allows more concise statements, while Doris sometimes requires nested queries.

The key takeaway is that performance is workload‑dependent; one should choose the engine that aligns with the specific query patterns and hardware resources rather than assuming a universal winner.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

performance Big Data SQL ClickHouse Database Comparison Spark Doris

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.