
Importing Over 100 Million Kafka Rows into Doris and Benchmarking Against ClickHouse

This article surveys Doris's data import methods, focuses on the routine load approach for Kafka streams, describes how to handle mixed‑schema topics with the max_error_number parameter, and compares Doris and ClickHouse query performance on a 130‑million‑row dataset, highlighting each system's strengths and limitations.

ITPUB

0. Choosing an Import Method

Like ClickHouse, Doris supports multiple data ingestion mechanisms, and its official documentation lists a richer set of import options. This article does not compare those methods in detail because the documentation already covers them.

1. Import Steps

To load Kafka data into Doris, you first create an OLAP table that will hold the data, then define a ROUTINE LOAD task that pulls data from the Kafka topic into that table. The routine load task is analogous to ClickHouse's materialized view on a Kafka engine table.
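The two steps above can be sketched as a small Python helper that renders the statements to run against Doris. This is a minimal sketch, not the author's actual DDL: the database, table, job, broker, and topic names are hypothetical, the column list is shortened to three made-up columns, and it assumes the topic carries comma‑separated text.

```python
def routine_load_statements(db, table, job, brokers, topic, columns):
    """Render CREATE TABLE and CREATE ROUTINE LOAD statements as strings.

    `columns` is a list of (name, doris_type) pairs; the first column is
    used as both the sort key and the hash distribution column.
    """
    col_defs = ",\n    ".join(f"{name} {ctype}" for name, ctype in columns)
    col_names = ", ".join(name for name, _ in columns)
    key = columns[0][0]

    # Step 1: the OLAP table that receives the rows.
    create_table = (
        f"CREATE TABLE IF NOT EXISTS {db}.{table} (\n"
        f"    {col_defs}\n"
        f")\n"
        f"DUPLICATE KEY({key})\n"
        f"DISTRIBUTED BY HASH({key}) BUCKETS 10;"
    )

    # Step 2: the routine load job that keeps consuming the Kafka topic.
    create_load = (
        f"CREATE ROUTINE LOAD {db}.{job} ON {table}\n"
        f'COLUMNS TERMINATED BY ",",\n'
        f"COLUMNS({col_names})\n"
        f"FROM KAFKA (\n"
        f'    "kafka_broker_list" = "{brokers}",\n'
        f'    "kafka_topic" = "{topic}",\n'
        f'    "property.kafka_default_offsets" = "OFFSET_BEGINNING"\n'
        f");"
    )
    return create_table, create_load


# Hypothetical names and columns, for illustration only.
table_sql, load_sql = routine_load_statements(
    db="demo_db",
    table="access_log",
    job="access_log_job",
    brokers="kafka1:9092",
    topic="access_log_topic",
    columns=[
        ("client_ip", "VARCHAR(64)"),
        ("event_time", "DATETIME"),
        ("url", "VARCHAR(1024)"),
    ],
)
print(table_sql)
print(load_sql)
```

The printed statements can be pasted into a MySQL‑protocol client connected to the Doris FE; the job's progress is then visible through SHOW ROUTINE LOAD.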

2. Pain Points

The Kafka topic used in the test mixes two schemas: one with 11 fields and another with 9 fields. Doris's routine load does not provide a built‑in way to filter records by field count. The documented filtering options (columns_mapping, preceding_filter, and where_predicates) only allow column re‑ordering or value‑based conditions, neither of which can distinguish the two schemas.

3. Workaround

Since Doris cannot filter by field count, the author relies on the max_error_number property of routine load. Rows whose field count does not match the target table are treated as error rows, and max_error_number sets how many such rows the job tolerates before it stops instead of continuing to load. With a sufficiently large max_error_number, the 9‑field records are ingested while the 11‑field records are skipped as errors.
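The effect of this workaround can be modeled in a few lines of Python. This is a simplified sketch of the behavior described above, not Doris's implementation: it assumes comma‑separated rows and treats the whole input as a single error window, whereas Doris applies the threshold per sampling window. The function name and return shape are hypothetical.

```python
def filter_by_field_count(lines, expected_fields, max_error_number, sep=","):
    """Keep rows with the expected field count; count the rest as errors.

    Returns (accepted_rows, error_count, paused). `paused` becomes True as
    soon as the error count exceeds max_error_number, modeling a job that
    stops rather than dropping an unlimited number of bad rows.
    """
    accepted = []
    errors = 0
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) == expected_fields:
            accepted.append(fields)
        else:
            errors += 1
            if errors > max_error_number:
                return accepted, errors, True
    return accepted, errors, False


# Synthetic rows: one 9-field shape and one 11-field shape.
nine = ",".join(f"v{i}" for i in range(9))
eleven = ",".join(f"v{i}" for i in range(11))

rows, errors, paused = filter_by_field_count([nine, eleven, nine], 9, 10)
print(len(rows), errors, paused)
```

The sketch also shows the cost of the approach: the threshold has to be sized for the expected share of 11‑field records, and genuinely malformed 9‑field traffic is hidden inside the same error budget.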

4. Query Comparison

Using the routine load, about 130 million rows were loaded into a Doris table distributed across three BE nodes. The same volume of data was loaded into a ClickHouse sharded table on three CK nodes. Both systems ran the same aggregation over one hour of data: count the accesses per IP and return the top 10 IPs. The query took roughly 8 seconds on each platform. On ClickHouse, repeated runs were faster (for example, about 2 seconds when rerun after a 1‑minute interval), which is attributed to the OS page cache. Doris's latency stayed at around 8 seconds regardless of the spacing between queries.
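The article does not show the query text, so the following only illustrates the shape of the aggregation described above (restrict to one hour, count per IP, keep the top 10) on an in‑memory sample. The row layout, timestamps, and IP addresses are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_ips_in_hour(rows, hour_start, n=10):
    """Count accesses per IP within [hour_start, hour_start + 1h), top n.

    `rows` is an iterable of (timestamp, ip) pairs.
    """
    hour_end = hour_start + timedelta(hours=1)
    counts = Counter(ip for ts, ip in rows if hour_start <= ts < hour_end)
    return counts.most_common(n)


# Synthetic sample: three accesses inside the 10:00 hour, one outside it.
sample = [
    (datetime(2024, 1, 1, 10, 5), "10.0.0.1"),
    (datetime(2024, 1, 1, 10, 15), "10.0.0.1"),
    (datetime(2024, 1, 1, 10, 30), "10.0.0.2"),
    (datetime(2024, 1, 1, 11, 0), "10.0.0.2"),
]
top = top_ips_in_hour(sample, datetime(2024, 1, 1, 10, 0))
print(top)
```

In SQL terms this corresponds to a time‑range filter, a GROUP BY on the IP column, an ORDER BY on the count in descending order, and LIMIT 10, which is the kind of scan‑heavy query where page‑cache state noticeably affects repeated runs.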

Overall, while ClickHouse benefits from better caching on repeated queries, Doris provides comparable performance despite using less powerful hardware, and the routine load with max_error_number offers a practical solution for mixed‑schema Kafka streams.
