
How to Load Billion-Row Kafka Data into Doris and Compare Performance with ClickHouse

This article demonstrates how to ingest billions of Kafka records into Doris using Routine Load, compares the import process and query performance with ClickHouse’s Kafka engine tables, discusses configuration challenges such as field‑count filtering, and presents practical workarounds and benchmark results.

dbaplus Community

Choosing an Import Method

Like ClickHouse, Doris offers several data‑ingestion methods. For Kafka sources, the official documentation points to Routine Load as the most suitable option; Kafka is the only data source this feature supports.

Import Steps

1. Create a Doris OLAP table to hold the incoming data.

2. Define a Routine Load task that reads from the Kafka topic and streams data into the OLAP table.

3. Monitor the task progress with the command SHOW ROUTINE LOAD FOR ${load_name}.
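The three steps above can be sketched as statement templates. This is a minimal illustration, not the configuration used in the article: the database, table, column, job, topic, and broker names are hypothetical, the column list is abbreviated, and the property values are placeholders.

```python
# Sketch only: every identifier below (demo_db, access_log, the column names,
# the job name, topic, and broker address) is hypothetical.

# Step 1: an OLAP table to receive the data. The column list is abbreviated;
# a real table would declare every field of the incoming records.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS demo_db.access_log (
    event_time DATETIME,
    client_ip  VARCHAR(64),
    url        VARCHAR(1024)
)
DUPLICATE KEY(event_time, client_ip)
DISTRIBUTED BY HASH(client_ip) BUCKETS 16
PROPERTIES ("replication_num" = "3");
"""


def build_routine_load(job, table, columns, brokers, topic, separator=","):
    """Step 2: render a CREATE ROUTINE LOAD statement for a Kafka topic."""
    column_list = ", ".join(columns)
    return (
        f"CREATE ROUTINE LOAD {job} ON {table}\n"
        f'COLUMNS TERMINATED BY "{separator}",\n'
        f"COLUMNS({column_list})\n"
        "PROPERTIES (\n"
        '    "desired_concurrent_number" = "3"\n'
        ")\n"
        "FROM KAFKA (\n"
        f'    "kafka_broker_list" = "{brokers}",\n'
        f'    "kafka_topic" = "{topic}",\n'
        '    "property.kafka_default_offsets" = "OFFSET_BEGINNING"\n'
        ");"
    )


def show_routine_load(job):
    """Step 3: render the statement used to monitor the job."""
    return f"SHOW ROUTINE LOAD FOR {job};"


if __name__ == "__main__":
    job = "demo_db.kafka_access_log_job"
    print(CREATE_TABLE)
    print(build_routine_load(
        job, "access_log",
        ["event_time", "client_ip", "url"],
        "broker1:9092", "access_log_topic",
    ))
    print(show_routine_load(job))
```

Because the Doris frontend speaks the MySQL protocol, the rendered statements can be run from any MySQL-compatible client.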

Pain Points

The Kafka topic carries records in two schemas, one with 9 fields and one with 11, and only the 9-field records fit the target table. Routine Load has no option for filtering by field count: the documented columns_mapping, preceding_filter, and where_predicates settings work on column values, not on the number of fields in a record.

Consulting the Doris PMC confirmed that this capability is not yet available.

Workaround

Routine Load provides a max_error_number parameter that sets how many malformed rows the job will tolerate. Without it, the task fails on rows with 11 fields; with a sufficiently large value, the import proceeds and the offending rows are skipped.
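The effect of an error budget can be illustrated with a toy model. This is not Doris's implementation: according to the Doris documentation, the tolerance is evaluated per sampling window and a job that exceeds it is paused, whereas the sketch below simplifies this to a single counter and an exception. The function name and the sample rows are hypothetical.

```python
def load_with_tolerance(rows, expected_fields, max_error_number, separator=","):
    """Toy model of an error budget: keep rows with the expected field count,
    skip the others, and stop once the skipped rows exceed the budget."""
    loaded = []
    errors = 0
    for row in rows:
        fields = row.split(separator)
        if len(fields) == expected_fields:
            loaded.append(fields)
            continue
        errors += 1
        if errors > max_error_number:
            raise RuntimeError(
                f"{errors} error rows exceed max_error_number={max_error_number}"
            )
    return loaded, errors


if __name__ == "__main__":
    nine = ",".join(["v"] * 9)     # a record matching the table
    eleven = ",".join(["v"] * 11)  # a record in the other schema
    loaded, errors = load_with_tolerance([nine, eleven, nine], 9, max_error_number=1)
    print(len(loaded), "rows loaded,", errors, "skipped")
```

With a budget of 0, the same input stops at the first 11-field row, which mirrors the failure described above.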

Query Performance Comparison

Approximately 130 million rows were loaded into the Doris table across three BE nodes. An equivalent ClickHouse table was populated with the same data volume on three ClickHouse servers.

The benchmark query counts hourly IP accesses and returns the top 10 IPs.
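The article does not reproduce the query text, so the sketch below only illustrates the aggregation logic under one reading of that description: count accesses per (hour, IP) pair and keep the ten largest counts. The function name and the sample events are hypothetical.

```python
from collections import Counter
from datetime import datetime


def top_ips_per_hour(events, n=10):
    """Count accesses per (hour, ip) pair and return the n largest counts.

    events: iterable of (timestamp, ip) tuples, where timestamp is a datetime.
    """
    counts = Counter()
    for ts, ip in events:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(hour, ip)] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        (datetime(2023, 1, 1, 10, 5), "10.0.0.1"),
        (datetime(2023, 1, 1, 10, 20), "10.0.0.1"),
        (datetime(2023, 1, 1, 10, 40), "10.0.0.2"),
        (datetime(2023, 1, 1, 11, 15), "10.0.0.1"),
    ]
    for (hour, ip), count in top_ips_per_hour(sample):
        print(hour, ip, count)
```

In SQL terms this corresponds to a GROUP BY on the truncated hour and the IP column, ordered by the count in descending order with a limit of 10.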

[Figure: Doris query result]

[Figure: ClickHouse query result]

Both systems answered the query in about 8 seconds. ClickHouse showed a caching effect: repeating the query shortly after the first run took only about 2 seconds, whereas Doris’s query time stayed stable across runs.

Conclusion

ClickHouse marginally outperforms Doris in this scenario, though the ClickHouse cluster had roughly half the memory and CPU resources of the Doris cluster. The article highlights the practical steps, limitations, and tuning knobs needed to load massive Kafka streams into Doris.
