How to Load Billion-Row Kafka Data into Doris and Compare Performance with ClickHouse
This article demonstrates how to ingest billions of Kafka records into Doris using Routine Load, compares the import process and query performance with ClickHouse’s Kafka engine tables, discusses configuration challenges such as field‑count filtering, and presents practical workarounds and benchmark results.
Choosing an Import Method
Like ClickHouse, Doris offers several data‑ingestion methods. According to its official documentation, Routine Load is the best fit for Kafka sources, and Kafka is the only data source that Routine Load supports.
Import Steps
1. Create a Doris OLAP table to hold the incoming data.
2. Define a Routine Load task that reads from the Kafka topic and streams data into the OLAP table.
3. Monitor the task progress with the command SHOW ROUTINE LOAD FOR ${load_name}.
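The three steps above can be sketched as a small Python helper that renders the corresponding Doris statements. This is only an illustration under stated assumptions: the database, table, job, column, broker, and topic names are hypothetical placeholders, the article does not show its actual schema, and the bucket count, replica count, delimiter, and starting offset are arbitrary example values.

```python
def create_table_sql(db, table, columns, buckets=10, replicas=3):
    """Step 1: render a CREATE TABLE statement (duplicate-key model).

    columns is a list of (name, type) pairs. In this sketch the first
    column doubles as the sort key and the hash-distribution column.
    """
    col_defs = ",\n".join(f"    {name} {ctype}" for name, ctype in columns)
    key = columns[0][0]
    return (
        f"CREATE TABLE {db}.{table} (\n{col_defs}\n)\n"
        f"DUPLICATE KEY({key})\n"
        f"DISTRIBUTED BY HASH({key}) BUCKETS {buckets}\n"
        f'PROPERTIES ("replication_num" = "{replicas}");'
    )


def create_routine_load_sql(db, job, table, columns, brokers, topic, delimiter=","):
    """Step 2: render a CREATE ROUTINE LOAD statement that reads a Kafka topic."""
    col_list = ", ".join(name for name, _ in columns)
    return (
        f"CREATE ROUTINE LOAD {db}.{job} ON {table}\n"
        f'COLUMNS TERMINATED BY "{delimiter}",\n'
        f"COLUMNS({col_list})\n"
        "FROM KAFKA (\n"
        f'    "kafka_broker_list" = "{brokers}",\n'
        f'    "kafka_topic" = "{topic}",\n'
        '    "property.kafka_default_offsets" = "OFFSET_BEGINNING"\n'
        ");"
    )


def show_routine_load_sql(job):
    """Step 3: render the monitoring command for a job."""
    return f"SHOW ROUTINE LOAD FOR {job};"


if __name__ == "__main__":
    # Hypothetical two-column schema, shortened for readability.
    cols = [("event_time", "DATETIME"), ("client_ip", "VARCHAR(64)")]
    print(create_table_sql("demo_db", "access_log", cols))
    print(create_routine_load_sql("demo_db", "access_log_job", "access_log",
                                  cols, "broker1:9092", "access_topic"))
    print(show_routine_load_sql("access_log_job"))
```

The rendered statements would be run through any MySQL-protocol client connected to the Doris FE; the helper itself does not talk to a cluster.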
Pain Points
The Kafka topic contains two schemas: one with 9 fields and another with 11 fields. Doris’s Routine Load lacks fine‑grained filtering; the documented options columns_mapping, preceding_filter, and where_predicates cannot filter by field count.
The Doris PMC, when consulted, confirmed that this capability is not yet available.
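To make the mixed-schema problem concrete, the following sketch tallies how many fields each message in a sample carries. It is a diagnostic illustration only, not something Routine Load does: the function name is hypothetical, and the comma delimiter is an assumption because the article does not state the message format.

```python
from collections import Counter


def field_count_histogram(lines, delimiter=","):
    """Count how many sample lines have each number of delimited fields."""
    return Counter(len(line.split(delimiter)) for line in lines)


if __name__ == "__main__":
    # Synthetic sample: two 9-field rows and one 11-field row.
    sample = [
        ",".join(["v"] * 9),
        ",".join(["v"] * 9),
        ",".join(["v"] * 11),
    ]
    for fields, rows in sorted(field_count_histogram(sample).items()):
        print(f"{fields} fields: {rows} rows")
```

A histogram with more than one key, as in this synthetic sample, is the situation the article describes: rows of both shapes arrive on the same topic.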
Workaround
Routine Load provides a max_error_number parameter that allows a certain number of malformed rows to be ignored. Without this parameter the task fails on rows with 11 fields; with it the import proceeds, skipping the offending rows.
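The effect of max_error_number can be illustrated with a toy model. This is a simplified sketch, not Doris's implementation: to my understanding, Doris evaluates the error count over a sampling window and pauses the job when the limit is exceeded, whereas the hypothetical simulate_load function below counts errors over the whole input and assumes comma-delimited rows.

```python
def simulate_load(rows, expected_fields, max_error_number, delimiter=","):
    """Toy model of error tolerance during a load.

    Rows with the expected field count are loaded; other rows are skipped.
    Once the number of skipped rows exceeds max_error_number, the load
    stops and reports a PAUSED state.
    """
    loaded = 0
    skipped = 0
    for row in rows:
        if len(row.split(delimiter)) == expected_fields:
            loaded += 1
            continue
        skipped += 1
        if skipped > max_error_number:
            return {"state": "PAUSED", "loaded": loaded, "skipped": skipped}
    return {"state": "RUNNING", "loaded": loaded, "skipped": skipped}


if __name__ == "__main__":
    nine = ",".join(["v"] * 9)
    eleven = ",".join(["v"] * 11)
    rows = [nine, eleven, nine]
    # No tolerance: the first 11-field row stops the load.
    print(simulate_load(rows, expected_fields=9, max_error_number=0))
    # Tolerance of one row: the 11-field row is skipped and loading continues.
    print(simulate_load(rows, expected_fields=9, max_error_number=1))
```

In a real job the value would go in the PROPERTIES clause of CREATE ROUTINE LOAD, for example "max_error_number" = "1000", sized to the expected share of 11-field rows.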
Query Performance Comparison
Approximately 130 million rows were loaded into the Doris table across three BE (backend) nodes. An equivalent ClickHouse table was populated with the same volume of data on three ClickHouse servers.
The benchmark query counts hourly IP accesses and returns the top 10 IPs.
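The article does not reproduce the query text, so the following is only one plausible reading of it: group accesses by (hour, IP), count them, and keep the ten largest groups. The sketch expresses that logic in plain Python on in-memory data; the function name and the (timestamp, ip) input shape are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime


def top_ips_by_hour(events, n=10):
    """Return the n most frequent (hour, ip) pairs with their access counts.

    events is an iterable of (timestamp, ip) tuples, where timestamp is a
    datetime. Timestamps are truncated to the start of their hour.
    """
    counts = Counter()
    for ts, ip in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(hour, ip)] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    events = [
        (datetime(2023, 1, 1, 10, 5), "10.0.0.1"),
        (datetime(2023, 1, 1, 10, 40), "10.0.0.1"),
        (datetime(2023, 1, 1, 10, 41), "10.0.0.2"),
        (datetime(2023, 1, 1, 11, 2), "10.0.0.1"),
    ]
    for (hour, ip), hits in top_ips_by_hour(events):
        print(hour.isoformat(), ip, hits)
```

In SQL terms this corresponds to a GROUP BY on the truncated hour and the IP column, ordered by the count in descending order with LIMIT 10, which is the kind of aggregation both engines were timed on.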
Doris query result: (screenshot not reproduced)
ClickHouse query result: (screenshot not reproduced)
Both systems delivered query times around 8 seconds. ClickHouse showed a caching effect where a second query executed shortly after the first ran faster (≈2 seconds), while Doris’s query time remained stable.
Conclusion
ClickHouse marginally outperforms Doris in this scenario, though the ClickHouse cluster had roughly half the memory and CPU resources of the Doris cluster. The article highlights the practical steps, limitations, and tuning knobs needed to load massive Kafka streams into Doris.
dbaplus Community
