Importing Billions of Kafka Rows into Doris and Benchmarking Against ClickHouse
This article explains Doris's various data import methods, focuses on the routine load approach for Kafka streams, describes how to handle mixed‑schema topics using the max_error_number parameter, and compares query performance of a 130 million‑row dataset against ClickHouse, highlighting each system's strengths and limitations.
0. Choosing an Import Method
Doris supports multiple data ingestion mechanisms, similar to ClickHouse, but its official documentation lists a richer set of import options. The article skips a detailed comparison of these methods because the documentation already covers them.
1. Import Steps
To load Kafka data into Doris, you first create an OLAP table that will hold the data, then define a ROUTINE LOAD task that pulls data from the Kafka topic into that table. The routine load task is analogous to ClickHouse's materialized view on a Kafka engine table.
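The two steps above can be sketched as a pair of statements. The snippet below is only an illustration under stated assumptions: the database, table, job, column, broker, and topic names are hypothetical, the topic is assumed to carry comma-separated text, and the statements are rendered as strings rather than executed, so they should be checked against the Doris documentation for your version before use.

```python
# Hypothetical 9-column layout: the article does not name the real fields.
COLUMNS = [("event_time", "DATETIME"), ("client_ip", "VARCHAR(64)")] + [
    (f"field_{i}", "VARCHAR(255)") for i in range(3, 10)
]


def create_table_sql(db, table, columns, bucket_column, buckets=10):
    """Render DDL for a Doris OLAP table using the duplicate-key model."""
    cols = ",\n".join(f"    {name} {ctype}" for name, ctype in columns)
    first_key = columns[0][0]  # key columns must be a prefix of the schema
    return (
        f"CREATE TABLE IF NOT EXISTS {db}.{table} (\n{cols}\n)\n"
        f"DUPLICATE KEY({first_key})\n"
        f"DISTRIBUTED BY HASH({bucket_column}) BUCKETS {buckets}\n"
        'PROPERTIES ("replication_num" = "1");'
    )


def create_routine_load_sql(db, job, table, columns, brokers, topic,
                            max_error_number=0, sep=","):
    """Render a ROUTINE LOAD job that pulls a Kafka topic into the table."""
    col_list = ", ".join(name for name, _ in columns)
    return (
        f"CREATE ROUTINE LOAD {db}.{job} ON {table}\n"
        f'COLUMNS TERMINATED BY "{sep}",\n'
        f"COLUMNS({col_list})\n"
        "PROPERTIES (\n"
        f'    "max_error_number" = "{max_error_number}"\n'
        ")\n"
        "FROM KAFKA (\n"
        f'    "kafka_broker_list" = "{brokers}",\n'
        f'    "kafka_topic" = "{topic}"\n'
        ");"
    )


table_sql = create_table_sql("example_db", "access_log", COLUMNS, "client_ip")
load_sql = create_routine_load_sql(
    "example_db", "access_log_job", "access_log", COLUMNS,
    brokers="broker1:9092", topic="access_log", max_error_number=1000,
)
print(table_sql)
print(load_sql)
```

The rendered statements would be submitted through any MySQL-protocol client connected to the Doris FE. The max_error_number value of 1000 is a placeholder, not a figure from the test.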
2. Pain Points
The Kafka topic used in the test contains two schemas: one with 11 fields and another with 9 fields. Doris's routine load does not provide a built-in way to filter records by field count. The documented filtering options (columns_mapping, preceding_filter, and where_predicates) only allow column re-ordering or value-based conditions, so they cannot distinguish the two schemas.
3. Workaround
Since Doris cannot filter by field count, the author uses the max_error_number parameter of routine load. This parameter sets how many error rows the job tolerates; here, an error row is one whose field count does not match the target table. Rows within that allowance are dropped instead of aborting the load. With max_error_number set high enough, the 9-field records are ingested while the 11-field records are skipped as errors.
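To choose a value for max_error_number, it helps to know what share of the topic would be rejected. The following is a rough sketch, not part of the original test: it counts fields per row in a sample of messages, assuming comma-separated text and made-up sample rows. Doris applies its own accounting for error rows, so treat the result only as an estimate.

```python
from collections import Counter


def field_count_histogram(lines, sep=","):
    """Map each observed field count to the number of rows that have it."""
    return Counter(len(line.rstrip("\n").split(sep)) for line in lines)


def mismatched_rows(lines, expected_fields, sep=","):
    """Count rows whose field count differs from the target table's."""
    histogram = field_count_histogram(lines, sep)
    return sum(n for fields, n in histogram.items() if fields != expected_fields)


# Made-up sample: two 9-field rows and three 11-field rows.
sample = [
    ",".join(f"a{i}" for i in range(9)),
    ",".join(f"b{i}" for i in range(11)),
    ",".join(f"c{i}" for i in range(9)),
    ",".join(f"d{i}" for i in range(11)),
    ",".join(f"e{i}" for i in range(11)),
]

histogram = field_count_histogram(sample)
errors = mismatched_rows(sample, expected_fields=9)
print(dict(histogram))
print(errors)
```

If the rejected share is large, as it would be in this sample, the threshold has to be correspondingly generous for the job to keep running.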
4. Query Comparison
Using the routine load, about 130 million rows were loaded into a Doris table distributed across three BE nodes. The same volume of data was loaded into a ClickHouse sharded table on three CK nodes. Both systems ran the same hourly aggregation query, which counts accesses per IP and returns the top 10 IPs. The first run took roughly 8 seconds on each platform. On ClickHouse, repeated runs were faster (e.g., about 2 seconds when rerun after a 1-minute interval), which the author attributes to OS page-cache effects. Doris's query latency stayed at around 8 seconds regardless of the spacing between queries.
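The article does not reproduce the benchmark query, so the following is only a guess at its shape: a top-10 count of accesses per IP within a one-hour window, written in plain SQL that both systems should accept. The table and column names (access_log, client_ip, event_time) and the chosen hour are hypothetical.

```python
from datetime import datetime, timedelta


def top_ips_sql(table, hour_start, limit=10):
    """Render a query for the busiest IPs within one hour of hour_start."""
    hour_end = hour_start + timedelta(hours=1)
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        "SELECT client_ip, COUNT(*) AS hits\n"
        f"FROM {table}\n"
        f"WHERE event_time >= '{hour_start.strftime(fmt)}'\n"
        f"  AND event_time < '{hour_end.strftime(fmt)}'\n"
        "GROUP BY client_ip\n"
        "ORDER BY hits DESC\n"
        f"LIMIT {limit};"
    )


query = top_ips_sql("access_log", datetime(2024, 1, 1, 9, 0, 0))
print(query)
```

When timing such a query, running it several times at different intervals, as the author did, separates cold-read latency from cache-assisted latency.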
Overall, while ClickHouse benefits from better caching on repeated queries, Doris provides comparable performance despite using less powerful hardware, and the routine load with max_error_number offers a practical solution for mixed‑schema Kafka streams.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
