How Doris Powered Zuoyebang’s Real‑Time Data Warehouse for Faster Insights

Zuoyebang’s data team replaced fragmented, slow query solutions with Apache Doris and built a unified real‑time data warehouse on it. The new system cut query latency from hours to seconds, simplified data modeling, and improved reliability across diverse business scenarios, and it integrates with Flink, Kafka, and ES through a unified API.

Zuoyebang Tech Team

Jun 7, 2022

Background

In the big‑data ecosystem, data analysis systems are critical for turning data into business value. Zuoyebang’s data team uses Apache Doris, an MPP database, to provide fast, easy‑to‑use analytics.

Challenges with Existing Solutions

Previous approaches (Kafka+Spark, direct ES queries, custom APIs) suffered from high maintenance cost, poor scalability, limited performance, and complex joins.

Real‑Time Query System Architecture

After months of exploration, the team built a Doris‑based real‑time query system. The architecture moves data from ingestion to a Flink‑SQL processing layer, then to Doris for storage and query. Kafka streams are synchronized to the query engine, and OpenAPI provides a unified query interface.

Doris Features

Doris consists of a Frontend (FE) and a Backend (BE). The FE handles SQL parsing, query planning, and metadata; the BE executes the plan and stores the data. Doris speaks the MySQL protocol, uses columnar storage, supports bitmap indexes, and does not depend on third‑party components.

How Doris Solves Business Scenarios

For traffic‑type metrics (PV, UV), the team uses the Doris aggregation model with roll‑up tables to pre‑aggregate daily data, which reduced query latency from minutes or hours to seconds. Teacher‑workbench data requires flexible queries over arbitrary columns, so it is served through Doris on ES, which keeps ES’s arbitrary‑column search while benefiting from Doris’s performance optimizations.
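The pre‑aggregation idea can be illustrated without a database. The following is a minimal Python sketch, not Doris code: the event rows, the column names, and the helper functions are all hypothetical. It shows why a roll‑up keyed only by day answers a daily PV/UV query from fewer rows than the base table, and why UV has to be merged as a set (Doris offers bitmap aggregation for this) rather than summed.

```python
from collections import defaultdict

# Hypothetical detail rows: (day, page, user_id). Names are illustrative only.
events = [
    ("2022-06-01", "/home", "u1"),
    ("2022-06-01", "/home", "u2"),
    ("2022-06-01", "/home", "u1"),
    ("2022-06-01", "/course", "u1"),
    ("2022-06-02", "/home", "u3"),
]


def aggregate(rows):
    """Base table: rows sharing (day, page) are merged at load time.

    PV is summed; the users behind UV are unioned, because distinct
    counts cannot simply be added together.
    """
    table = defaultdict(lambda: {"pv": 0, "users": set()})
    for day, page, user in rows:
        cell = table[(day, page)]
        cell["pv"] += 1
        cell["users"].add(user)
    return dict(table)


def rollup_by_day(base_table):
    """Roll-up: re-aggregate the base table on a coarser key (day only)."""
    out = defaultdict(lambda: {"pv": 0, "users": set()})
    for (day, _page), cell in base_table.items():
        out[day]["pv"] += cell["pv"]
        out[day]["users"] |= cell["users"]
    return dict(out)


base = aggregate(events)
daily = rollup_by_day(base)


def daily_metrics(day):
    """Answer a daily PV/UV query from the roll-up alone."""
    cell = daily[day]
    return cell["pv"], len(cell["users"])


print(daily_metrics("2022-06-01"))  # (4, 2)
```

Here five detail rows collapse to three base rows and two roll‑up rows; on production traffic the reduction is what moves a query from scanning raw events to reading a handful of pre‑merged rows.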

Performance Optimizations

Doris on ES bypasses the ES coordinating node and scans doc IDs directly on the data nodes. Combined with columnar reads, predicate push‑down, and locality‑aware data transfer, this makes it tens of times faster than Presto on ES.
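Predicate push‑down means that filters are evaluated inside ES instead of after the rows have been shipped to the query engine. As a rough sketch of the idea, the hypothetical helper below translates simple (column, operator, value) predicates into an ES bool/filter query and keeps anything it cannot translate as a residual filter for the engine. The function name and the set of supported operators are assumptions for illustration, not the actual Doris implementation.

```python
def push_down(predicates):
    """Split predicates into an ES filter query and a residual list.

    predicates: list of (column, op, value) triples.
    Returns (es_query, residual), where residual holds the predicates
    that must still be evaluated after the scan.
    """
    range_ops = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}
    clauses, residual = [], []
    for column, op, value in predicates:
        if op == "=":
            clauses.append({"term": {column: value}})
        elif op == "in":
            clauses.append({"terms": {column: list(value)}})
        elif op in range_ops:
            clauses.append({"range": {column: {range_ops[op]: value}}})
        else:
            # Not expressible in this sketch: leave it for the query engine.
            residual.append((column, op, value))
    return {"query": {"bool": {"filter": clauses}}}, residual


# Hypothetical teacher-workbench filter.
es_query, residual = push_down([
    ("teacher_id", "=", 42),
    ("score", ">=", 60),
    ("name", "regexp", "^A"),
])
```

The more predicates land in the filter clause, the fewer documents leave the ES data nodes, which is where most of the saving comes from.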

Implementation Details

Unified metadata management using env, db, table concepts; JSON‑schema defines column constraints.
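A small sketch of what schema‑driven column constraints could look like. The registry contents, table name, and validate function are hypothetical, and the validator only understands a small subset of JSON Schema keywords (type, required, properties, minimum, maxLength); a production system would more likely rely on a full JSON Schema validator.

```python
# Hypothetical registry keyed by (env, db, table).
REGISTRY = {
    ("prod", "ads", "teacher_workbench"): {
        "type": "object",
        "required": ["teacher_id", "lesson_cnt"],
        "properties": {
            "teacher_id": {"type": "integer", "minimum": 1},
            "lesson_cnt": {"type": "integer", "minimum": 0},
            "city": {"type": "string", "maxLength": 32},
        },
    }
}

# Python types accepted for each supported schema type.
TYPES = {"integer": int, "string": str, "number": (int, float)}


def validate(env, db, table, row):
    """Return a list of constraint violations for one row (empty if valid)."""
    schema = REGISTRY[(env, db, table)]
    errors = []
    for name in schema.get("required", []):
        if name not in row:
            errors.append(f"{name}: missing required column")
    for name, value in row.items():
        rule = schema["properties"].get(name)
        if rule is None:
            errors.append(f"{name}: unknown column")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, TYPES[rule["type"]]):
            errors.append(f"{name}: expected {rule['type']}")
            continue
        if "minimum" in rule and value < rule["minimum"]:
            errors.append(f"{name}: below minimum {rule['minimum']}")
        if "maxLength" in rule and len(value) > rule["maxLength"]:
            errors.append(f"{name}: longer than {rule['maxLength']}")
    return errors


ok = validate("prod", "ads", "teacher_workbench",
              {"teacher_id": 7, "lesson_cnt": 3, "city": "Beijing"})
bad = validate("prod", "ads", "teacher_workbench", {"teacher_id": 0})
```

Keeping the constraints next to the (env, db, table) identity means the same definition can drive both table creation and write‑time validation.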

Standardized table creation, deletion, and modification to ensure performance and consistency.

OpenAPI provides unified read/write interfaces, automatic data validation, versioning, and monitoring of QPS and latency.
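Monitoring QPS and latency at the API layer can be as simple as a sliding window over recent calls. The class below is an illustrative sketch only; QueryMonitor and its methods are hypothetical names, and timestamps are passed in explicitly so the behavior is easy to reason about.

```python
import time
from collections import deque


class QueryMonitor:
    """Sliding-window QPS and average latency for an API endpoint."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.samples = deque()  # (finished_at, latency_seconds)

    def _evict(self, now):
        # Drop samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= now - self.window:
            self.samples.popleft()

    def record(self, finished_at, latency):
        self.samples.append((finished_at, latency))
        self._evict(finished_at)

    def qps(self, now):
        self._evict(now)
        return len(self.samples) / self.window

    def avg_latency(self, now):
        self._evict(now)
        if not self.samples:
            return 0.0
        return sum(lat for _, lat in self.samples) / len(self.samples)

    def timed(self, fn, *args, **kwargs):
        """Run fn and record its latency, even if it raises."""
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            end = time.monotonic()
            self.record(end, end - start)


monitor = QueryMonitor(window_seconds=10.0)
monitor.record(finished_at=1.0, latency=0.25)
monitor.record(finished_at=2.0, latency=0.75)
```

In a real service, timed would wrap each read or write handler, and the two gauges would be exported to whatever metrics system is already in place.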

Separate Kafka topics for real‑time and offline repair streams allow quota‑based consumption, preventing data overwrite and ensuring timeliness.
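Quota‑based consumption can be sketched with two in‑memory queues standing in for the two topics. The function name, the batch size, and the repair cap are assumptions chosen for illustration; the point is only that a large repair backlog can never take more than its share of a write batch, so real‑time records keep flowing.

```python
from collections import deque


def next_batch(realtime, repair, batch_size=10, repair_cap=2):
    """Take one write batch from the two streams.

    At most repair_cap records come from the repair queue; the rest of
    the batch is reserved for real-time records.
    Returns (realtime_records, repair_records).
    """
    taken_repair = [repair.popleft()
                    for _ in range(min(repair_cap, len(repair)))]
    room = batch_size - len(taken_repair)
    taken_realtime = [realtime.popleft()
                      for _ in range(min(room, len(realtime)))]
    return taken_realtime, taken_repair


# A long repair backlog does not starve the real-time stream.
realtime_q = deque(range(100))
repair_q = deque(f"fix-{i}" for i in range(50))
rt_records, fix_records = next_batch(realtime_q, repair_q)
```

With a single shared topic, the same backlog would sit in front of fresh records and delay them, which is the timeliness problem the separate topics avoid.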

Micro‑batch scheduling joins real‑time data with dimension tables, automatically handling missing dimensions and enabling automatic data repair.
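One way to picture the micro‑batch join: each run joins the new facts plus any rows parked by earlier runs, and rows whose dimension has still not arrived are parked again. The sketch below uses hypothetical field names (course_id, course_name) and plain dictionaries; it illustrates the retry pattern, not the team's actual scheduler.

```python
def run_micro_batch(facts, dim_table, parked):
    """Join one micro-batch against a dimension table.

    facts:     new fact rows for this batch
    dim_table: mapping from course_id to dimension attributes
    parked:    rows from earlier batches whose dimension was missing
    Returns (joined_rows, still_missing_rows).
    """
    joined, still_missing = [], []
    for fact in parked + facts:  # retry parked rows first
        dim = dim_table.get(fact["course_id"])
        if dim is None:
            still_missing.append(fact)  # dimension not there yet
        else:
            joined.append({**fact, **dim})
    return joined, still_missing


dims = {1: {"course_name": "Math"}}

# Batch 1: course 2 has no dimension row yet, so its fact is parked.
joined1, parked1 = run_micro_batch(
    [{"course_id": 1, "pv": 3}, {"course_id": 2, "pv": 5}], dims, [])

# The dimension arrives late; the next batch repairs the parked row.
dims[2] = {"course_name": "Physics"}
joined2, parked2 = run_micro_batch([], dims, parked1)
```

Because parked rows are retried on every run, a late dimension is picked up without manual intervention, at the cost of a delay of up to one batch interval.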

Best Practices and Recommendations

Avoid excessive joins in ADS layer; use broadcast or shuffle joins only when necessary.

Increase ES scan timeout for large tables under high load.

Optimal batch size around 4096 for maximum scan throughput.

Adjust fragment_instance_num (presumably the Doris session variable parallel_fragment_exec_instance_num) when bitmap deduplication slows queries.

Run Doris processes under supervisor and enable core dumps to make operations and troubleshooting easier.

Prefer master branch for timely bug fixes but monitor community issues.

Future Work

Current limitations include join latency on large ES tables, limited dynamic partition control for ES tables, and lack of online schema changes. Plans include improving join performance, adding dynamic partition support, and extending predicate push‑down.

Q&A

Q1: How does Doris on ES compare with SparkSQL on ES? Doris offers better low‑latency performance and simpler SQL compatibility, while SparkSQL is better suited to high‑throughput streaming.

Q2: Doris does not support Hive Metastore; metadata for Flink‑SQL is managed separately in the team’s metadata system.

Q3: The internal _version field is used for data freshness and is set automatically by the data pipeline; it differs from HBase versioning.
