Building a Real-Time Data and User Profiling Architecture with Apache Doris at Zhihu
The article details Zhihu's data empowerment team's design and implementation of a low‑cost, high‑response real‑time data platform built on Apache Doris, covering real‑time business metrics, algorithm features, and user profiling, and explains the challenges, architectural choices, tooling, performance gains, and future directions.
Zhihu's data empowerment team created a real‑time data architecture based on Apache Doris to support three core business flows: real‑time business data, real‑time algorithm features, and user profiling, dramatically improving the timeliness of hotspot detection, reducing audience targeting costs, and boosting algorithm accuracy.
The platform adopts a Lambda architecture, using Doris for minute‑level batch processing and Flink for second‑level stream processing, and integrates a suite of tools for data integration, scheduling, and quality monitoring to handle complex de‑duplication, multi‑source joins, and high‑frequency updates.
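To make the Lambda split concrete, the sketch below shows how a serving layer might merge a batch view with a speed view at query time. It is a minimal illustration and not Zhihu's implementation: the in-memory counters stand in for Doris (batch) and Flink (stream), and the names `batch_view`, `speed_view`, `serve`, and the `watermark` cutoff are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: (event_time_in_seconds, metric_key)
Event = Tuple[int, str]


def batch_view(events: Iterable[Event], watermark: int) -> Counter:
    """Count events strictly before the watermark (stand-in for the batch layer)."""
    return Counter(key for ts, key in events if ts < watermark)


def speed_view(events: Iterable[Event], watermark: int) -> Counter:
    """Count events at or after the watermark (stand-in for the stream layer)."""
    return Counter(key for ts, key in events if ts >= watermark)


def serve(batch: Counter, speed: Counter) -> Dict[str, int]:
    """Merge both views so a query sees complete, up-to-date counts."""
    merged = Counter(batch)
    merged.update(speed)  # adds counts key by key
    return dict(merged)


events = [(10, "q1"), (20, "q1"), (70, "q2"), (80, "q1")]
watermark = 60  # assumed: the batch layer has processed everything before t=60

result = serve(batch_view(events, watermark), speed_view(events, watermark))
print(result)
```

Because each event falls on exactly one side of the watermark, the merged counts neither drop nor double-count events, which is the property the two layers have to preserve between them.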
Key challenges addressed include achieving sub‑second response for audience selection, ensuring 10‑minute freshness for algorithm features, handling massive data volumes (over 900 billion tag records), and providing reliable data quality checks across the pipeline.
Solutions involve partitioning data and workloads, tuning Doris on both the ingestion side (Routine Load) and the query side (Runtime Filters for joins), leveraging bitmap operations for fast set calculations, and building automated monitoring and alerting to reduce manual intervention.
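The bitmap idea can be illustrated with a small sketch: each tag maps to a bitmap of user IDs, so audience selection becomes bitwise AND, OR, and NOT, and audience estimation becomes a bit count. This is an assumption-laden toy, not the article's code: it uses Python integers as bitmaps, and the tag names and user IDs are invented. In Doris itself the equivalent work would be done with its bitmap functions (for example `bitmap_and` and `bitmap_count`).

```python
from typing import Dict, Iterable, List


def build_bitmap(user_ids: Iterable[int]) -> int:
    """Pack non-negative user IDs into one integer, one bit per user."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap


def count_users(bitmap: int) -> int:
    """Audience size is the number of set bits."""
    return bin(bitmap).count("1")


def to_user_ids(bitmap: int) -> List[int]:
    """Expand a bitmap back into a sorted list of user IDs."""
    ids = []
    uid = 0
    while bitmap:
        if bitmap & 1:
            ids.append(uid)
        bitmap >>= 1
        uid += 1
    return ids


# Hypothetical tags and user IDs, for illustration only.
tag_bitmaps: Dict[str, int] = {
    "likes_tech": build_bitmap([1, 2, 3, 5, 8]),
    "active_7d": build_bitmap([2, 3, 4, 8, 9]),
    "blocked": build_bitmap([3]),
}

# Users who like tech AND were active in the last 7 days AND are NOT blocked.
audience = (
    tag_bitmaps["likes_tech"]
    & tag_bitmaps["active_7d"]
    & ~tag_bitmaps["blocked"]
)

print(count_users(audience), to_user_ids(audience))
```

The point of the representation is that combining tags never touches individual rows: the cost depends on the bitmap size, not on how many tag records produced it.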
Operational results show daily ingestion of 900 billion rows within three hours, audience estimation in under one second, and end‑to‑end processing times reduced from days to minutes, delivering measurable business benefits such as higher exposure, conversion rates, and reduced engineering effort.
Future work focuses on further reducing latency, enhancing profiling tools, and expanding real‑time capabilities to meet emerging business needs.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.