Building Real‑Time User Profiles at Zhihu with Apache Doris: A Practical Guide
Zhihu's data‑empowerment team built a low‑cost, fast‑responding real‑time data architecture on Apache Doris. It serves business analytics, algorithm features, and user profiling, improving data timeliness, cutting targeting costs, and lifting key performance metrics across multiple services.
Background
Zhihu’s data‑empowerment team built a real‑time data platform on Apache Doris to provide low‑cost, high‑response, stable and flexible services for three core streams: real‑time business metrics, algorithm features, and user profiling.
Goals
Deliver instant business indicators for hotspot and potential content detection.
Generate diverse real‑time algorithm features to improve DAU, retention and payment metrics.
Support multi‑dimensional, multi‑type user segmentation and analysis to reduce targeting cost.
Challenges
Fast, accurate and convenient user‑segmentation tools are required.
Real‑time user‑behavior streams and algorithm updates must be refreshed within minutes (e.g., 10 min).
Data integration involves high‑cost joins and billions of rows per day.
Latency constraints: user segmentation must return in seconds, and cross‑tag TGI (Target Group Index) analysis in about 10 minutes.
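The source does not spell out how TGI is computed. The sketch below uses the conventional Target Group Index definition (the share of a segment that carries a trait, divided by the share of the whole population that carries it, times 100), which is an assumption about what the team measures. The function name and the counts are hypothetical.

```python
def tgi(segment_with_trait: int, segment_total: int,
        overall_with_trait: int, overall_total: int) -> float:
    """Conventional Target Group Index.

    100 means the segment carries the trait at the same rate as the
    overall population; above 100 means the trait is over-represented.
    """
    if segment_total <= 0 or overall_total <= 0 or overall_with_trait <= 0:
        raise ValueError("totals and overall_with_trait must be positive")
    segment_share = segment_with_trait / segment_total
    overall_share = overall_with_trait / overall_total
    return 100.0 * segment_share / overall_share


# Hypothetical counts: 25% of the segment has the trait versus 12.5% overall.
example_tgi = tgi(250, 1_000, 2_500, 20_000)
```

A cross‑tag TGI report repeats this calculation for every tag against the chosen segment, which is why the counts behind it need to come back quickly.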
Architecture
The system is organized into four layers: Application, Business‑Model, Business‑Tool, and Infrastructure. Data flows from source systems through a real‑time integration layer, a scheduling layer, and a data‑quality center before being consumed by applications and profiling services.
Implementation
Real‑time Business Data
Provides instant business metrics for hotspot and potential‑content detection, reducing backend script maintenance.
Offers complex external indicators to improve user experience.
Real‑time Algorithm Features
Generates diverse algorithm features from real‑time data, collaborating with recommendation teams.
Ensures feature updates within 10 minutes to maintain recommendation quality.
User Profiling (DMP)
Handles 200+ tags and >900 billion tag‑user records.
Uses bitmap‑based calculations, parallel fragment execution, and runtime filters to accelerate joins.
Places the bitmap tables in colocate groups so that joins run locally on each node and avoid a network shuffle.
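To illustrate why bitmaps make segmentation cheap, here is a minimal sketch that uses Python integers as bitmaps, one bit per user ID. It is not Doris code: Doris provides its own bitmap type and functions such as bitmap_and and bitmap_count. The tag names and user IDs below are hypothetical.

```python
from typing import Dict, Iterable, List


def to_bitmap(user_ids: Iterable[int]) -> int:
    """Build a bitmap in which bit i is set when user i carries the tag."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap


def cardinality(bitmap: int) -> int:
    """Count the users in a bitmap (number of set bits)."""
    return bin(bitmap).count("1")


def members(bitmap: int) -> List[int]:
    """List the user IDs present in a bitmap, in ascending order."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical tag -> users mapping, one bitmap per tag.
tag_bitmaps: Dict[str, int] = {
    "city=beijing": to_bitmap([1, 2, 3, 5, 8]),
    "interest=tech": to_bitmap([2, 3, 4, 8, 9]),
    "paid_member": to_bitmap([3, 8, 10]),
}

# Segment: Beijing users interested in tech who are not paid members.
segment = (
    tag_bitmaps["city=beijing"]
    & tag_bitmaps["interest=tech"]
    & ~tag_bitmaps["paid_member"]
)
segment_size = cardinality(segment)
segment_users = members(segment)
```

Each AND, OR, or NOT touches one compact bitmap per tag rather than one row per user, so combining many tags stays inexpensive even at very large user counts.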
Tooling Layer
Data Integration: Configurable pipelines for Kafka, Pulsar, protobuf, etc., with custom ingestion logic.
Scheduling: Supports Broker Load and Routine Load; dependency checks based on Kafka offsets prevent premature execution.
Data Quality Center: Monitors completeness, consistency, accuracy, uniqueness and timeliness; provides alerts and automated remediation.
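The offset‑based dependency check in the scheduling layer can be pictured as follows. This is a minimal sketch under assumptions: the source does not describe the scheduler's interface, so the function name, the dictionary shapes, and the offsets are all hypothetical. The idea is that a downstream task may start only once every Kafka partition has been consumed up to the offset that marks the end of the task's input window.

```python
from typing import Dict


def offsets_reached(consumed: Dict[int, int], required: Dict[int, int]) -> bool:
    """Return True when every required partition has been consumed far enough.

    consumed: partition -> last offset already loaded into the warehouse
    required: partition -> offset that closes the task's input window
    A partition missing from `consumed` counts as not started.
    """
    return all(
        consumed.get(partition, -1) >= offset
        for partition, offset in required.items()
    )


# Hypothetical offsets for a two-partition topic.
required_offsets = {0: 100, 1: 250}
still_loading = offsets_reached({0: 120, 1: 249}, required_offsets)
ready_to_run = offsets_reached({0: 120, 1: 250}, required_offsets)
```

Gating on offsets rather than on wall‑clock time means a slow or lagging ingestion job delays the downstream task instead of letting it run on incomplete data.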
Performance Results
Daily ingestion of >900 billion rows completed within 3 hours.
User segmentation latency P95 ≈ 985 ms; full audience generation ≈ 5 s.
User analysis completed within 5 minutes.
Runtime filter added in Doris 0.14 reduced join time from >40 s to ~10 s.
Parallel fragment execution and bitmap partitioning lowered TGI calculation time to the 5‑minute target.
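The effect of a runtime filter on a join can be sketched in plain Python. This is a conceptual illustration, not Doris internals: the small (build) side of the join is summarized in a Bloom filter, and the large (probe) side is pruned with that filter before the actual join. The class, the function, and the sample rows are hypothetical.

```python
import hashlib
from typing import Dict, Iterator, List, Sequence, Tuple


class BloomFilter:
    """Tiny Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: int) -> Iterator[int]:
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: int) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: int) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def join_with_runtime_filter(
    build_rows: Sequence[Tuple[int, str]],
    probe_rows: Sequence[Tuple[int, str]],
) -> Tuple[List[Tuple[int, str, str]], int]:
    """Hash-join two row sets, pruning the probe side with a Bloom filter first.

    Returns the joined rows and the number of probe rows that survived pruning.
    """
    bloom = BloomFilter()
    build_index: Dict[int, List[str]] = {}
    for key, value in build_rows:
        bloom.add(key)
        build_index.setdefault(key, []).append(value)

    # Rows whose key cannot match are dropped before the join sees them.
    survivors = [(k, v) for k, v in probe_rows if bloom.might_contain(k)]
    joined = [
        (k, build_value, probe_value)
        for k, probe_value in survivors
        for build_value in build_index.get(k, [])
    ]
    return joined, len(survivors)


small_side = [(1, "a"), (2, "b")]
large_side = [(i, f"p{i}") for i in range(100)]
joined_rows, surviving_probe_rows = join_with_runtime_filter(small_side, large_side)
```

Most of the 100 probe rows are discarded by the cheap filter check, so the join itself handles only a handful of candidates. The same principle is what shortens joins when one side is much smaller than the other.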
Key Optimizations
Split large Broker Load jobs into >1000 parallel tasks to improve import throughput.
Set parallel_fragment_exec_instance_num to increase concurrency for bitmap operations.
Enabled the Bloom‑filter‑based runtime filter to prune join keys before the join.
Adjusted load parameters for Routine Load to meet strict latency requirements.
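Splitting one large import into many smaller jobs can be sketched as below. The source says only that large Broker Load jobs were split into more than 1000 parallel tasks; how the split was done is not described, so the round‑robin strategy, the function name, and the file paths here are assumptions made for illustration.

```python
from typing import List, Sequence


def split_into_tasks(paths: Sequence[str], num_tasks: int) -> List[List[str]]:
    """Distribute input files round-robin across at most `num_tasks` load tasks."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    if not paths:
        return []
    num_tasks = min(num_tasks, len(paths))
    # Slice i takes every num_tasks-th file starting at index i.
    return [list(paths[i::num_tasks]) for i in range(num_tasks)]


# Hypothetical input files for one day's import.
input_files = [f"part-{i:05d}" for i in range(10)]
load_tasks = split_into_tasks(input_files, 3)
task_sizes = [len(task) for task in load_tasks]
```

Each resulting file group would then be submitted as its own load job, so a failure retries a small slice rather than the whole import, and the cluster can work on many slices at once.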
Future Work
Strengthen the tooling layer to further lower the cost of real‑time development.
Extend data‑quality tooling to cover profiling data and sub‑second latency scenarios.
Explore sub‑5‑minute and sub‑second processing pipelines for highly time‑sensitive use cases.
Develop deeper user‑understanding tools to accelerate insight discovery and value creation.
dbaplus Community
