How Zhihu Built a Scalable DMP: Architecture, Data Pipelines, and Real‑Time Targeting
This article details Zhihu's Data Management Platform (DMP), covering the business problems it solves, the end‑to‑end workflow, feature taxonomy, system architecture, data pipelines for batch and streaming, audience targeting processes, performance challenges, and future technical directions.
Background
Zhihu’s product ecosystem required a unified Data Management Platform (DMP) to centralize user‑level feature data, enable precise audience segmentation, and support data‑driven product development.
Business Process
The DMP supports three closed‑loop operational models:
In‑site operation loop: content‑driven, activity‑driven, and user‑driven campaigns executed within the platform.
In‑site to out‑site loop: growth‑driven advertising placed externally, with performance feedback collected.
Out‑site to in‑site loop: external advertising brings users back into the platform for further targeting.
Feature Hierarchy
Features are organized into three layers:
Level 1: 8 primary groups.
Level 2: 40 secondary groups.
Tag groups: 120 groups (e.g., gender, device brand, interest topics) containing roughly 2.5 million distinct tags.
The platform stores about 1.1 billion user‑x‑tag records (≈110 billion feature instances) and supports billions of data points for downstream analysis.
System Architecture
The DMP is divided into three module categories:
External modules: high‑stability, high‑concurrency APIs; lightweight UI; configurable back‑end to minimize development effort.
Business modules: scalable crowd selection, insight, and generalization capabilities; new features or rules can be added with near‑zero cost.
Support modules: feature production, ID mapping, task orchestration, and storage designed for horizontal scaling and cost‑effective growth.
Function Inventory
Since launch the platform has delivered:
More than 50,000 crowd‑targeting operations.
More than 400 crowd‑insight analyses.
More than 60 crowd‑generalization tasks.
Feature Data Pipeline
Offline pipeline (Spark):
Hive → Feature extraction → Offline tags → Mapping → Doris / Elasticsearch / HDFS
Real‑time pipeline (Flink):
Kafka → Feature extraction → Real‑time tags → Mapping → Doris / Elasticsearch / HDFS
Storage Layer
Doris:
User‑x‑Tag table – ~1.1 billion rows.
ID‑Mapping wide table – ~850 million rows.
Elasticsearch: Tag dictionary for search – ~2.5 million entries.
Daily data throughput reaches 2.x TB, accumulating to ~11 TB over a five‑day window (offline + real‑time).
Audience Targeting Workflow
The end‑to‑end process consists of:
Tag search.
Tag selection.
Crowd estimation (≤ 1 s).
Crowd selection (≤ 1 min).
Seed‑crowd upload (optional).
Crowd generalization.
Typical pipelines include tag‑to‑cart → selection, seed‑crowd → generalization, and historical‑effect‑crowd → insight → re‑tag → selection.
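The relationship between tag selection, estimation, and selection can be illustrated with a small sketch. It assumes a selected set of tags is expressed as a nested rule of and / or / not over tag names; the rule format, the tag names, and the evaluate function are hypothetical and are not taken from Zhihu's implementation. Estimation only needs the size of the resulting set, while selection needs the members themselves.

```python
def evaluate(rule, index, all_users):
    """Recursively evaluate a nested tag rule into a set of user IDs.

    A rule is either a tag name (str) or a tuple (op, operand, ...),
    where op is "and", "or", or "not". This format is an assumption.
    """
    if isinstance(rule, str):
        return index.get(rule, set())
    op, *operands = rule
    sets = [evaluate(r, index, all_users) for r in operands]
    if op == "and":
        return set.intersection(*sets)
    if op == "or":
        return set.union(*sets)
    if op == "not":
        return all_users - sets[0]
    raise ValueError(f"unknown operator: {op}")


# Hypothetical tag -> user IDs index and user universe.
index = {
    "gender:female": {1, 2, 3},
    "brand:apple": {2, 3, 4},
    "topic:finance": {3, 5},
}
all_users = {1, 2, 3, 4, 5}

rule = (
    "and",
    "gender:female",
    ("or", "brand:apple", "topic:finance"),
    ("not", "topic:finance"),
)

crowd = evaluate(rule, index, all_users)  # selection: the actual members
estimate = len(crowd)                     # estimation: only the count
```

In a production system the two steps would likely take different paths, since a count can often be answered from compact structures without materializing every user ID.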
Performance Challenges
The platform must handle a massive feature space (≈1.2 trillion feature instances) while meeting low‑latency requirements for estimation (≤ 1 s) and selection (≤ 1 min).
First‑Round Optimizations
Inverted‑index construction to accelerate tag lookup.
ID‑mapping tables to reduce join cost.
Refined query logic to minimize data scans.
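As a rough illustration of the first two items, the sketch below builds an inverted index from user‑x‑tag rows and resolves external IDs through a precomputed mapping. The row layout, tag names, and function names are assumptions made for the example; the article does not describe the actual table schemas.

```python
from collections import defaultdict


def build_inverted_index(user_tag_rows):
    """Turn (user_id, tag) rows into a tag -> set of user IDs index."""
    index = defaultdict(set)
    for user_id, tag in user_tag_rows:
        index[tag].add(user_id)
    return index


def map_ids(external_ids, id_mapping):
    """Resolve external IDs to internal user IDs with a dict lookup
    instead of a join; unknown IDs are skipped."""
    return {id_mapping[e] for e in external_ids if e in id_mapping}


# Hypothetical user-x-tag rows.
rows = [
    (1, "gender:female"),
    (2, "gender:female"),
    (2, "brand:apple"),
    (3, "brand:apple"),
]
index = build_inverted_index(rows)

# A tag lookup is now a direct read, and combining tags is a set operation.
both = index["gender:female"] & index["brand:apple"]
```

With the index in place, a query touches only the rows for the tags it names rather than scanning the full user‑x‑tag table.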
Second‑Round Optimizations (Divide‑and‑Conquer)
Group contiguous user IDs and assign a common group identifier.
Perform set operations (union, intersect, difference) within each group.
Parallelize group‑level computation across multiple threads to achieve the required latency.
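The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the group size, the function names, and the fixed (A ∩ B) − exclude query shape are hypothetical. It works because set operations are element‑wise, so when groups partition the ID space, the union of per‑group results equals the global result. In CPython, threads running pure‑Python set operations are limited by the global interpreter lock, so the code shows the structure of the approach rather than a real speed‑up.

```python
from concurrent.futures import ThreadPoolExecutor

GROUP_SIZE = 1000  # hypothetical number of contiguous user IDs per group


def split_by_group(user_ids, group_size=GROUP_SIZE):
    """Bucket user IDs so that contiguous IDs share a group identifier."""
    groups = {}
    for uid in user_ids:
        groups.setdefault(uid // group_size, set()).add(uid)
    return groups


def select_crowd(include_a, include_b, exclude, group_size=GROUP_SIZE, workers=4):
    """Compute (A intersect B) minus exclude, one group at a time."""
    a = split_by_group(include_a, group_size)
    b = split_by_group(include_b, group_size)
    x = split_by_group(exclude, group_size)

    def one_group(gid):
        # Only IDs within the same group can ever match each other.
        return (a[gid] & b.get(gid, set())) - x.get(gid, set())

    result = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(one_group, list(a)):
            result |= part
    return result


crowd = select_crowd(range(0, 5000), range(2500, 7500), {3000, 3001})
```

Groups that appear in only one operand of an intersection are resolved immediately to the empty set, which is part of what keeps each unit of work small.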
Future Directions
Automatic detection of complex SQL conditions and generation of derived bitmap features to rewrite queries as bitmap operations.
Direct writing of Doris tablet files from Spark jobs, bypassing intermediate storage and improving ingestion speed.
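To make the first direction concrete, here is a hedged sketch of the idea of a derived bitmap feature: a compound condition is evaluated once, its matching users are stored as a bitmap, and later queries combine that bitmap with other tag bitmaps using bitwise operations. Plain Python integers stand in for a real bitmap type, and the user rows, field names, and condition are invented for the example.

```python
def to_bitmap(user_ids):
    """Pack non-negative integer user IDs into a Python int used as a bitmap."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap


def bitmap_to_ids(bitmap):
    """Unpack a bitmap back into a sorted list of user IDs."""
    ids, uid = [], 0
    while bitmap:
        if bitmap & 1:
            ids.append(uid)
        bitmap >>= 1
        uid += 1
    return ids


# Hypothetical user rows with raw attributes.
users = [
    {"user_id": 0, "age": 17, "answers_30d": 4},
    {"user_id": 1, "age": 22, "answers_30d": 0},
    {"user_id": 2, "age": 24, "answers_30d": 9},
    {"user_id": 3, "age": 31, "answers_30d": 12},
]

# The compound condition (18 <= age < 25 AND answers_30d >= 3) is
# evaluated once and kept as a derived bitmap feature.
young_active = to_bitmap(
    u["user_id"]
    for u in users
    if 18 <= u["age"] < 25 and u["answers_30d"] >= 3
)

# An ordinary tag bitmap, also hypothetical.
apple_users = to_bitmap([1, 2, 3])

# The original query becomes a single bitwise AND plus a bit count.
crowd = young_active & apple_users
crowd_size = bin(crowd).count("1")
```

The trade‑off is that each derived bitmap must be refreshed when the underlying attributes change, so the technique pays off mainly for conditions that are queried repeatedly.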
dbaplus Community