
How Zhihu Built a Scalable DMP: Architecture, Data Pipelines, and Real‑Time Targeting

This article details Zhihu's Data Management Platform (DMP), covering the business problems it solves, the end‑to‑end workflow, feature taxonomy, system architecture, data pipelines for batch and streaming, audience targeting processes, performance challenges, and future technical directions.

dbaplus Community

Background

Zhihu’s product ecosystem required a unified Data Management Platform (DMP) to centralize user‑level feature data, enable precise audience segmentation, and support data‑driven product development.

Business Process

The DMP supports three closed‑loop operational models:

In‑site operation loop: content‑driven, activity‑driven, and user‑driven campaigns executed within the platform.

In‑site to out‑site loop: growth‑driven advertising placed on external channels, with performance feedback collected back into the platform.

Out‑site to in‑site loop: external advertising brings users back into the platform for further targeting.

Feature Hierarchy

Features are organized into three layers:

Level 1: 8 primary groups.

Level 2: 40 secondary groups.

Tag groups: 120 groups (e.g., gender, device brand, interest topics) containing roughly 2.5 million distinct tags.

The platform stores about 1.1 billion user‑x‑tag records (≈110 billion feature instances), which feed downstream analysis.
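The three-layer hierarchy above could be modeled roughly as nested groups. This is only an illustrative sketch: the class names, the `tag_paths` helper, and the sample group names `basic` and `demographics` are hypothetical and do not come from Zhihu's implementation; only the `gender` tag group is mentioned in the article.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TagGroup:
    """Lowest layer: a named group (e.g. gender) holding concrete tags."""
    name: str
    tags: List[str] = field(default_factory=list)


@dataclass
class SecondaryGroup:
    """Level 2 group, containing tag groups keyed by name."""
    name: str
    tag_groups: Dict[str, TagGroup] = field(default_factory=dict)


@dataclass
class PrimaryGroup:
    """Level 1 group, containing secondary groups keyed by name."""
    name: str
    secondary: Dict[str, SecondaryGroup] = field(default_factory=dict)


def tag_paths(primary: PrimaryGroup) -> List[str]:
    """Flatten one primary group into 'level1/level2/tag_group/tag' paths."""
    paths = []
    for sec in primary.secondary.values():
        for grp in sec.tag_groups.values():
            for tag in grp.tags:
                paths.append("/".join([primary.name, sec.name, grp.name, tag]))
    return paths


# Hypothetical sample data for illustration only.
gender = TagGroup("gender", ["male", "female"])
demographics = SecondaryGroup("demographics", {"gender": gender})
basic = PrimaryGroup("basic", {"demographics": demographics})
```

Calling `tag_paths(basic)` yields one fully qualified path per tag, which is the kind of flat key a tag dictionary for search could index.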

System Architecture

The DMP is divided into three module categories:

External modules: high‑stability, high‑concurrency APIs; lightweight UI; configurable back‑end to minimize development effort.

Business modules: scalable crowd selection, insight, and generalization capabilities; new features or rules can be added with near‑zero cost.

Support modules: feature production, ID mapping, task orchestration, and storage designed for horizontal scaling and cost‑effective growth.

Function Inventory

Since launch the platform has delivered:

More than 50,000 crowd‑targeting operations.

More than 400 crowd‑insight analyses.

More than 60 crowd‑generalization tasks.

Feature Data Pipeline

Offline pipeline (Spark):

Hive → Feature extraction → Offline tags → Mapping → Doris / Elasticsearch / HDFS

Real‑time pipeline (Flink):

Kafka → Feature extraction → Real‑time tags → Mapping → Doris / Elasticsearch / HDFS
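Both pipelines run the same stages (feature extraction, tagging, mapping) and differ mainly in their source. The plain-Python sketch below illustrates that shape only; it does not use Spark or Flink, and the event layout, the `device_brand:` tag format, and every function name are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, str]   # hypothetical raw event: (external user id, device brand)
TagRow = Tuple[int, str]  # (mapped internal id, tag)


def extract_tag(event: Event) -> Tuple[str, str]:
    """Feature extraction: turn a raw event into (user id, tag)."""
    user_id, brand = event
    return user_id, f"device_brand:{brand.lower()}"


def map_id(user_id: str, id_map: Dict[str, int]) -> int:
    """Mapping: assign each external id a stable internal integer id."""
    if user_id not in id_map:
        id_map[user_id] = len(id_map) + 1
    return id_map[user_id]


def run_pipeline(source: Iterable[Event], id_map: Dict[str, int]) -> Iterator[TagRow]:
    """Apply the same stages to any source, batch (list) or streaming (iterator)."""
    for event in source:
        user_id, tag = extract_tag(event)
        yield map_id(user_id, id_map), tag


shared_ids: Dict[str, int] = {}
# A list stands in for a batch table; an iterator stands in for a message stream.
offline_rows = list(run_pipeline([("u-a", "Apple"), ("u-b", "Huawei")], shared_ids))
realtime_rows = list(run_pipeline(iter([("u-a", "Xiaomi")]), shared_ids))
```

Because both runs share one ID map, the same user receives the same internal id regardless of which pipeline produced the tag, so the resulting rows can land in the same stores.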

Storage Layer

Doris:

User‑x‑Tag table – ~1.1 billion rows.

ID‑Mapping wide table – ~850 million rows.

Elasticsearch: tag dictionary for search – ~2.5 million entries.

Daily data throughput reaches 2.x TB, accumulating to ~11 TB over a five‑day window (offline + real‑time).

Audience Targeting Workflow

The end‑to‑end process consists of:

Tag search.

Tag selection.

Crowd estimation (≤ 1 s).

Crowd selection (≤ 1 min).

Seed‑crowd upload (optional).

Crowd generalization.

Typical pipelines include: tags added to a cart → selection; seed crowd → generalization; and historical‑effect crowd → insight → re‑tagging → selection.

Performance Challenges

The platform must handle a massive feature space (≈1.2 trillion feature instances) while meeting low‑latency requirements for estimation (≤ 1 s) and selection (≤ 1 min).

First‑Round Optimizations

Inverted‑index construction to accelerate tag lookup.

ID‑mapping tables to reduce join cost.

Refined query logic to minimize data scans.
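A minimal sketch of how the first two optimizations fit together, assuming that ID mapping produces dense integer ids and that the inverted index stores one bitmap per tag. Python integers stand in for real bitmaps here, and the function names and sample tags are hypothetical, not the platform's actual schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def build_index(rows: Iterable[Tuple[str, str]], id_map: Dict[str, int]) -> Dict[str, int]:
    """Build tag -> bitmap, where bit i is set if internal user id i has the tag."""
    index: Dict[str, int] = defaultdict(int)
    for user_id, tag in rows:
        # ID mapping: external string id -> dense 0-based integer id.
        internal = id_map.setdefault(user_id, len(id_map))
        index[tag] |= 1 << internal
    return dict(index)


def crowd_size(bitmap: int) -> int:
    """Estimation only needs a population count, not the member ids."""
    return bin(bitmap).count("1")


rows = [
    ("u-a", "gender:f"),
    ("u-b", "gender:f"),
    ("u-a", "topic:ai"),
    ("u-c", "topic:ai"),
]
id_map: Dict[str, int] = {}
index = build_index(rows, id_map)

both = index["gender:f"] & index["topic:ai"]      # intersect
either = index["gender:f"] | index["topic:ai"]    # union
only_f = index["gender:f"] & ~index["topic:ai"]   # difference
```

With this layout a tag lookup is a dictionary access and combining tags is a bitwise operation, so no join against the raw user‑x‑tag rows is needed at query time.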

Second‑Round Optimizations (Divide‑and‑Conquer)

Group contiguous user IDs and assign a common group identifier.

Perform set operations (union, intersect, difference) within each group.

Parallelize group‑level computation across multiple threads to achieve the required latency.
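The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the platform's code: `GROUP_SIZE`, `partition`, and `select` are hypothetical names, plain Python sets stand in for bitmaps, and the query shape (A ∩ B minus an exclusion set) is just one example of a set expression.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Set

GROUP_SIZE = 4  # hypothetical; kept tiny so the example is easy to follow

Grouped = Dict[int, Set[int]]  # group identifier -> user ids in that group


def partition(ids: Iterable[int]) -> Grouped:
    """Step 1: contiguous ids share a group identifier (id // GROUP_SIZE)."""
    groups: Grouped = {}
    for uid in ids:
        groups.setdefault(uid // GROUP_SIZE, set()).add(uid)
    return groups


def select(a: Grouped, b: Grouped, exclude: Grouped, workers: int = 4) -> List[int]:
    """Compute (A intersect B) minus exclude, one group at a time, in parallel."""

    def one_group(group_id: int) -> Set[int]:
        # Step 2: set operations stay inside a single group.
        empty: Set[int] = set()
        hit = a.get(group_id, empty) & b.get(group_id, empty)
        return hit - exclude.get(group_id, empty)

    # Only groups present on both sides can contribute to an intersection.
    candidates = sorted(a.keys() & b.keys())
    # Step 3: fan the per-group work out over a thread pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(one_group, candidates))
    return sorted(uid for part in parts for uid in part)


tag_a = partition([0, 1, 2, 5, 6, 9])
tag_b = partition([1, 2, 3, 5, 9, 10])
blocked = partition([2, 9])
result = select(tag_a, tag_b, blocked)
```

Because groups never share ids, the per-group results can simply be concatenated. In CPython the global interpreter lock limits the speed-up from threads for pure-Python work, so this shows the structure of the approach rather than its performance.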

Future Directions

Automatic detection of complex SQL conditions and generation of derived bitmap features to rewrite queries as bitmap operations.

Direct writing of Doris tablet files from Spark jobs, bypassing intermediate storage and improving ingestion speed.
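The first direction, turning a complex condition into a derived bitmap feature, might look like the sketch below. Everything in it is assumed for illustration: the record fields, the "active author" condition, and the helper names are hypothetical, and Python integers again stand in for bitmaps.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, int]  # hypothetical per-user row; "id" is the internal user id


def derive_bitmap(records: Iterable[Record], predicate: Callable[[Record], bool]) -> int:
    """Evaluate a complex condition once, ahead of time, and store it as a bitmap."""
    bitmap = 0
    for record in records:
        if predicate(record):
            bitmap |= 1 << record["id"]
    return bitmap


def ids_of(bitmap: int) -> List[int]:
    """List the user ids whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


records = [
    {"id": 0, "answers": 12, "days_inactive": 2},
    {"id": 1, "answers": 3, "days_inactive": 1},
    {"id": 2, "answers": 40, "days_inactive": 30},
    {"id": 3, "answers": 15, "days_inactive": 5},
]

# A condition that would otherwise be a multi-column filter in SQL.
active_author = derive_bitmap(
    records, lambda r: r["answers"] >= 10 and r["days_inactive"] <= 7
)

# Hypothetical existing tag bitmap covering user ids 1 and 3.
gender_f = (1 << 1) | (1 << 3)

# The rewritten query is a single bitmap intersection.
matched = ids_of(gender_f & active_author)
```

Once the derived feature exists, a query containing the original condition can be rewritten to reference the bitmap, so it runs at the same cost as any other tag combination.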
