
Architecture and Practices of Zhihu DMP System Based on Doris

This article presents a comprehensive overview of Zhihu's Data Management Platform (DMP), covering its business background, three core business modes, detailed architecture, offline and real‑time data pipelines, feature storage design, performance optimization techniques, and future iteration directions.

DataFunTalk

The presentation introduces Zhihu DMP and explains why Zhihu needed a customized data platform to support its internal operations. It is organized into four parts: background, architecture and implementation, challenges and solutions, and future outlook.

Three business modes are described: external‑to‑internal, internal‑to‑external, and internal closed‑loop. These modes support scenarios such as feed recommendation, advertising, detail‑page prompts, activity platforms, push systems, and external ad delivery.

Core functional requirements focus on audience management, including audience integration, targeting, and insight capabilities.

The platform architecture is divided into external modules (high‑availability APIs, simple front‑end, configurable back‑end) and business modules (audience selection, insight, ID‑mapping, feature production, storage, and AB‑testing), forming four major functional blocks.

There are two data pipelines. In the offline pipeline, Spark batch jobs generate tag tables in Hive, and an ID‑mapping step then converts raw identifiers into unified user IDs that are stored in Doris. In the real‑time pipeline, Flink streams produce live tags and apply the same mapping. Tags are indexed in Elasticsearch for fast lookup, while Doris stores the final user‑tag and ID‑mapping tables.
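The role of the shared ID‑mapping step can be illustrated with a minimal Python sketch. It assumes that ID‑mapping amounts to assigning one dense integer to each raw identifier; the `IdMapper` class, the record layout, and the sample tags are hypothetical and are not taken from Zhihu's implementation.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (id_type, raw_id, tag)
Record = Tuple[str, str, str]


class IdMapper:
    """Assigns one dense integer ID per raw identifier (illustrative only)."""

    def __init__(self) -> None:
        self._ids: Dict[Tuple[str, str], int] = {}

    def unified_id(self, id_type: str, raw_id: str) -> int:
        key = (id_type, raw_id)
        # Reuse the existing mapping, or allocate the next integer.
        if key not in self._ids:
            self._ids[key] = len(self._ids)
        return self._ids[key]


def build_user_tag_rows(records: Iterable[Record],
                        mapper: IdMapper) -> List[Tuple[int, str]]:
    """Turn (id_type, raw_id, tag) records into (unified_id, tag) rows."""
    return [(mapper.unified_id(t, r), tag) for t, r, tag in records]


mapper = IdMapper()
# Stand-ins for the offline (batch) and real-time (stream) tag sources.
offline = [("device", "d-1", "gender=female"), ("device", "d-2", "gender=male")]
realtime = [("device", "d-1", "active_today")]

# Both paths go through the same mapper, so "d-1" gets the same unified ID.
rows = build_user_tag_rows(offline, mapper) + build_user_tag_rows(realtime, mapper)
print(rows)
```

Because both paths share one mapping, tags from either source line up on the same integer user ID, which is what makes a compact bitmap representation of audiences possible.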

Performance challenges in audience targeting are addressed through bitmap inverted indexes, converting logical conditions to bitmap operations, and a “divide‑and‑conquer” strategy that leverages Doris colocate groups and newer bitmap functions to reduce network I/O and improve query speed.
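The sketch below illustrates the idea in plain Python, with integers standing in for bitmaps. It is not Doris code: the nested‑tuple condition format, the shard count, and all function names are assumptions made for the example. Sharding by user ID keeps all of a user's tags in one shard, so each shard can be evaluated locally and only the counts need to be combined.

```python
from collections import defaultdict
from typing import Dict, List, Tuple, Union

Row = Tuple[int, str]            # (unified_user_id, tag)
Expr = Union[str, tuple]         # tag name, or ("and" | "or" | "not", ...)


def build_index(rows: List[Row]) -> Dict[str, int]:
    """Inverted index: tag -> bitmap (a Python int with bit i set for user i)."""
    index: Dict[str, int] = defaultdict(int)
    for user_id, tag in rows:
        index[tag] |= 1 << user_id
    return index


def universe_of(rows: List[Row]) -> int:
    """Bitmap of every user that appears in the rows."""
    bitmap = 0
    for user_id, _ in rows:
        bitmap |= 1 << user_id
    return bitmap


def evaluate(expr: Expr, index: Dict[str, int], universe: int) -> int:
    """Translate a logical condition into bitmap AND / OR / NOT operations."""
    if isinstance(expr, str):
        return index.get(expr, 0)
    op, *args = expr
    if op == "not":
        return universe & ~evaluate(args[0], index, universe)
    if op not in ("and", "or"):
        raise ValueError(f"unknown operator: {op}")
    result = evaluate(args[0], index, universe)
    for arg in args[1:]:
        bitmap = evaluate(arg, index, universe)
        result = result & bitmap if op == "and" else result | bitmap
    return result


def count(bitmap: int) -> int:
    """Number of set bits, i.e. the audience size."""
    return bin(bitmap).count("1")


def estimate(expr: Expr, rows: List[Row]) -> int:
    """Single-node estimate over one global index."""
    return count(evaluate(expr, build_index(rows), universe_of(rows)))


def estimate_sharded(expr: Expr, rows: List[Row], num_shards: int = 4) -> int:
    """Divide and conquer: evaluate each user-ID shard alone, then sum counts."""
    shards: Dict[int, List[Row]] = defaultdict(list)
    for user_id, tag in rows:
        shards[user_id % num_shards].append((user_id, tag))
    return sum(estimate(expr, shard_rows) for shard_rows in shards.values())


rows = [(0, "female"), (1, "male"), (2, "female"), (0, "active"),
        (1, "active"), (2, "new_user"), (5, "female"), (5, "active")]
condition = ("and", "female", ("or", "active", "new_user"))

print(estimate(condition, rows))          # 3
print(estimate_sharded(condition, rows))  # 3
```

In the sharded version only small integer counts cross shard boundaries instead of whole bitmaps, which mirrors the reduction in network I/O described above.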

With these optimizations, audience estimation completes in under a second and audience selection within minutes, which meets the operational goals. Future work includes tighter integration of business modules, enhanced A/B testing, automated SQL rewriting for complex queries, and faster data ingestion by writing directly to Doris tablets via Spark.
