
NetEase Yanxuan DMP Tag System Construction Practice

This article describes NetEase Yanxuan’s DMP tag system: its platform overview, tag production workflow, storage architecture, high‑performance query techniques, and future plans. It shows how data from multiple sources moves through the ODS, DWD, and DM layers and is served through Spark, Hive, and Apache Doris for real‑time and offline analytics.

DataFunTalk

Platform Overview

The DMP serves as NetEase Yanxuan’s data middle‑platform, ingesting logs from self‑operated apps, internal data, and third‑party channels (JD, Taobao, Douyin). After collection and cleaning, data is stored as assets and used to build a tag‑centric user portrait system that supports intelligent product selection, precise outreach, and user insight.

Core Concepts

• Tag: descriptive attributes of an entity (e.g., age, location, preferences).
• Audience Circle: a subset of users selected by combining tag conditions.
• Portrait Analysis: analysis of the behavior and tag distribution of a selected audience.
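
The three concepts above can be made concrete with a small in‑memory sketch. It is only an illustration: the sample users, the tag names, and helpers such as `tag_equals`, `all_of`, and `select_audience` are hypothetical and not taken from the Yanxuan DMP.

```python
from collections import Counter
from typing import Callable, Dict, List

Tags = Dict[str, object]
Condition = Callable[[Tags], bool]

# Hypothetical sample data: user id -> tag name -> tag value.
USER_TAGS: Dict[str, Tags] = {
    "u1": {"age": 28, "city": "Hangzhou", "preference": "home"},
    "u2": {"age": 41, "city": "Beijing", "preference": "food"},
    "u3": {"age": 33, "city": "Hangzhou", "preference": "food"},
}


def tag_equals(name: str, value: object) -> Condition:
    """Condition: the tag `name` has exactly this value."""
    return lambda tags: tags.get(name) == value


def tag_between(name: str, low: float, high: float) -> Condition:
    """Condition: the numeric tag `name` lies in [low, high]."""
    def check(tags: Tags) -> bool:
        value = tags.get(name)
        return isinstance(value, (int, float)) and low <= value <= high
    return check


def all_of(*conditions: Condition) -> Condition:
    """Combine tag conditions with logical AND."""
    return lambda tags: all(cond(tags) for cond in conditions)


def select_audience(user_tags: Dict[str, Tags], condition: Condition) -> List[str]:
    """Audience circle: the users whose tags satisfy the combined condition."""
    return sorted(uid for uid, tags in user_tags.items() if condition(tags))


def tag_distribution(user_tags: Dict[str, Tags], audience: List[str], tag_name: str) -> Dict[object, int]:
    """Portrait analysis (one slice of it): count tag values inside an audience."""
    return dict(Counter(user_tags[uid].get(tag_name) for uid in audience))


rule = all_of(tag_equals("city", "Hangzhou"), tag_between("age", 25, 35))
audience = select_audience(USER_TAGS, rule)
distribution = tag_distribution(USER_TAGS, audience, "preference")
```

Representing conditions as composable predicates keeps the rule definition separate from the data it runs on, which is the property a rule DSL relies on.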

Business Capabilities

The system provides two capabilities: (1) tag query, which returns basic information about an entity, and (2) audience circle selection, which runs in real‑time and offline modes and covers membership checks (whether a user belongs to an audience), result set extraction, and portrait analysis.

Workflow

1) Define tag and audience rules.
2) Translate the rules into a DSL and submit them to Spark.
3) Store the results in Hive and Doris.
4) Business services query Hive or Doris as needed.
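
Step 2 of the workflow can be sketched as a small compiler from a rule tree to a SQL string. The article does not show the actual DSL, so the dictionary format, the table name `dm.user_tags`, and the column names below are all assumptions made for illustration.

```python
import re

ALLOWED_COMPARISONS = {"=", "!=", ">", ">=", "<", "<="}
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def sql_literal(value):
    """Render a Python value as a SQL literal; reject anything risky."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        # This sketch refuses quotes and backslashes instead of escaping them;
        # a real system should prefer parameter binding.
        if "'" in value or "\\" in value:
            raise ValueError(f"unsupported characters in string value: {value!r}")
        return f"'{value}'"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def compile_rule(node):
    """Recursively turn a hypothetical rule tree into a SQL boolean expression."""
    op = node.get("op")
    if op in ("and", "or"):
        children = node["children"]
        if not children:
            raise ValueError("a logical node needs at least one child")
        joined = f" {op.upper()} ".join(compile_rule(child) for child in children)
        return f"({joined})"
    tag, comparison = node["tag"], node["cmp"]
    if not IDENTIFIER.match(tag):
        raise ValueError(f"invalid tag name: {tag!r}")
    if comparison not in ALLOWED_COMPARISONS:
        raise ValueError(f"invalid comparison: {comparison!r}")
    return f"{tag} {comparison} {sql_literal(node['value'])}"


def build_query(rule, table="dm.user_tags"):
    """Wrap the compiled condition in a SELECT over an assumed tag table."""
    return f"SELECT user_id FROM {table} WHERE {compile_rule(rule)}"


RULE = {
    "op": "and",
    "children": [
        {"tag": "city", "cmp": "=", "value": "Hangzhou"},
        {
            "op": "or",
            "children": [
                {"tag": "age", "cmp": ">=", "value": 25},
                {"tag": "order_count_30d", "cmp": ">", "value": 3},
            ],
        },
    ],
}
QUERY = build_query(RULE)
```

In a PySpark job, a string like `QUERY` could then be passed to `SparkSession.sql`, with the resulting user set written to Hive and Doris as described in step 3.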

Tag Production

Data flows from ODS (raw logs, binlog) → DWD (detail tables) → DM (aggregated tag layer). Automation is high for ODS, partial for DWD, and limited for DM. Tags are classified along three dimensions:

• Timeliness: offline, near‑real‑time, real‑time.
• Granularity: aggregated vs. detail.
• Category: account, consumption, activity, preference, asset.
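
The ODS → DWD → DM flow can be illustrated with a toy pipeline in plain Python. The event fields, the cleaning rules, and the two aggregated tags (`order_count`, `total_amount`) are hypothetical; in the system described here these steps run on Spark and Hive rather than in memory.

```python
from collections import defaultdict

# ODS: raw events as collected, including noise and malformed rows (hypothetical data).
ODS_EVENTS = [
    {"user_id": "u1", "event": "order_paid", "amount": "100.0"},
    {"user_id": "u1", "event": "order_paid", "amount": "50.0"},
    {"user_id": "u2", "event": "page_view", "amount": None},
    {"user_id": None, "event": "order_paid", "amount": "30.0"},  # malformed: no user id
    {"user_id": "u2", "event": "order_paid", "amount": "20.5"},
]


def to_dwd(ods_events):
    """DWD: keep valid paid-order events and convert fields to proper types."""
    rows = []
    for event in ods_events:
        if event["user_id"] is None or event["event"] != "order_paid":
            continue
        rows.append({"user_id": event["user_id"], "amount": float(event["amount"])})
    return rows


def to_dm(dwd_rows):
    """DM: aggregate detail rows into per-user consumption tags."""
    tags = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
    for row in dwd_rows:
        user_tags = tags[row["user_id"]]
        user_tags["order_count"] += 1
        user_tags["total_amount"] += row["amount"]
    return dict(tags)


DM_TAGS = to_dm(to_dwd(ODS_EVENTS))
```

Here `order_count` and `total_amount` would be aggregated tags in the consumption category, while the DWD rows correspond to detail‑granularity data.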

Tag Storage

The first version used multiple engines (Hive, HBase, Kudu, Elasticsearch, Redis), which led to complexity and data‑consistency risks. Version 2 consolidates storage on Apache Doris for both offline and real‑time tags, while still using Hive for bulk data and Redis for cached audience sets. This achieves acceptable query latency (≤20 ms p99) and reduces operational overhead.

High‑Performance Query

Static audience packages are pre‑computed and stored in Redis, where membership is evaluated via Lua scripts. Real‑time audience selection pulls data from APIs and Doris, and uses async queries, short‑circuit logic, and join reduction to keep latency low. Doris UDFs add path analysis, which Doris does not support natively.
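
One way to combine async queries with short‑circuit logic is sketched below using `asyncio`: all condition lookups start concurrently, and as soon as one returns false the rest are cancelled. This is an assumption about how the two ideas might fit together, not the article's implementation; `fake_lookup` stands in for a real API or Doris call.

```python
import asyncio
from typing import Any, Callable, Coroutine, List

Check = Callable[[], Coroutine[Any, Any, bool]]


async def matches_all(checks: List[Check]) -> bool:
    """Run all checks concurrently; stop early when any check is False."""
    tasks = [asyncio.create_task(check()) for check in checks]
    try:
        for next_done in asyncio.as_completed(tasks):
            if not await next_done:
                return False  # short-circuit: an AND rule cannot match anymore
        return True
    finally:
        # Cancel whatever is still running and wait for the cancellation to finish.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def fake_lookup(name: str, result: bool, delay: float, log: List[str]) -> bool:
    """Hypothetical stand-in for a remote tag lookup (API or Doris query)."""
    await asyncio.sleep(delay)
    log.append(name)  # only reached if the lookup was not cancelled
    return result


def run_demo():
    log: List[str] = []
    checks = [
        lambda: fake_lookup("doris_tag", True, 0.5, log),   # slow, would pass
        lambda: fake_lookup("api_flag", False, 0.01, log),  # fast, fails
    ]
    return asyncio.run(matches_all(checks)), log
```

In `run_demo`, the fast failing lookup decides the result, so the slow lookup is cancelled before it completes. For an OR rule the same pattern applies with the early exit on the first true result.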

Future Planning

• Migrate remaining Hive/Spark workloads to Doris to boost storage‑compute performance.
• Refine the tag ecosystem with richer evaluation metrics, higher quality, faster production, and broader coverage.
• Enhance user analysis models and generalized portrait capabilities to support smarter operations.

Overall, the DMP demonstrates a data‑driven, tag‑based approach to fine‑grained user operation, balancing scalability, performance, and maintainability across the big‑data stack.
