
Practical Implementation of NetEase Yanxuan DMP Tag System: Architecture, Tag Production, Storage, and High‑Performance Query

This article details NetEase Yanxuan's DMP tag system, covering platform overview, tag definitions, production pipelines, multi‑layer storage architecture, high‑performance query techniques, and future roadmap, illustrating how data from various sources is transformed into actionable user tags for refined operations.

DataFunSummit

The NetEase Yanxuan DMP is a data platform that ingests logs from the app, mini‑program, and PC clients, along with internal data sources and third‑party channels (JD, Taobao, Douyin). It cleanses this data and builds a tag‑based user profile system that supports intelligent product selection, precise outreach, and user insight.

Key concepts include tags (feature descriptors attached to an entity such as a user), crowd selection (picking a user group by combining tag conditions), and profile analysis (analyzing the behavior and consumption of a selected group).
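As a minimal sketch of how tags and a tag-condition selection relate, the example below keeps tags in a plain dictionary and selects users who satisfy every condition. The tag names, user ids, and the `Condition` and `select_crowd` helpers are hypothetical and chosen only for illustration; the article does not describe the real data model at this level.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical in-memory tag store: user id -> {tag name: tag value}.
USER_TAGS: Dict[str, Dict[str, Any]] = {
    "u1": {"gender": "F", "orders_30d": 5, "member_level": "gold"},
    "u2": {"gender": "M", "orders_30d": 0, "member_level": "silver"},
    "u3": {"gender": "F", "orders_30d": 2, "member_level": "silver"},
}


@dataclass(frozen=True)
class Condition:
    """One tag condition, for example orders_30d >= 2."""
    tag: str
    test: Callable[[Any], bool]

    def matches(self, tags: Dict[str, Any]) -> bool:
        # A user that lacks the tag never matches the condition.
        return self.tag in tags and self.test(tags[self.tag])


def select_crowd(conditions: List[Condition]) -> List[str]:
    """Return sorted ids of users whose tags satisfy every condition (logical AND)."""
    return sorted(
        uid
        for uid, tags in USER_TAGS.items()
        if all(c.matches(tags) for c in conditions)
    )


active_women = select_crowd([
    Condition("gender", lambda v: v == "F"),
    Condition("orders_30d", lambda v: v >= 2),
])
print(active_women)  # ['u1', 'u3']
```

A production system would push such conditions down to the storage engine rather than scan users in application code; the scan here only makes the semantics visible.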

The system provides two core capabilities: (1) tag query, which retrieves the tags of a specific entity, and (2) crowd selection, available in both real‑time and offline modes. Crowd selection supports group judgment (checking whether a user belongs to a group), result‑set extraction, and profile analysis.
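The three crowd operations named above can be illustrated with ordinary set and counter operations. This is only a sketch under assumed data: the crowd contents, the `MEMBER_LEVEL` tag column, and the function names are hypothetical stand-ins, not the platform's API.

```python
from collections import Counter
from typing import Dict, List, Set

# Hypothetical data: one pre-selected crowd and one tag column used for profiling.
CROWD: Set[str] = {"u1", "u3", "u4"}
MEMBER_LEVEL: Dict[str, str] = {
    "u1": "gold", "u2": "silver", "u3": "silver", "u4": "gold",
}


def is_member(user_id: str, crowd: Set[str]) -> bool:
    """Group judgment: does this user belong to the crowd?"""
    return user_id in crowd


def extract(crowd: Set[str], limit: int) -> List[str]:
    """Result-set extraction: up to `limit` member ids in a stable order."""
    return sorted(crowd)[:limit]


def profile(crowd: Set[str], tag_values: Dict[str, str]) -> Dict[str, int]:
    """Profile analysis: distribution of one tag across the crowd."""
    return dict(Counter(tag_values[u] for u in crowd if u in tag_values))


print(is_member("u2", CROWD))        # False
print(extract(CROWD, 2))             # ['u1', 'u3']
print(profile(CROWD, MEMBER_LEVEL))
```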

Tag production follows a multi‑layer data pipeline: raw logs are stored in the ODS layer, refined into detailed tables in the DWD layer, and finally aggregated into the DM layer where all tags are derived. Automation is high for ODS ingestion, partial for DWD, and limited for DM, with ongoing work to increase automation.
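The layered flow can be sketched in a few lines: raw lines stand in for ODS, cleaned rows for DWD, and per-user aggregates for DM. The log format, field names, and the `pay_count` and `pay_amount` tags are assumptions made for this example; in the platform described here, these steps run as warehouse jobs rather than in-process Python.

```python
import json
from collections import defaultdict
from typing import Dict, List

# ODS: raw log lines as ingested (hypothetical format, one JSON object per line).
ODS_LOGS: List[str] = [
    '{"uid": "u1", "event": "pay", "amount": "120.5"}',
    '{"uid": "u1", "event": "view", "amount": null}',
    'corrupted line',
    '{"uid": "u2", "event": "pay", "amount": "30"}',
    '{"uid": "u1", "event": "pay", "amount": "79.5"}',
]


def ods_to_dwd(lines: List[str]) -> List[dict]:
    """Clean raw lines into typed detail rows; drop anything unparseable."""
    rows = []
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip corrupted input instead of failing the batch
        if not isinstance(rec, dict) or "uid" not in rec or "event" not in rec:
            continue
        amount = float(rec["amount"]) if rec.get("amount") is not None else 0.0
        rows.append({"uid": rec["uid"], "event": rec["event"], "amount": amount})
    return rows


def dwd_to_dm(rows: List[dict]) -> Dict[str, dict]:
    """Aggregate detail rows into per-user tags."""
    tags = defaultdict(lambda: {"pay_count": 0, "pay_amount": 0.0})
    for row in rows:
        if row["event"] == "pay":
            tags[row["uid"]]["pay_count"] += 1
            tags[row["uid"]]["pay_amount"] += row["amount"]
    return dict(tags)


dwd_rows = ods_to_dwd(ODS_LOGS)
dm_tags = dwd_to_dm(dwd_rows)
print(dm_tags["u1"])  # {'pay_count': 2, 'pay_amount': 200.0}
```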

Tags are categorized by timeliness (offline, near‑real‑time, real‑time), granularity (aggregate vs. detail), and business dimension (account attributes, consumption behavior, activity, preferences, asset information).
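One way to make these three classification axes explicit is a small metadata record per tag. The enum members mirror the categories listed above, while the `TagMeta` class, the catalog entries, and the `find_tags` helper are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Timeliness(Enum):
    OFFLINE = "offline"
    NEAR_REAL_TIME = "near_real_time"
    REAL_TIME = "real_time"


class Granularity(Enum):
    AGGREGATE = "aggregate"
    DETAIL = "detail"


class Dimension(Enum):
    ACCOUNT = "account_attributes"
    CONSUMPTION = "consumption_behavior"
    ACTIVITY = "activity"
    PREFERENCE = "preferences"
    ASSET = "asset_information"


@dataclass(frozen=True)
class TagMeta:
    name: str
    timeliness: Timeliness
    granularity: Granularity
    dimension: Dimension


# Hypothetical catalog entries, for illustration only.
CATALOG: List[TagMeta] = [
    TagMeta("member_level", Timeliness.OFFLINE, Granularity.AGGREGATE, Dimension.ACCOUNT),
    TagMeta("pay_amount_30d", Timeliness.OFFLINE, Granularity.AGGREGATE, Dimension.CONSUMPTION),
    TagMeta("last_click_item", Timeliness.REAL_TIME, Granularity.DETAIL, Dimension.ACTIVITY),
]


def find_tags(catalog: List[TagMeta], **filters) -> List[str]:
    """Return names of tags whose metadata equals every given filter value."""
    return [
        t.name
        for t in catalog
        if all(getattr(t, field) == value for field, value in filters.items())
    ]


print(find_tags(CATALOG, timeliness=Timeliness.OFFLINE))
# ['member_level', 'pay_amount_30d']
```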

Storage requirements include high‑performance query, SQL support, update capability, large‑scale data handling, extensible functions, and tight integration with the big‑data ecosystem. Version 1 used a mix of Hive, HBase, Kudu, Elasticsearch, and Redis, leading to complexity and data‑consistency risks.

Version 2 consolidates storage around Apache Doris, keeping offline data in Hive, importing base tags into Doris, and storing real‑time data also in Doris. Spark performs joint queries on Hive + Doris, with results cached in Redis, achieving acceptable latency (≤20 ms for most queries) and reduced operational overhead.
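The caching step can be pictured as a cache-aside lookup in front of the query engine. The sketch below is an assumption about how such a layer might look, not the team's implementation: `TTLCache` is an in-memory stand-in for Redis, `fake_engine` stands in for the Spark job over Hive + Doris, and the key prefix, SQL text, and TTL are invented for the example.

```python
import hashlib
import time
from typing import Callable, Dict, List, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a Redis-style get/set with expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._data: Dict[str, Tuple[float, List[str]]] = {}
        self._clock = clock

    def get(self, key: str) -> Optional[List[str]]:
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries are dropped lazily on read
            return None
        return value

    def set(self, key: str, value: List[str], ttl_seconds: float) -> None:
        self._data[key] = (self._clock() + ttl_seconds, value)


def cached_query(sql: str, run_query: Callable[[str], List[str]],
                 cache: TTLCache, ttl_seconds: float = 60.0) -> List[str]:
    """Cache-aside: serve from cache when possible, else query and store."""
    key = "dmp:q:" + hashlib.sha256(sql.encode("utf-8")).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = run_query(sql)
    cache.set(key, result, ttl_seconds)
    return result


calls: List[str] = []


def fake_engine(sql: str) -> List[str]:
    # Hypothetical stand-in for the slow joint query.
    calls.append(sql)
    return ["u1", "u3"]


cache = TTLCache()
first = cached_query("SELECT uid FROM tags WHERE level = 'gold'", fake_engine, cache)
second = cached_query("SELECT uid FROM tags WHERE level = 'gold'", fake_engine, cache)
print(len(calls))  # 1
```

The second call is answered from the cache, so the engine runs only once until the entry expires.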

High‑performance query techniques include pre‑computed static crowd packs stored in Redis with Lua‑based batch checks, asynchronous queries, short‑circuit evaluation, and optimized join strategies. Custom Doris UDFs enable path analysis for crowd analytics.
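Two of these techniques, batch checks against pre-computed crowds and short-circuit evaluation, can be sketched without any infrastructure. Python sets stand in here for the Redis-held crowd packs, and a single function call stands in for the single round trip that a Lua script would give; the crowd ids, cost numbers, and function names are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

# Stand-in for pre-computed static crowd packs held in Redis (hypothetical ids).
CROWD_PACKS: Dict[str, Set[str]] = {
    "crowd:new_users": {"u1", "u2"},
    "crowd:high_value": {"u2", "u3"},
    "crowd:churn_risk": {"u4"},
}


def batch_membership(user_id: str, crowd_ids: List[str]) -> Dict[str, bool]:
    """Check one user against many crowds in a single call."""
    return {cid: user_id in CROWD_PACKS.get(cid, set()) for cid in crowd_ids}


# A check is (estimated cost, predicate); costs are illustrative numbers.
Check = Tuple[int, Callable[[str], bool]]


def all_match(user_id: str, checks: List[Check]) -> Tuple[bool, int]:
    """AND the checks cheapest first and stop at the first failure.

    Returns (result, number of predicates actually evaluated).
    """
    evaluated = 0
    for _cost, predicate in sorted(checks, key=lambda c: c[0]):
        evaluated += 1
        if not predicate(user_id):
            return False, evaluated
    return True, evaluated


expensive_calls: List[str] = []


def expensive_realtime_check(uid: str) -> bool:
    # Hypothetical stand-in for a slow real-time query.
    expensive_calls.append(uid)
    return True


checks: List[Check] = [
    (100, expensive_realtime_check),
    (1, lambda uid: uid in CROWD_PACKS["crowd:high_value"]),
]

print(batch_membership("u2", list(CROWD_PACKS)))
print(all_match("u1", checks))  # (False, 1)
```

Because the cheap set lookup fails for `u1`, the expensive check is never invoked, which is the saving that short-circuit evaluation aims for.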

Future plans aim to migrate all Hive and Spark workloads to Doris, enhance the tag evaluation framework, improve tag quality, coverage, and production speed, and build richer user analysis models for more precise operations.

Big Data, data pipeline, Hive, Spark, Tag System, Apache Doris, DMP
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
