Practical Implementation of NetEase Yanxuan DMP Tag System: Architecture, Tag Production, Storage, and High‑Performance Query

This article details NetEase Yanxuan's DMP tag system, covering platform overview, tag definitions, production pipelines, multi‑layer storage architecture, high‑performance query techniques, and future roadmap, illustrating how data from various sources is transformed into actionable user tags for refined operations.

DataFunSummit

Sep 21, 2022

The NetEase Yanxuan DMP serves as a data‑centered platform that ingests logs from app, mini‑program, PC, internal data sources, and third‑party channels (JD, Taobao, Douyin), cleanses them, and builds a tag‑based user profile system to support intelligent product selection, precise outreach, and user insight.

Key concepts include tags (feature descriptors attached to an entity such as a user), crowd selection (picking a group of users by combining tag conditions), and profile analysis (behavioral and consumption analysis of a selected group).

The system provides two core capabilities: (1) tag query, which retrieves specific tags for a given entity, and (2) crowd selection, available in both real‑time and offline modes, which supports membership checks (whether a user belongs to a crowd), result‑set extraction, and profile analysis.
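To make the two capabilities concrete, here is a minimal sketch in Python. It assumes a hypothetical in-memory tag store (`TAGS`) and invented tag names; the real system serves these requests from its storage layer, and the function names below are illustrative only.

```python
from typing import Callable, Dict, Set

# Hypothetical in-memory tag store: user_id -> {tag_name: value}.
TAGS: Dict[str, Dict[str, object]] = {
    "u1": {"gender": "F", "orders_30d": 3, "prefers": "home"},
    "u2": {"gender": "M", "orders_30d": 0, "prefers": "food"},
    "u3": {"gender": "F", "orders_30d": 1, "prefers": "food"},
}


def query_tag(user_id: str, tag: str):
    """Capability 1: look up one tag for one entity (None if absent)."""
    return TAGS.get(user_id, {}).get(tag)


def select_crowd(condition: Callable[[Dict[str, object]], bool]) -> Set[str]:
    """Capability 2: extract the set of users whose tags satisfy a condition."""
    return {uid for uid, tags in TAGS.items() if condition(tags)}


def in_crowd(user_id: str, crowd: Set[str]) -> bool:
    """Membership check against an already selected crowd."""
    return user_id in crowd


def profile(crowd: Set[str], tag: str) -> Dict[object, int]:
    """Profile analysis: distribution of one tag over the crowd."""
    counts: Dict[object, int] = {}
    for uid in crowd:
        value = query_tag(uid, tag)
        counts[value] = counts.get(value, 0) + 1
    return counts


# Select women with at least one order in the last 30 days.
active_women = select_crowd(lambda t: t["gender"] == "F" and t["orders_30d"] >= 1)
```

The same selected set feeds all three crowd operations: the membership check, the extracted result set, and the per-tag distribution used for profile analysis.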

Tag production follows a multi‑layer data pipeline: raw logs are stored in the ODS layer, refined into detailed tables in the DWD layer, and finally aggregated into the DM layer where all tags are derived. Automation is high for ODS ingestion, partial for DWD, and limited for DM, with ongoing work to increase automation.
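The layered flow can be sketched in a few lines of Python. This is an illustration only: the log format, field names, and tag names are assumptions, and the production pipeline runs on warehouse tables rather than in-process lists.

```python
from collections import defaultdict

# ODS: raw log lines as ingested (the pipe-delimited format is hypothetical).
ods = [
    "u1|order|120.0",
    "u1|order|80.0",
    "u2|view|",
    "bad line",
    "u2|order|35.5",
]


def to_dwd(raw_lines):
    """DWD: parse raw lines into detail records, dropping malformed ones."""
    records = []
    for line in raw_lines:
        parts = line.split("|")
        if len(parts) != 3:
            continue  # cleansing step: skip lines that do not parse
        user_id, event, amount = parts
        records.append({
            "user_id": user_id,
            "event": event,
            "amount": float(amount) if amount else 0.0,
        })
    return records


def to_dm(records):
    """DM: aggregate detail records into per-user tags."""
    tags = defaultdict(lambda: {"order_count": 0, "order_amount": 0.0})
    for r in records:
        if r["event"] == "order":
            tags[r["user_id"]]["order_count"] += 1
            tags[r["user_id"]]["order_amount"] += r["amount"]
    return dict(tags)


dm = to_dm(to_dwd(ods))
```

Each layer only reads the layer below it, which is what makes the ODS step easy to automate and the DM step, where tag-specific business logic lives, harder.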

Tags are categorized by timeliness (offline, near‑real‑time, real‑time), granularity (aggregate vs. detail), and business dimension (account attributes, consumption behavior, activity, preferences, asset information).
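One way to picture this three-way classification is as metadata attached to each tag definition. The sketch below is hypothetical: the class names, the `CATALOG` list, and the example tags are invented for illustration and are not taken from the Yanxuan system.

```python
from dataclasses import dataclass
from enum import Enum


class Timeliness(Enum):
    OFFLINE = "offline"
    NEAR_REAL_TIME = "near_real_time"
    REAL_TIME = "real_time"


class Granularity(Enum):
    AGGREGATE = "aggregate"
    DETAIL = "detail"


class Dimension(Enum):
    ACCOUNT = "account attributes"
    CONSUMPTION = "consumption behavior"
    ACTIVITY = "activity"
    PREFERENCE = "preferences"
    ASSET = "asset information"


@dataclass(frozen=True)
class TagDef:
    name: str
    timeliness: Timeliness
    granularity: Granularity
    dimension: Dimension


# Hypothetical catalog entries, one per timeliness class.
CATALOG = [
    TagDef("orders_30d", Timeliness.OFFLINE, Granularity.AGGREGATE, Dimension.CONSUMPTION),
    TagDef("last_viewed_item", Timeliness.REAL_TIME, Granularity.DETAIL, Dimension.ACTIVITY),
    TagDef("coupon_balance", Timeliness.NEAR_REAL_TIME, Granularity.AGGREGATE, Dimension.ASSET),
]


def tags_by(timeliness: Timeliness):
    """Return the names of all catalog tags with the given timeliness."""
    return [t.name for t in CATALOG if t.timeliness is timeliness]
```

Carrying the classification as explicit metadata lets a router decide, per tag, which production pipeline and which store should serve it.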

Storage requirements include high‑performance query, SQL support, update capability, large‑scale data handling, extensible functions, and tight integration with the big‑data ecosystem. Version 1 used a mix of Hive, HBase, Kudu, Elasticsearch, and Redis, leading to complexity and data‑consistency risks.

Version 2 consolidates storage around Apache Doris: offline data stays in Hive, base tags are imported into Doris, and real‑time data is also written to Doris. Spark runs joint queries over Hive and Doris, and the results are cached in Redis. Most queries return within 20 ms, and operational overhead is lower than with the Version 1 stack.
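The caching step follows the familiar cache-aside pattern. The sketch below shows that pattern under loud assumptions: a Python dict stands in for Redis, `slow_joint_query` is a hypothetical placeholder for the Spark query over Hive and Doris, and the TTL value is arbitrary.

```python
import time
from typing import Callable, Dict, Tuple


class QueryCache:
    """Cache-aside wrapper; a dict stands in for Redis, with a per-entry TTL."""

    def __init__(self, backend: Callable[[str], object], ttl_seconds: float = 60.0):
        self._backend = backend
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, object]] = {}
        self.backend_calls = 0

    def get(self, key: str):
        entry = self._store.get(key)
        now = time.monotonic()
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: skip the slow query
        self.backend_calls += 1
        value = self._backend(key)  # slow path: run the joint query
        self._store[key] = (now + self._ttl, value)
        return value


def slow_joint_query(crowd_id: str) -> int:
    """Hypothetical stand-in for a Spark query over Hive + Doris."""
    return len(crowd_id) * 1000


cache = QueryCache(slow_joint_query)
first = cache.get("crowd_42")   # miss: calls the backend
second = cache.get("crowd_42")  # hit: served from the cache
```

Only the first request for a given key pays the cost of the joint query; repeated requests within the TTL are served from the cache.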

High‑performance query techniques include pre‑computed static crowd packs stored in Redis with Lua‑based batch checks, asynchronous queries, short‑circuit evaluation, and optimized join strategies. Custom Doris UDFs enable path analysis for crowd analytics.
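Short-circuit evaluation is the easiest of these techniques to illustrate. The sketch below assumes each condition carries an estimated cost and that a pre-computed static crowd pack is the cheapest check; a Python set stands in for the Redis-backed pack, and the Lua batch check is not reproduced here. All names and cost values are hypothetical.

```python
from typing import Callable, List, Tuple

# Pre-computed static crowd pack (stored in Redis in the article; a set here).
STATIC_PACK = {"u1", "u3"}

# Counter used only to show how often the slow check actually runs.
calls = {"expensive": 0}


def in_static_pack(user_id: str) -> bool:
    """Cheap check: membership in a pre-computed crowd pack."""
    return user_id in STATIC_PACK


def expensive_tag_check(user_id: str) -> bool:
    """Hypothetical stand-in for a slower tag query."""
    calls["expensive"] += 1
    return user_id.endswith("1")


# Each condition is (estimated cost, predicate).
Condition = Tuple[int, Callable[[str], bool]]


def matches_all(user_id: str, conditions: List[Condition]) -> bool:
    """AND over conditions, cheapest first; all() stops at the first False."""
    ordered = sorted(conditions, key=lambda c: c[0])
    return all(pred(user_id) for _, pred in ordered)


conditions = [(100, expensive_tag_check), (1, in_static_pack)]
results = {uid: matches_all(uid, conditions) for uid in ["u1", "u2", "u3"]}
```

Because the cheap pack lookup runs first, the user who is not in the pack (`u2`) never triggers the expensive check, so it runs twice rather than three times.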

Future plans aim to migrate all Hive and Spark workloads to Doris, enhance the tag evaluation framework, improve tag quality, coverage, and production speed, and build richer user analysis models for more precise operations.
