
NetEase Big Data User Profiling: Architecture, Tagging System, and Real‑World Applications

This presentation details NetEase's massive multi‑domain data ecosystem, the design of its user‑profile center—including basic, behavior, preference, and predictive tags—ID‑mapping techniques, quality assurance processes, and several real‑time and offline use cases such as marketing, recommendation, growth operations, advertising, and fraud detection.

DataFunTalk

NetEase operates a massive data ecosystem spanning entertainment, e‑commerce, education and more, with daily active accounts in the hundreds of millions that generate multi‑dimensional user behavior data. The company uses this data to build a comprehensive, domain‑wide user profile that serves numerous internal business scenarios and external commercial solutions.

1. NetEase Data Overview

The user base exceeds hundreds of millions, with daily active accounts in the hundred‑million range.

Rich product lines cover games, education, e‑commerce, media, etc., creating a complex ecosystem.

Tag coverage exceeds 70% among high‑quality users.

Provides thematic solutions for participation, traffic, location, relationships, and more.

2. User‑Profile Center Classification

The profile architecture consists of three layers:

Basic tags (e.g., gender, age, education, device, membership).

Relationship layer (IDMapping) that unifies multiple accounts/devices.

Thematic domains such as geography, social connections, search keywords, and knowledge graphs.

Tag categories include:

Basic tags – static attributes like gender, age, location, occupation, etc.

Behavior tags – actions such as clicks, plays, purchases, comments.

Preference tags – interests in travel, shopping, entertainment, finance, gaming, etc.

Predictive tags – algorithm‑generated predictions (e.g., likely to buy a car).
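
The four tag categories above can be sketched as a small data model. This is only an illustration: the `Tag` class, its `confidence` field, and the sample tag names are hypothetical and do not come from NetEase's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class TagCategory(Enum):
    BASIC = "basic"            # static attributes, e.g. gender, age
    BEHAVIOR = "behavior"      # observed actions, e.g. clicks, purchases
    PREFERENCE = "preference"  # interests, e.g. travel, gaming
    PREDICTIVE = "predictive"  # algorithm outputs, e.g. likely to buy a car


@dataclass(frozen=True)
class Tag:
    name: str
    category: TagCategory
    value: object
    confidence: float = 1.0  # hypothetical field for inferred or predicted tags


def group_by_category(tags):
    """Return {TagCategory: [Tag, ...]}, keeping the input order in each list."""
    grouped = {category: [] for category in TagCategory}
    for tag in tags:
        grouped[tag.category].append(tag)
    return grouped


# Hypothetical profile for a single user.
profile = [
    Tag("gender", TagCategory.BASIC, "female"),
    Tag("purchase_count_30d", TagCategory.BEHAVIOR, 4),
    Tag("interest_travel", TagCategory.PREFERENCE, True, confidence=0.8),
    Tag("likely_car_buyer", TagCategory.PREDICTIVE, True, confidence=0.6),
]
grouped = group_by_category(profile)
```

Keeping the category on each tag lets downstream consumers treat factual tags and model-generated tags differently, for example by requiring a minimum confidence for predictive tags.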

3. IDMapping (Device Unification)

IDMapping links multiple device identifiers to a single user identity using both engineering (SDK) and data‑layer (rule‑based + graph community detection) approaches. Challenges include multi‑device users, device expiration, and noisy data such as borrowed devices or fraudulent accounts.

Each stored association records the ID pair, timestamp, source, and observation frequency; a time‑decay factor down‑weights stale associations.
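
A minimal sketch of how time decay and ID unification could fit together is shown below. It is not the method described in the talk: plain connected components (union-find) stand in for graph community detection, and the exponential half-life, the weight threshold, and the ID naming are all assumptions made for illustration.

```python
import math
from collections import defaultdict

HALF_LIFE_DAYS = 30.0  # assumed half-life for the decay factor
MIN_WEIGHT = 0.5       # assumed threshold below which a link is ignored


def decayed_weight(frequency, age_days, half_life=HALF_LIFE_DAYS):
    """Observation count discounted exponentially by the age of the link."""
    return frequency * math.pow(0.5, age_days / half_life)


def unify_ids(observations, now_day):
    """Group IDs connected by links whose decayed weight is >= MIN_WEIGHT.

    observations: iterable of (id_a, id_b, last_seen_day, source, frequency).
    Returns a sorted list of sorted ID groups.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for id_a, id_b, last_seen_day, _source, frequency in observations:
        root_a, root_b = find(id_a), find(id_b)  # registers both IDs
        if decayed_weight(frequency, now_day - last_seen_day) >= MIN_WEIGHT:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical observations: (id_a, id_b, last_seen_day, source, frequency)
observations = [
    ("phone:A", "account:u1", 100, "sdk", 5),       # fresh, strong link
    ("tablet:B", "account:u1", 95, "sdk", 2),       # recent link
    ("phone:C", "account:u1", 10, "login_log", 1),  # 90 days old, decays away
]
identities = unify_ids(observations, now_day=100)
```

In this example the stale `phone:C` link falls below the threshold, so that device is left in its own group, which is one way to handle expired or borrowed devices.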

4. Real‑Time Full‑Link Recommendation

The real‑time pipeline integrates offline features stored in HBase with online calculations, enabling cold‑start handling, cross‑business data fusion, and personalized recommendations. Combined with knowledge graphs, it tracks user behavior to prevent churn and improve conversion.
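
The combination of offline features and online calculation can be illustrated roughly as follows. A plain dictionary stands in for the HBase lookup, and the feature names, the cold-start defaults, and the `build_features` function are hypothetical.

```python
# Stand-in for offline features; the talk stores these in HBase.
OFFLINE_FEATURES = {
    "u1": {"pref_gaming": 0.9, "pref_travel": 0.2},
}

# Hypothetical neutral defaults for users with no offline profile yet.
COLD_START_DEFAULTS = {"pref_gaming": 0.5, "pref_travel": 0.5}


def build_features(user_id, session_events, offline=OFFLINE_FEATURES):
    """Merge offline profile features with counts from the live session."""
    features = dict(offline.get(user_id, COLD_START_DEFAULTS))
    features["is_cold_start"] = user_id not in offline
    # Online part: count each event type seen in the current session.
    for event in session_events:
        key = f"session_{event}_count"
        features[key] = features.get(key, 0) + 1
    return features


known_user = build_features("u1", ["click", "click", "play"])
new_user = build_features("u_new", ["click"])
```

For a user without offline features, the sketch falls back to defaults and flags the request as cold start, so a downstream ranker could lean more on the session signals.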

5. Application Scenarios

Marketing – audience segmentation and insight generation.

Search & Recommendation – providing data to algorithm teams.

Growth Operations – supporting user research and data‑driven operations.

Advertising – enabling precise audience targeting.

Intelligent Fraud Detection – identifying abnormal users, preventing abuse, and improving risk detection by ~6%.
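
Audience segmentation for marketing or advertising amounts to filtering profiles by tag conditions. The sketch below assumes a simple in-memory mapping of user IDs to tag values; the `select_audience` function and the sample tags are hypothetical.

```python
def select_audience(profiles, required):
    """Return sorted user IDs whose tags match every (name, value) in required."""
    return sorted(
        user_id
        for user_id, tags in profiles.items()
        if all(tags.get(name) == value for name, value in required.items())
    )


# Hypothetical profiles keyed by user ID.
profiles = {
    "u1": {"gender": "female", "interest_travel": True},
    "u2": {"gender": "male", "interest_travel": True},
    "u3": {"gender": "female", "interest_travel": False},
}
audience = select_audience(profiles, {"gender": "female", "interest_travel": True})
```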

6. Quality Assurance & Governance

Assign a primary owner for each tag to handle business requests and anomalies.

End‑to‑end workflow optimization to accelerate tag review and standardization.

Pre‑release testing and monitoring of tag definitions, enumeration ranges, and quality metrics.

Platform‑based tag lifecycle management, tooling, and continuous iteration.
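
A pre-release check on enumeration ranges and coverage could look roughly like the sketch below. The `check_tag_quality` function, the report fields, and the 0.7 coverage threshold are assumptions for illustration, not NetEase's actual tooling.

```python
def check_tag_quality(values, allowed, min_coverage):
    """Check one tag before release.

    values: {user_id: tag value, or None when the tag is missing}
    allowed: set of legal enumeration values
    min_coverage: minimum share of users that must have a value
    """
    total = len(values)
    filled = {user: value for user, value in values.items() if value is not None}
    invalid = sorted(user for user, value in filled.items() if value not in allowed)
    coverage = len(filled) / total if total else 0.0
    return {
        "coverage": coverage,
        "invalid_users": invalid,
        "passed": coverage >= min_coverage and not invalid,
    }


# Hypothetical sample: u4 carries a misspelled enumeration value.
gender = {"u1": "female", "u2": "male", "u3": None, "u4": "unknwon"}
report = check_tag_quality(gender, {"female", "male", "unknown"}, min_coverage=0.7)
```

Here coverage meets the threshold, but the out-of-range value still fails the check, so the tag would be sent back to its owner before release.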

Overall, NetEase's user‑profile middle platform dramatically improves data productivity, consolidates methodology and products, empowers numerous internal scenarios, and explores external commercialization opportunities.
