How NetEase Yanxuan Built a Real‑Time Data Lake to Boost Efficiency
This article explains how NetEase Yanxuan evolved from a traditional data‑warehouse pipeline to a cloud‑native data‑lake architecture, detailing the business challenges, design choices, technology stack (Delta, Iceberg, Hudi), implementation steps, and the resulting gains in real‑time data access, cost reduction, and feature‑engineering support.
In recent years, data‑warehouse and data‑lake solutions have rapidly evolved, blurring the line between them. Cloud‑native next‑generation data architectures no longer follow a single classic model but combine the strengths of both. This article shares Yanxuan's data‑lake construction process and insights.
1. Business Background
Since mid‑2017, NetEase Yanxuan has built its own big‑data system, supporting almost all business scenarios such as analytics, search, recommendation, advertising, supply‑chain, risk control, product development, and quality control. As data dependence grew, three major problems emerged:
Low data-operation efficiency: heavy reliance on predefined data-warehouse models leads to high development and iteration costs, slowing innovation.
High maintenance overhead: frequent schema changes driven by rapid business iteration are costly to absorb.
Lack of reliable near-real-time mirror tables where the business needs them.
These issues required continuous iteration of the data architecture.
1.1 Data Architecture (Before Iteration)
The diagram shows a typical big‑data pipeline: business systems generate data, which is transmitted and integrated, then processed for analytics, BI, or algorithmic services such as recommendation, search, advertising, and risk control.
Two major pipelines exist: a data‑warehouse flow (pre‑defined schema, cleaned and transformed data for reporting) and a machine‑learning flow (raw data for feature extraction and model inference).
1.2 Current Situation & Goals
To address the problems, Yanxuan evaluated new technologies against four goals:
Problem solving – does the solution bring new capabilities to the data/algorithm system?
Efficiency improvement – how much does it boost operational efficiency?
Cost reduction – can storage, compute, or usage costs be lowered?
Stable rollout – can the solution be deployed at scale without affecting existing services?
The aim is to achieve more real‑time data capabilities without extra storage cost, easing T+1 batch pressure and providing timely features for model training.
2. Is a Data Lake the Solution?
Data‑lake concepts have proliferated, often causing confusion. Fundamentally, a data lake offers higher flexibility: storage format and structure need not be predefined, allowing both structured and semi‑structured data. This contrasts with data‑warehouse designs that enforce strict schemas and modeling.
While both have advantages, they are not mutually exclusive.
2.1 Data Lake vs. Data Warehouse
Data lakes prioritize flexibility and raw‑data accessibility, whereas warehouses prioritize standardized data management through predefined schemas.
2.2 Advantages of a Data Lake
Yanxuan identified two key benefits:
Improved data‑development efficiency: exploratory analysis does not require costly warehouse model creation.
Preservation of raw information: warehouse models inevitably discard some data, which is essential for machine‑learning and deep analysis.
3. Practical Implementation
3.1 Data Integration
Initially, data was loaded in bulk nightly. As volume grew, a V2.0 solution introduced incremental merge for near‑real‑time sync, reducing ODS generation time to about one hour.
However, the V2.0 approach still faced three issues:
Insufficient real‑time performance for large tables.
High storage consumption due to space‑for‑time trade‑offs.
Lack of ACID support, causing temporary unavailability during updates.
Adopting open‑source table formats (Delta Lake, Iceberg, Hudi) resolved these problems. Yanxuan initially chose Delta Lake for its ACID guarantees, and later added Iceberg support once Iceberg provided row‑level deletes.
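The row‑level merge these table formats provide can be sketched in plain Python. This is a simplified stand‑in, not how Delta Lake or Iceberg actually works internally: `base` plays the role of the current table snapshot keyed by primary key, and `changelog` an ordered stream of CDC events; the real formats apply the same upsert/delete semantics transactionally at the file and manifest level, which is what removes the "temporary unavailability during updates" problem.

```python
def apply_changelog(base: dict, changelog: list) -> dict:
    """Apply an ordered CDC changelog to a snapshot keyed by primary key.

    A toy model of the row-level upsert/delete that ACID table formats
    (Delta Lake, Iceberg, Hudi) perform as an atomic commit.
    """
    snapshot = dict(base)  # work on a copy: the old snapshot stays readable
    for event in changelog:
        op, key, row = event["op"], event["key"], event.get("row")
        if op in ("insert", "update"):
            snapshot[key] = row          # upsert by primary key
        elif op == "delete":
            snapshot.pop(key, None)      # row-level delete
    return snapshot  # "committed" by atomically swapping in the new snapshot

base = {1: {"sku": "A", "qty": 5}}
changelog = [
    {"op": "update", "key": 1, "row": {"sku": "A", "qty": 7}},
    {"op": "insert", "key": 2, "row": {"sku": "B", "qty": 1}},
    {"op": "delete", "key": 1},
]
print(apply_changelog(base, changelog))  # {2: {'sku': 'B', 'qty': 1}}
```

Because readers always see either the old snapshot or the fully merged new one, downstream jobs never observe a half-applied batch, which is the ACID property the V2.0 pipeline lacked.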
3.2 Data‑Warehouse Construction
With a more reliable data lake, Yanxuan built an ODS layer that provides near‑real‑time data access, achieving:
Average sync latency of roughly one second, and within minutes even for large tables.
About a 70% reduction in compute and storage costs.
Zero downstream failures, thanks to ACID guarantees.
3.3 Feature Engineering
The data lake also powers machine‑learning feature pipelines. Previously, online/offline feature inconsistency and offline training delays of more than 24 hours were recurring problems. By processing features with Flink, writing the results to Redis for online prediction, and appending them to Iceberg tables for training, Yanxuan achieved real‑time feature availability.
The feature store abstracts storage, offering unified SDKs for both online and offline tasks.
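One way such a unified SDK can be structured is sketched below in plain Python. All names here (`FeatureStore`, `put`, `get_online`, `get_offline`) are hypothetical, not Yanxuan's actual API: an in-memory dict stands in for Redis (latest value per key, for online prediction) and an append-only list stands in for the Iceberg table (full history, for offline training); the point is that a single write path feeds both views, which is what removes online/offline inconsistency.

```python
import time

class FeatureStore:
    """Toy feature store with a unified write path: every feature write
    updates an online KV view (stand-in for Redis) and appends to an
    offline log (stand-in for an Iceberg table)."""

    def __init__(self):
        self._online = {}    # latest value per (entity, feature)
        self._offline = []   # append-only history of all writes

    def put(self, entity_id, feature, value, ts=None):
        ts = ts if ts is not None else time.time()
        self._online[(entity_id, feature)] = value
        self._offline.append({"entity": entity_id, "feature": feature,
                              "value": value, "ts": ts})

    def get_online(self, entity_id, feature):
        """Latest value, as an online prediction service would read it."""
        return self._online.get((entity_id, feature))

    def get_offline(self, entity_id, feature):
        """Full history, as an offline training job would scan it."""
        return [r for r in self._offline
                if r["entity"] == entity_id and r["feature"] == feature]

store = FeatureStore()
store.put("user_42", "clicks_1h", 3, ts=100)
store.put("user_42", "clicks_1h", 5, ts=160)
print(store.get_online("user_42", "clicks_1h"))                         # 5
print([r["value"] for r in store.get_offline("user_42", "clicks_1h")])  # [3, 5]
```

Because both reads are served from the same write stream, a model trained on the offline history sees the same feature values the online predictor served at each timestamp.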
4. Future Plans
Yanxuan has completed the first phase of data‑lake integration, currently in a gray‑scale rollout supporting five tasks. Future work includes deeper collaboration with algorithm and BI teams to extend real‑time data‑lake capabilities to search, recommendation, risk control, and exploratory analytics, while further optimizing compute and storage engines for complex scenarios.
Yanxuan Tech Team
NetEase Yanxuan Tech Team shares e-commerce tech insights and quality finds for mindful living. This is the public portal for NetEase Yanxuan's technology and product teams, featuring weekly tech articles, team activities, and job postings.
