
Building Real-Time Wide Tables with Partial-Update Using Apache Paimon for NetEase News Recommendation

The article describes how NetEase News' recommendation team replaced a slow, batch‑oriented data‑warehouse pipeline with a Flink‑based, Apache Paimon real‑time wide‑table solution that supports partial updates, reduces latency from hours to minutes, and lowers processing costs while handling both deduplication and non‑deduplication recommendation scenarios.


Background: NetEase News, a leading Chinese media portal, needed a more flexible and timely data warehouse for its recommendation system, which serves personalized content across headlines, videos, comments, and circles. The existing architecture could no longer keep up with the growing diversity and complexity of its data-processing requirements.

Recommendation scenarios: Two business patterns exist: (1) deduplication, where each user device (devid) should be shown a given article (docid) at most once, so devid+docid serves as the primary key; and (2) non-deduplication, where paid or experimental content may be recommended multiple times, requiring the composite key devid+docid+rec_time.
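The two key schemes can be sketched as plain functions; the field names (devid, docid, rec_time) come from the article, while the event dict shape is an assumption for illustration:

```python
# Hypothetical sketch of the two primary-key schemes described above.
def dedup_key(event: dict) -> tuple:
    # Deduplication: each device sees an article at most once,
    # so devid + docid uniquely identifies a wide-table row.
    return (event["devid"], event["docid"])

def non_dedup_key(event: dict) -> tuple:
    # Non-deduplication: the same article may be recommended again,
    # so rec_time is added to keep repeated recommendations distinct.
    return (event["devid"], event["docid"], event["rec_time"])
```

Two recommendations of the same article to the same device collide under `dedup_key` but stay distinct under `non_dedup_key`.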

Original pipeline: Recommendation data (rec) and user‑behavior data (sdk) were written to HDFS via an internal datastream tool, then joined offline with Spark on an hourly basis, resulting in a visibility delay of H+1 hour and sub‑optimal data quality.

New solution: The team rebuilt the pipeline with Flink and Apache Paimon, enabling near‑real‑time writes and partial‑update capabilities. This reduced the wide‑table latency from hour‑level to minute‑level (depending on Flink checkpoint intervals) and streamlined the processing chain, saving storage and compute resources.
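The partial-update behavior that makes this work can be sketched in a few lines; this is a simplified model of how a partial-update merge engine like Paimon's combines rows sharing a primary key (the merge rule shown, where null fields do not overwrite existing values, is an assumption matching Paimon's default partial-update semantics, not NetEase's exact configuration):

```python
# Minimal sketch of partial-update merge semantics: rows with the same
# primary key are merged field by field, and None values do not
# overwrite fields that an earlier stream has already filled in.
def partial_update(table: dict, key: tuple, row: dict) -> None:
    merged = dict(table.get(key, {}))
    for field, value in row.items():
        if value is not None:
            merged[field] = value
    table[key] = merged

wide = {}
# The rec stream writes recommendation fields; behavior fields unknown.
partial_update(wide, ("d1", "a1"), {"rec_time": 100, "clicked": None})
# The sdk stream later fills in behavior fields without erasing rec fields.
partial_update(wide, ("d1", "a1"), {"rec_time": None, "clicked": True})
```

Because each stream only needs to supply its own columns, the wide row assembles itself incrementally as events arrive, rather than waiting for an hourly batch join.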

Deduplication implementation: The primary key remains devid+docid. Rec and sdk streams are written to Paimon tables, and a left‑join is performed in Flink, allowing the wide table to be updated incrementally. The approach improves timeliness and simplifies the data flow.

Non‑deduplication implementation: Because rec_time is unavailable in the sdk stream, the latest rec_time for each devid+docid is cached in Flink map‑state. SDK events retrieve this timestamp from the cache to construct the full key, tolerating occasional inaccuracies under high load.
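The caching step above can be sketched as follows; the class and method names are hypothetical stand-ins, and the plain dict stands in for Flink's keyed map-state:

```python
# Hedged sketch of the rec_time cache described above: rec events store
# the latest rec_time per (devid, docid); sdk events look it up to
# reconstruct the full non-deduplication key devid+docid+rec_time.
class RecTimeCache:
    def __init__(self):
        self._state = {}  # stands in for Flink map-state

    def on_rec(self, devid: str, docid: str, rec_time: int) -> None:
        # Keep only the latest rec_time for this devid+docid.
        self._state[(devid, docid)] = rec_time

    def on_sdk(self, devid: str, docid: str):
        rec_time = self._state.get((devid, docid))
        if rec_time is None:
            return None  # no matching recommendation seen yet
        return (devid, docid, rec_time)
```

Since only the latest rec_time is kept, an sdk event arriving between two rapid recommendations of the same article can pick up the newer timestamp, which is the occasional inaccuracy the article says the design tolerates.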

Overall benefits: Introducing Apache Paimon achieved (1) real‑time data visibility (minute‑level), (2) reduced resource consumption and cost, and (3) a simpler, more maintainable architecture.

Additional optimization – recommendation digitization: The team replaced a heavy hourly batch job that summed metrics across dimensions at 20k+ QPS with Paimon's native aggregation capability, achieving lower latency and cost while still delivering hourly DWS tables for downstream analysis.
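The aggregation step can be modeled simply; this is an illustrative sketch in the spirit of an aggregation merge engine (the dimension key and metric names are hypothetical, not taken from the article):

```python
# Illustrative sketch of aggregation-style merging: metric columns for
# rows sharing the same dimension key are combined with a sum on write,
# so the table always holds running totals per dimension.
def aggregate(table: dict, key: tuple, metrics: dict) -> None:
    current = table.setdefault(key, {})
    for name, value in metrics.items():
        current[name] = current.get(name, 0) + value

dws = {}
aggregate(dws, ("news", "hour_10"), {"exposure": 3, "click": 1})
aggregate(dws, ("news", "hour_10"), {"exposure": 2, "click": 0})
```

Pushing the sum into the storage layer's merge step means no separate hourly batch job has to re-scan and re-aggregate the raw events.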

Future plans: Implement data‑eviction policies for partially‑updated wide tables to discard irrelevant rows, and leverage Paimon's cross‑partition update feature to eliminate data drift at partition boundaries.

Tags: Flink, real-time analytics, recommendation system, data lake, wide table, Apache Paimon, partial update
Written by Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies