Flink-based Real-time Data Warehouse Practice at Yanxuan
This talk presents Yanxuan’s real‑time data warehouse built on Flink, covering background challenges, overall architecture and implementation, data quality measures, monitoring, and practical application scenarios, while highlighting design goals of flexibility, high development efficiency, and stringent data quality requirements.
The speaker, Yang Xiong, a senior R&D engineer at NetEase Yanxuan, shares the practice of building a real‑time data warehouse (DW) for Yanxuan using Apache Flink.
1. Background
The project started in the second half of 2017 and had to address three main challenges: (1) a long, rapidly changing business chain that generates a wide range of data domains; (2) increasing demand for real‑time data to support business decisions and interactive activities; and (3) higher data‑quality requirements because data directly influences operational decisions.
The design goals derived from these challenges are flexibility and scalability, high development efficiency, and high data‑quality standards.
2. Overall Design and Implementation
The real‑time DW architecture follows the data flow and is divided into several layers:
ODS (Operational Data Store) layer: collects raw data from various business systems via ingestion tools and stores it in Kafka.
DWD (Data Warehouse Detail) layer: processes raw data with Flink, performs enrichment, and writes results to appropriate storage.
DIM (Dimension) layer: stores dimension data for high‑concurrency queries, typically in HBase.
DM (Data Mart) layer: aggregates metrics for analysis and downstream applications.
Data moves from ODS to DWD, where Flink, together with a real‑time compute engine, performs the transformations. Processed data is written back to Kafka for downstream consumption and is also stored in different media according to usage patterns (Redis, MySQL, HBase, Greenplum, etc.). The system also includes services for unified query and metric management.
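The layered flow described above can be illustrated with a minimal, framework-free sketch. It is not Yanxuan's implementation and does not use Flink, Kafka, or HBase: in-memory lists and dicts stand in for them, and every name (ods_orders, dim_sku, to_dwd, to_dm, the field names) is a hypothetical chosen for illustration.

```python
from collections import defaultdict

# Hypothetical ODS records, standing in for raw order events read from Kafka.
ods_orders = [
    {"order_id": 1, "sku_id": "A", "amount": 30.0},
    {"order_id": 2, "sku_id": "B", "amount": 12.5},
    {"order_id": 3, "sku_id": "A", "amount": 7.5},
]

# Hypothetical DIM table, standing in for a dimension lookup (e.g. in HBase).
dim_sku = {
    "A": {"category": "home"},
    "B": {"category": "food"},
}


def to_dwd(record, dim):
    """DWD step: enrich one raw record with its dimension attributes."""
    attrs = dim.get(record["sku_id"], {"category": "unknown"})
    return {**record, **attrs}


def to_dm(dwd_records):
    """DM step: aggregate detail records into sales per category."""
    totals = defaultdict(float)
    for rec in dwd_records:
        totals[rec["category"]] += rec["amount"]
    return dict(totals)


dwd_orders = [to_dwd(r, dim_sku) for r in ods_orders]
dm_sales_by_category = to_dm(dwd_orders)
print(dm_sales_by_category)  # {'home': 37.5, 'food': 12.5}
```

In a real deployment the enrichment and aggregation would run as streaming operators rather than list comprehensions, but the separation of responsibilities between layers is the same.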
The design adopts a four‑layer model (ODS, DWD, DIM, DM) and defines five subject domains—product, traffic, transaction, marketing, and warehousing—resulting in 25 models and 135 online tasks. Model reuse, especially in the transaction domain, accelerates development (a simple model can be built in one day).
3. Data Quality
Data quality is addressed through consistency and monitoring. Consistency ensures that real‑time and offline pipelines share the same modeling method, domain design, data ingestion, and metric definitions. Monitoring covers task failures, RPS anomalies, and Kafka latency, with alerts routed through a duty‑rotation process.
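As a rough sketch of the three monitored conditions (task failure, RPS anomaly, Kafka latency) and the duty rotation, the following code evaluates one task snapshot against simple thresholds. The talk does not describe the actual rules, so the TaskSnapshot fields, the threshold defaults, and the round-robin rotation are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskSnapshot:
    name: str
    running: bool
    rps: float                # current records per second
    baseline_rps: float       # expected records per second (assumed known)
    kafka_lag_seconds: float  # consumer delay behind the latest Kafka data


def check_task(snap, rps_tolerance=0.5, max_lag_seconds=60.0):
    """Return the alert names raised by one snapshot (thresholds are assumed)."""
    alerts = []
    if not snap.running:
        alerts.append("task_failed")
    if snap.baseline_rps > 0:
        deviation = abs(snap.rps - snap.baseline_rps) / snap.baseline_rps
        if deviation > rps_tolerance:
            alerts.append("rps_anomaly")
    if snap.kafka_lag_seconds > max_lag_seconds:
        alerts.append("kafka_latency")
    return alerts


def on_call(rotation, day_index):
    """Pick the engineer on duty with a simple round-robin rotation."""
    return rotation[day_index % len(rotation)]


healthy = TaskSnapshot("dwd_order_detail", True, 95.0, 100.0, 3.0)
degraded = TaskSnapshot("dm_sales", True, 10.0, 100.0, 120.0)
print(check_task(healthy))   # []
print(check_task(degraded))  # ['rps_anomaly', 'kafka_latency']
```

A relative deviation from a baseline is only one way to define an RPS anomaly; a production rule would likely account for daily traffic patterns.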
Real‑time data lineage is being built to trace dependencies from ODS through DIM to DM, enabling impact analysis for data or task changes.
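Impact analysis over lineage amounts to a graph traversal: given a changed table or task, find everything downstream of it. The sketch below uses a breadth-first search over a hand-written adjacency map. The table names and the graph itself are hypothetical, and the talk does not say how Yanxuan stores or queries its lineage.

```python
from collections import deque

# Hypothetical lineage: each key lists the models that consume it directly.
lineage = {
    "ods.order_raw": ["dwd.order_detail"],
    "dim.sku": ["dwd.order_detail"],
    "dwd.order_detail": ["dm.sales_by_category", "dm.hot_products"],
    "dm.sales_by_category": [],
    "dm.hot_products": [],
}


def downstream(graph, start):
    """Return every model affected by a change to `start`, nearest first."""
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(downstream(lineage, "dim.sku"))
# ['dwd.order_detail', 'dm.sales_by_category', 'dm.hot_products']
```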
4. Application Scenarios
The real‑time DW supports three major scenario categories: data products, online operational activities, and business back‑office systems. Over 84 online models (110+ total) deliver sub‑10‑second latency, with dashboard data often refreshed within milliseconds.
Typical use cases include real‑time sales dashboards, hot‑product rankings, activity‑driven user consumption rankings, resource‑slot optimization, warehouse capacity monitoring, logistics timeliness, inventory warnings, and product change notifications.
5. Outlook
Future work focuses on three areas: (1) performance – migrating MySQL models to Elasticsearch and moving dimension tables to Redis; (2) development efficiency – consolidating SQL and API development, extending SQL with UDFs; (3) data quality – establishing generic validation tools and standards to improve decision‑making accuracy.
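One possible shape for a generic validation tool is a reconciliation check that compares real-time metrics with their offline counterparts and reports those that differ by more than a tolerance. This is a sketch under assumptions, not the tool the talk has in mind: the function name compare_metrics, the metric names, and the 1% relative tolerance are all hypothetical.

```python
def compare_metrics(realtime, offline, rel_tolerance=0.01):
    """Return {metric: (realtime_value, offline_value)} for every mismatch.

    A metric is a mismatch if it is missing on either side or if its relative
    difference from the offline value exceeds rel_tolerance.
    """
    mismatches = {}
    for key in sorted(set(realtime) | set(offline)):
        rt, off = realtime.get(key), offline.get(key)
        if rt is None or off is None:
            mismatches[key] = (rt, off)
            continue
        denom = max(abs(off), 1e-9)  # avoid division by zero
        if abs(rt - off) / denom > rel_tolerance:
            mismatches[key] = (rt, off)
    return mismatches


realtime_metrics = {"gmv": 100.0, "orders": 10}
offline_metrics = {"gmv": 100.5, "orders": 12}
print(compare_metrics(realtime_metrics, offline_metrics))  # {'orders': (10, 12)}
```

Treating the offline value as the reference is a design choice that fits the consistency goal of sharing metric definitions between the two pipelines.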
Overall, the Flink‑based real‑time data warehouse meets Yanxuan’s design goals of flexibility, rapid development, and high data quality, while supporting a wide range of business applications.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.