Building Yuewen Group's Overseas Big Data Platform: Architecture, Offline & Real-Time Processing
This article details how Yuewen Group designed and implemented its overseas big data platform. It covers the overall system architecture, offline data-warehouse construction with dimensional modeling, real-time streaming with Oceanus and ClickHouse, and future plans for cost reduction and data quality assurance.
Yuewen Group, China's largest online literature company, operates multiple brands and has been expanding internationally since 2017, serving creators and readers in more than 200 countries. That overseas growth depends on a robust big data platform, described here in three parts: system architecture, platform construction, and application scenarios.
Overall System Architecture
The overseas big data system consists of offline and real‑time components. Business data sources are processed by either offline or real‑time platforms and then consumed by downstream applications.
Offline Computing Platform
Data Warehouse Architecture
The offline data warehouse addresses multi-source heterogeneity and data quality issues, lowering the barrier to using data and improving efficiency.
It supports effect analysis, problem diagnosis, and alerting.
Offline data is stored in COS (Tencent Cloud Object Storage). Hive, running on Spark or Tez, forms the layered warehouse, and Airflow handles task scheduling.
The core idea is dimensional modeling, comprising warehouse layers, subject‑domain construction, and unified metric management. Clear layering isolates changes, improves data production efficiency, and supports rapid business evolution.
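Because each warehouse layer is built from the one below it, the scheduler has to run tasks in dependency order. The sketch below illustrates that ordering with Python's standard `graphlib` module. It is not Airflow code, and the task names are hypothetical, chosen only to echo the `t_sd` / `t_ed` / `t_md` layer prefixes used in this article.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each task maps to the set of tasks it depends on.
dependencies = {
    "t_sd_reading_log": set(),
    "t_sd_order": set(),
    "t_ed_reading_detail": {"t_sd_reading_log"},
    "t_ed_transaction_detail": {"t_sd_order"},
    "t_md_daily_report": {"t_ed_reading_detail", "t_ed_transaction_detail"},
}


def run_order(deps):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

A scheduler such as Airflow expresses the same idea as a DAG of tasks; a change to one source table then only forces a rerun of the tasks downstream of it.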
Data Warehouse Layers
1. Benefits of Layering the Warehouse
High cohesion, low coupling.
Space‑for‑time trade‑off.
Distributed execution reduces risk.
Simplifies complex problems, enhancing data management efficiency.
Adapts to fast‑growing business with low impact on the application layer.
2. Model Layer Division
Source layer (t_sd): stores raw source data with minimal processing, handling synchronization, structuring, cleaning, and historical retention.
Common model layer (t_ed): stores detailed, dimensional, and aggregated metric data, using dimensional models and wide tables to improve reuse and simplify ad‑hoc queries.
Application model layer (t_md): stores highly aggregated, customized reports and metrics.
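The three layers above can be sketched end to end on a toy dataset. The example uses Python's built-in `sqlite3` purely as a stand-in for Hive; the table names, columns, and sample rows are assumptions made for illustration, not the platform's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Source layer (t_sd): raw events kept close to how they arrived.
cur.execute(
    "CREATE TABLE t_sd_reading_log "
    "(user_id TEXT, book_id TEXT, country TEXT, read_seconds INTEGER, dt TEXT)"
)
cur.executemany(
    "INSERT INTO t_sd_reading_log VALUES (?, ?, ?, ?, ?)",
    [
        ("u1", "b1", "us", 600, "2024-01-01"),
        ("u1", "b2", "US", 300, "2024-01-01"),
        ("u2", "b1", "id", 120, "2024-01-01"),
        ("u3", "b1", None, 50, "2024-01-01"),  # no country: dropped below
    ],
)

# Common model layer (t_ed): cleaned and standardized detail rows.
cur.execute(
    """
    CREATE TABLE t_ed_reading_detail AS
    SELECT user_id, book_id, UPPER(country) AS country, read_seconds, dt
    FROM t_sd_reading_log
    WHERE country IS NOT NULL
    """
)

# Application model layer (t_md): a highly aggregated, report-ready table.
cur.execute(
    """
    CREATE TABLE t_md_country_daily AS
    SELECT dt, country,
           COUNT(DISTINCT user_id) AS readers,
           SUM(read_seconds) AS read_seconds
    FROM t_ed_reading_detail
    GROUP BY dt, country
    """
)

report = cur.execute(
    "SELECT country, readers, read_seconds FROM t_md_country_daily ORDER BY country"
).fetchall()
print(report)
```

The report query touches only the `t_md` table, so a change in how raw logs arrive is absorbed in the lower layers without affecting the report.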
Subject Construction
Subject domains group closely related data topics, abstracting business processes into modules such as user, content, reading, transaction, traffic, and activity.
Unified Metrics
Unified metric systems ensure consistent terminology and definitions, enabling data management, traceability, and reducing duplicate construction.
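One way to picture unified metric management is a single registry that refuses conflicting definitions of the same metric name. The following is a minimal sketch under that assumption; the `Metric` fields, the `MetricRegistry` class, and the sample metric are all hypothetical and do not describe the platform's actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str          # the agreed term, e.g. "dau"
    definition: str    # the agreed business definition
    source_table: str  # where the metric is computed from (traceability)
    owner: str


class MetricRegistry:
    """Holds one definition per metric name and rejects conflicting ones."""

    def __init__(self):
        self._metrics = {}

    def register(self, metric):
        existing = self._metrics.get(metric.name)
        if existing is not None and existing != metric:
            raise ValueError(f"metric {metric.name!r} is already defined differently")
        self._metrics[metric.name] = metric

    def lookup(self, name):
        return self._metrics[name]


registry = MetricRegistry()
registry.register(
    Metric("dau", "distinct users with at least one read event per day",
           "t_ed_reading_detail", "data-team")
)
print(registry.lookup("dau").source_table)
```

Registering the same definition twice is harmless, while a second, different definition of `dau` raises an error, which is the property that prevents duplicate construction.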
Real‑Time Computing Platform
Construction Background
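The existing offline warehouse delivers data on a T+1 basis, meaning data becomes available the day after it is generated. It therefore cannot supply real-time metrics such as live PV/UV (page views and unique visitors) or activity effectiveness, which are needed for real-time and automated decision-making.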
Platform Architecture
To address this, a real‑time streaming pipeline was built on Tencent Cloud, adopting a layered design for the real‑time warehouse.
Component selection:
Oceanus for stream computing, using Flink SQL for simple source/sink configuration; other sources can be connected via Flink JARs.
ClickHouse as the real‑time warehouse with read‑write separation, high write throughput, and fast single‑table queries; materialized views support pre‑computation.
Superset‑based OLAP platform providing permission control, real‑time dashboards, and SQL query tools.
Real‑Time Warehouse Design
A multi‑layer approach embeds processing steps at each layer: the detail layer handles filtering, cleaning, standardization, and masking; the aggregation layer produces multi‑dimensional metric summaries, improving code reuse and production efficiency.
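The split between the detail layer and the aggregation layer can be sketched in plain Python. This is an illustration of the idea only, not Flink or ClickHouse code: the event fields, the one-minute window, and the use of a truncated SHA-256 hash for masking are all assumptions.

```python
import hashlib
from collections import defaultdict


def to_detail(event):
    """Detail layer: filter, clean, standardize, and mask one raw event."""
    if not event.get("user_id") or "ts" not in event:
        return None  # filter out events that cannot be attributed
    return {
        # Mask the user identifier (illustrative; not a vetted scheme).
        "user": hashlib.sha256(event["user_id"].encode()).hexdigest()[:16],
        # Standardize the country code.
        "country": (event.get("country") or "unknown").strip().upper(),
        # Clean stray whitespace from the page path.
        "page": event.get("page", "").strip(),
        "ts": int(event["ts"]),
    }


def aggregate(details, window_seconds=60):
    """Aggregation layer: PV and UV per (window start, country)."""
    pv = defaultdict(int)
    users = defaultdict(set)
    for d in details:
        key = (d["ts"] - d["ts"] % window_seconds, d["country"])
        pv[key] += 1
        users[key].add(d["user"])
    return {key: {"pv": pv[key], "uv": len(users[key])} for key in pv}


raw_events = [
    {"user_id": "u1", "country": " us ", "page": "/book/1", "ts": 0},
    {"user_id": "u1", "country": "US", "page": "/book/2", "ts": 30},
    {"user_id": "u2", "country": "us", "page": "/book/1", "ts": 59},
    {"user_id": "", "country": "us", "page": "/book/1", "ts": 10},  # filtered
    {"user_id": "u3", "country": "id", "page": "/book/1", "ts": 61},
]
details = [d for d in map(to_detail, raw_events) if d is not None]
summary = aggregate(details)
print(summary)
```

Because every summary is computed from the same cleaned detail records, new aggregations can reuse the detail layer instead of repeating the cleaning logic.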
Tencent Cloud Oceanus
Oceanus is a fully managed cloud stream‑processing service built on Apache Flink, enabling rapid construction of click‑stream analysis, e‑commerce recommendation, and IoT applications without managing infrastructure.
[Figures omitted: source configuration, sink configuration, and monitoring.]
Cloud Database ClickHouse
ClickHouse provides a managed, scalable MPP data‑warehouse service with high query performance, enabling rapid real‑time analytics.
When building a real-time warehouse with ClickHouse, a read-write separation strategy is used. Data is written in batches to local tables on the nodes of the cluster (the ODS layer). Distributed tables are then created over those local tables to serve the query layers, with materialized views or plain views added as needed.
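The write path and read path described above can be modeled with a small in-memory simulation. It does not call ClickHouse: the `BatchWriter` class, the round-robin shard choice, the batch size, and the shard names are assumptions used only to show how batched writes to local tables differ from a fan-out read across them.

```python
class BatchWriter:
    """Buffers rows and flushes them in batches, rotating across shards."""

    def __init__(self, shards, batch_size, send):
        self.shards = shards
        self.batch_size = batch_size
        self.send = send  # callable(shard, rows); a real one would run an INSERT
        self.buffer = []
        self._next = 0

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        shard = self.shards[self._next % len(self.shards)]
        self._next += 1
        self.send(shard, self.buffer)
        self.buffer = []


# Stand-ins for the per-node local tables.
local_tables = {"shard-1": [], "shard-2": []}
writer = BatchWriter(
    list(local_tables),
    batch_size=3,
    send=lambda shard, rows: local_tables[shard].extend(rows),
)
for i in range(7):
    writer.write({"event_id": i})
writer.flush()  # push the final partial batch


def query_all(tables):
    """Read path: fan out to every local table and merge the results."""
    merged = (row for rows in tables.values() for row in rows)
    return sorted(merged, key=lambda r: r["event_id"])


rows = query_all(local_tables)
print(len(local_tables["shard-1"]), len(local_tables["shard-2"]), len(rows))
```

Writers only ever touch local tables in a few large inserts, while readers go through the merged view, so heavy queries and ingestion do not contend on the same entry point.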
Data Application Scenarios
Offline and real‑time data support several downstream applications:
Report platform for customized report displays.
OLAP platform for visual drag‑and‑drop dashboards.
SQL query platform for ad‑hoc queries.
Future Planning
Reduce costs and improve efficiency by unifying data-table lifecycle management and optimizing how storage and compute scale.
Meet data SLAs by building data-quality monitoring for timeliness and accuracy.
Continue building capabilities such as batch‑stream integration to enrich the platform.
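As an illustration of the planned data-quality monitoring, timeliness and accuracy checks might look like the sketch below. The function names, the 08:00 deadline, and the 0.1% row-count tolerance are assumptions for the example, not the team's stated thresholds.

```python
from datetime import datetime


def check_timeliness(finished_at, deadline):
    """Timeliness: the table must be ready no later than its SLA deadline."""
    return finished_at <= deadline


def check_accuracy(source_rows, warehouse_rows, tolerance=0.001):
    """Accuracy: row counts must agree within a relative tolerance."""
    if source_rows == 0:
        return warehouse_rows == 0
    return abs(source_rows - warehouse_rows) / source_rows <= tolerance


deadline = datetime(2024, 1, 2, 8, 0)
on_time = check_timeliness(datetime(2024, 1, 2, 6, 30), deadline)
late = check_timeliness(datetime(2024, 1, 2, 9, 15), deadline)
accurate = check_accuracy(1_000_000, 999_500)
inaccurate = check_accuracy(1_000_000, 990_000)
print(on_time, late, accurate, inaccurate)
```

Checks like these would typically run right after each scheduled task, so a late or incomplete table raises an alert before downstream reports consume it.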
Yuewen Technology
The Yuewen Group tech team supports and powers services like QQ Reading, Qidian Books, and Hongxiu Reading. This account targets internet developers, sharing high‑quality original technical content. Follow us for the latest Yuewen tech updates.
