
Building Yuewen Group’s Overseas Big Data Platform: Architecture, Offline & Real‑Time Processing

This article details how Yuewen Group designed and implemented its overseas big data platform. It covers the overall system architecture, offline data‑warehouse construction with dimensional modeling, real‑time streaming with Oceanus and ClickHouse, and future plans for cost reduction and data quality assurance.

Yuewen Technology

Yuewen Group, the largest online literature company in China, operates multiple brands and has been expanding its international business since 2017, serving creators and readers in over 200 countries. The rapid growth of its overseas operations relies on a robust big data platform, which this article describes in three parts: system architecture, platform construction, and application scenarios.

Overall System Architecture

The overseas big data system consists of offline and real‑time components. Business data sources are processed by either offline or real‑time platforms and then consumed by downstream applications.

Offline Computing Platform

Data Warehouse Architecture

The offline data warehouse addresses multi‑source heterogeneity and data quality issues, lowering the barrier to using data and improving efficiency.

It provides capabilities for effect analysis, problem diagnosis, and alerting.

Offline data is stored in COS (Tencent Cloud Object Storage). Hive, with Spark and Tez as compute engines, forms the layered warehouse, and Airflow handles task scheduling.

The core idea is dimensional modeling, comprising warehouse layers, subject‑domain construction, and unified metric management. Clear layering isolates changes, improves data production efficiency, and supports rapid business evolution.

Data Warehouse Layers

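1. Why Layer the Data Warehouse

High cohesion, low coupling: each layer has a single, well‑defined responsibility.

Space‑for‑time trade‑off: precomputed intermediate tables cost storage but speed up downstream queries.

Step‑by‑step execution reduces risk: a failure affects one step rather than the whole pipeline.

Simplifies complex problems and makes data management more efficient.

Adapts to a fast‑growing business while keeping the impact on the application layer low.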

2. Model Layer Division

Source layer (t_sd): stores raw source data with minimal processing, handling synchronization, structuring, cleaning, and historical retention.

Common model layer (t_ed): stores detailed, dimensional, and aggregated metric data, using dimensional models and wide tables to improve reuse and simplify ad‑hoc queries.

Application model layer (t_md): stores highly aggregated, customized reports and metrics.
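The layer prefixes above lend themselves to a simple automated check. The following is a minimal sketch, not the team's actual tooling. It assumes a rule the article does not state, namely that a table may read only from its own layer or the layer directly below it, and it assumes table names take the form prefix plus underscore plus name. All table names in the usage note are hypothetical.

```python
from typing import Optional

# Layer prefixes from the article, ordered from lowest to highest.
LAYERS = ["t_sd", "t_ed", "t_md"]


def layer_of(table: str) -> Optional[str]:
    """Return the layer prefix of a table name, or None if it has none."""
    for prefix in LAYERS:
        if table.startswith(prefix + "_"):
            return prefix
    return None


def is_allowed_dependency(downstream: str, upstream: str) -> bool:
    """Assumed rule: a table may read from its own layer or the one below."""
    down, up = layer_of(downstream), layer_of(upstream)
    if down is None or up is None:
        return False
    gap = LAYERS.index(down) - LAYERS.index(up)
    return gap in (0, 1)
```

With this rule, is_allowed_dependency("t_ed_reading_detail", "t_sd_reading_log") is accepted, while a source-layer table reading from "t_md_daily_report" is rejected. A check like this could run before a scheduled task is deployed.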

Subject Construction

Subject domains group closely related data topics, abstracting business processes into modules such as user, content, reading, transaction, traffic, and activity.

Unified Metrics

A unified metric system ensures consistent terminology and definitions, which makes metrics easier to manage and trace and reduces duplicate construction.
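One way to picture unified metric management is a registry that holds exactly one definition per metric name. The sketch below is illustrative only; the class names, fields, and example values are assumptions, not the platform's actual design.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Metric:
    name: str          # canonical metric name, e.g. "uv"
    definition: str    # human-readable business definition
    source_table: str  # table the metric is computed from (traceability)
    owner: str         # team responsible for the definition


class MetricRegistry:
    """Holds one definition per metric name and rejects conflicting ones."""

    def __init__(self) -> None:
        self._metrics: Dict[str, Metric] = {}

    def register(self, metric: Metric) -> None:
        existing = self._metrics.get(metric.name)
        if existing is not None and existing != metric:
            raise ValueError(f"metric {metric.name!r} is already defined differently")
        self._metrics[metric.name] = metric

    def lookup(self, name: str) -> Metric:
        return self._metrics[name]
```

Registering the same definition twice is harmless, while a second, different definition of "uv" raises an error. That is the behavior that prevents duplicate and inconsistent metrics.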

Real‑Time Computing Platform

Construction Background

The existing offline warehouse delivers data at T+1 (next‑day) latency, so it cannot supply real‑time figures such as live PV/UV (page views and unique visitors) or activity effectiveness, which are needed for real‑time and automated decision‑making.

Platform Architecture

To address this, a real‑time streaming pipeline was built on Tencent Cloud, adopting a layered design for the real‑time warehouse.

Component selection:

Oceanus for stream computing, using Flink SQL for simple source/sink configuration; other sources can be connected via Flink JARs.

ClickHouse as the real‑time warehouse with read‑write separation, high write throughput, and fast single‑table queries; materialized views support pre‑computation.

Superset‑based OLAP platform providing permission control, real‑time dashboards, and SQL query tools.

Real‑Time Warehouse Design

A multi‑layer approach embeds processing steps at each layer: the detail layer handles filtering, cleaning, standardization, and masking; the aggregation layer produces multi‑dimensional metric summaries, improving code reuse and production efficiency.
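The division of work between the two layers can be illustrated in plain Python. This is a conceptual sketch only: in the platform described here, the logic would run in Oceanus and ClickHouse rather than in Python. The event fields (user_id, page, country) are hypothetical, and SHA-256 hashing is just one possible masking choice that the article does not specify.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def to_detail(raw: dict) -> Optional[dict]:
    """Detail layer: clean, filter, standardize, and mask one raw event."""
    # Cleaning: strip stray whitespace from identifiers.
    user_id = str(raw.get("user_id") or "").strip()
    page = str(raw.get("page") or "").strip()
    # Filtering: drop events that lack the fields needed downstream.
    if not user_id or not page:
        return None
    # Standardization: upper-case country codes, default to "UNKNOWN".
    country = str(raw.get("country") or "UNKNOWN").strip().upper()
    # Masking: replace the user id with a one-way hash.
    user_key = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return {"user_key": user_key, "page": page, "country": country}


def aggregate_pv_uv(details: Iterable[dict]) -> Dict[Tuple[str, str], Dict[str, int]]:
    """Aggregation layer: PV and UV per (country, page)."""
    pv = defaultdict(int)
    users = defaultdict(set)
    for d in details:
        key = (d["country"], d["page"])
        pv[key] += 1
        users[key].add(d["user_key"])
    return {key: {"pv": pv[key], "uv": len(users[key])} for key in pv}
```

Because every aggregate is built from the same cleaned detail records, new summaries (for example, per country only) can reuse the detail layer without repeating the cleaning and masking steps.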

Tencent Cloud Oceanus

Oceanus is a fully managed cloud stream‑processing service built on Apache Flink, enabling rapid construction of click‑stream analysis, e‑commerce recommendation, and IoT applications without managing infrastructure.
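To give a feel for what a Flink SQL source and sink definition looks like, the sketch below renders the statements from Python. The article does not say which connectors the team uses, so Apache Flink's built-in datagen and print test connectors stand in for the real source and sink. The table and column names are hypothetical.

```python
from typing import Dict, List, Tuple


def create_table_sql(name: str, columns: List[Tuple[str, str]], options: Dict[str, str]) -> str:
    """Render a Flink SQL CREATE TABLE statement with its connector options."""
    cols = ",\n".join(f"  `{col}` {typ}" for col, typ in columns)
    opts = ",\n".join(f"  '{key}' = '{value}'" for key, value in options.items())
    return f"CREATE TABLE `{name}` (\n{cols}\n) WITH (\n{opts}\n);"


# Hypothetical event schema shared by source and sink.
COLUMNS = [("user_id", "STRING"), ("page", "STRING")]

# 'datagen' generates random rows; it stands in for the real source here.
source_sql = create_table_sql(
    "page_view_source", COLUMNS, {"connector": "datagen", "rows-per-second": "5"}
)

# 'print' writes rows to the task log; it stands in for the real sink here.
sink_sql = create_table_sql("page_view_sink", COLUMNS, {"connector": "print"})

# The job itself is a single INSERT from source to sink.
insert_sql = "INSERT INTO `page_view_sink` SELECT `user_id`, `page` FROM `page_view_source`;"

if __name__ == "__main__":
    print("\n\n".join([source_sql, sink_sql, insert_sql]))
```

In a real job, only the WITH options would change to point at the actual source and sink systems, while the shape of the job stays the same.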

(Screenshots of the Oceanus source configuration, sink configuration, and monitoring pages appeared here in the original article.)

Cloud Database ClickHouse

ClickHouse provides a managed, scalable MPP data‑warehouse service with high query performance, enabling rapid real‑time analytics.

When building a real‑time warehouse with ClickHouse, a read‑write separation strategy is used. Data is written in batches to local tables across the cluster (ODS layer), then distributed tables are created for query layers, with materialized views or views added as needed.
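The local-table, distributed-table, and materialized-view pattern described above might look like the following ClickHouse DDL, held here as Python strings so it can be submitted through whichever client is in use. This is a sketch under assumptions: the cluster name, database name, table names, columns, and the one-minute PV/UV rollup are all hypothetical and do not come from the article.

```python
CLUSTER = "example_cluster"  # hypothetical cluster name
DB = "rt"                    # hypothetical database name

# Write path: batches are inserted into the local MergeTree table on each shard.
LOCAL_TABLE = f"""
CREATE TABLE IF NOT EXISTS {DB}.page_view_local ON CLUSTER {CLUSTER}
(
    event_time DateTime,
    page       String,
    user_id    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (page, event_time)
""".strip()

# Read path: queries go through a Distributed table that fans out to all shards.
DISTRIBUTED_TABLE = f"""
CREATE TABLE IF NOT EXISTS {DB}.page_view_all ON CLUSTER {CLUSTER}
AS {DB}.page_view_local
ENGINE = Distributed({CLUSTER}, {DB}, page_view_local, rand())
""".strip()

# Pre-computation: a materialized view keeps per-minute aggregate states.
MATERIALIZED_VIEW = f"""
CREATE MATERIALIZED VIEW IF NOT EXISTS {DB}.page_view_1m_local ON CLUSTER {CLUSTER}
ENGINE = AggregatingMergeTree
ORDER BY (page, minute)
AS SELECT
    page,
    toStartOfMinute(event_time) AS minute,
    countState() AS pv_state,
    uniqState(user_id) AS uv_state
FROM {DB}.page_view_local
GROUP BY page, minute
""".strip()

DDL_STATEMENTS = [LOCAL_TABLE, DISTRIBUTED_TABLE, MATERIALIZED_VIEW]
```

The view stores aggregate states rather than final numbers, so readers would finish the aggregation at query time with countMerge(pv_state) and uniqMerge(uv_state).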

Data Application Scenarios

Offline and real‑time data support several downstream applications:

Report platform for customized report displays.

OLAP platform for visual drag‑and‑drop dashboards.

SQL query platform for ad‑hoc queries.

Future Planning

Reduce costs and improve efficiency by unifying data‑table lifecycle management and optimizing storage and compute scaling.

Ensure data SLA by building data‑quality monitoring for timeliness and accuracy.

Continue building capabilities such as batch‑stream integration to enrich the platform.

Written by

Yuewen Technology

The Yuewen Group tech team supports and powers services like QQ Reading, Qidian Books, and Hongxiu Reading. This account targets internet developers, sharing high‑quality original technical content. Follow us for the latest Yuewen tech updates.
