
Tencent Oceanus: Evolution, Productization, and Optimizations of Real‑Time Stream Computing with Flink

This article recounts Tencent's journey from adopting Flink to building the Oceanus platform, detailing its architecture, product features, and a series of deep extensions—including UI redesign, JobManager failover, checkpoint handling, enhanced windows, LocalKeyBy, watermark idle detection, and log isolation—aimed at supporting trillion‑scale real‑time data processing.

Big Data Technology & Architecture

Tencent's real‑time computing team provides a high‑performance, stable, and easy‑to‑use streaming data service, handling peaks of 2.1 × 10⁸ events per second, daily volumes of 1.7 × 10¹³ events, and 20 trillion processing operations per day. To meet these demands, Tencent selected Apache Flink as the next‑generation stream engine, heavily customized the community version, and built Oceanus, an end‑to‑end visual real‑time computing platform that integrates development, testing, deployment, and operations.

The talk covers four topics: the evolution of Flink usage at Tencent, the productization of the Oceanus platform, cloud-based streaming services, and the deep extensions and optimizations Tencent made to community Flink.

Flink adoption timeline: In 2017 Tencent evaluated Flink as a replacement for Storm because of Storm's limitations in state support, fault tolerance, windowing APIs, and exactly-once guarantees. After a pilot that same year, Tencent began productizing Flink in 2018, building Oceanus on YARN and later offering the platform through Tencent Cloud and to external private-cloud customers. By 2019 the platform supported a wide range of Tencent products (WeChat, Pay, QQ, Music, Games, and others), with peaks of 2.1 × 10⁸ events per second and a daily message volume near 20 trillion.

Oceanus platform overview: Oceanus runs on a customized Flink engine called TDFLINK, integrates with other big-data components, and supports three ways of building applications: canvas, SQL, and JAR. It provides full lifecycle management (configuration, testing, deployment) and domain-specific services such as ETL, monitoring, and recommendation. The UI shows application lists, status, versioning, and detailed metric dashboards, and canvas-based applications are assembled by dragging and dropping transform operators and window logic.

Key extensions and optimizations:

Rebuilt the Flink Web UI to expose critical metrics and added a “Threads” tab for TaskManager debugging.

Implemented hot‑standby JobManager failover to avoid full job restarts during leader switches.

Redesigned checkpoint failure handling with a new CheckpointFailureManager and failure counters, giving the coordinator full control over failure tolerance.
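The failure-counter idea can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Flink's or Tencent's implementation: the class name `CheckpointFailureTracker`, the helper `first_fatal_checkpoint`, and the rule "count consecutive failures, reset on success" are all hypothetical choices made for illustration.

```python
from typing import Iterable, Optional


class CheckpointFailureTracker:
    """Hypothetical counter of consecutive checkpoint failures.

    Assumption: the job is failed only after more than
    `tolerable_failures` checkpoints fail in a row.
    """

    def __init__(self, tolerable_failures: int) -> None:
        if tolerable_failures < 0:
            raise ValueError("tolerable_failures must be >= 0")
        self.tolerable_failures = tolerable_failures
        self.consecutive_failures = 0

    def on_success(self) -> None:
        # A completed checkpoint resets the counter.
        self.consecutive_failures = 0

    def on_failure(self) -> bool:
        # Returns True when the tolerance is exceeded and the job should fail.
        self.consecutive_failures += 1
        return self.consecutive_failures > self.tolerable_failures


def first_fatal_checkpoint(succeeded: Iterable[bool],
                           tolerable_failures: int) -> Optional[int]:
    """Index of the checkpoint that would fail the job, or None."""
    tracker = CheckpointFailureTracker(tolerable_failures)
    for index, ok in enumerate(succeeded):
        if ok:
            tracker.on_success()
        elif tracker.on_failure():
            return index
    return None


# Two failures, a success, then three failures in a row.
history = [False, False, True, False, False, False]
print(first_fatal_checkpoint(history, tolerable_failures=2))  # 5
```

Keeping the counter in one place means the tolerance policy can be changed without touching the code paths that report individual failures.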

Created Enhanced Window and Incremental Window features that tolerate arbitrary event delays and support multi‑trigger aggregation within large windows, exposed via custom SQL keywords.
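To make "multi-trigger aggregation within a large window" concrete, the sketch below counts events per key in a large window and emits an updated running count at the end of every smaller trigger interval that received data. It is a batch-style simulation in plain Python, not the Oceanus SQL syntax or Flink's window API; the function name, the tuple layout, and the choice of a count aggregate are assumptions for illustration. Late-event handling (the Enhanced Window part) is not modeled.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Event = Tuple[int, str]             # (event timestamp, key)
Result = Tuple[int, int, str, int]  # (window start, fire time, key, running count)


def incremental_window_counts(events: Iterable[Event],
                              window_size: int,
                              trigger_interval: int) -> List[Result]:
    """Emit a running count per key each time a trigger interval closes."""
    if window_size % trigger_interval != 0:
        raise ValueError("window_size must be a multiple of trigger_interval")

    # Stage 1: pre-aggregate events into small panes of one trigger interval.
    panes = defaultdict(int)  # (key, pane_start) -> count
    for ts, key in events:
        panes[(key, ts - ts % trigger_interval)] += 1

    # Stage 2: walk the panes in time order and accumulate per large window.
    running = defaultdict(int)  # (key, window_start) -> running count
    results: List[Result] = []
    for key, pane_start in sorted(panes, key=lambda kp: (kp[1], kp[0])):
        window_start = pane_start - pane_start % window_size
        running[(key, window_start)] += panes[(key, pane_start)]
        fire_time = pane_start + trigger_interval
        results.append((window_start, fire_time, key,
                        running[(key, window_start)]))
    return results


# A one-hour window (3600 s) that fires every ten minutes (600 s).
events = [(10, "a"), (20, "a"), (650, "a"), (700, "b"), (3700, "a")]
for row in incremental_window_counts(events, 3600, 600):
    print(row)
```

Each output row carries the cumulative value for the enclosing large window, so a consumer sees intermediate results long before the window closes.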

Developed LocalKeyBy, a two-stage keyBy that pre-aggregates records locally before the keyed shuffle, which mitigates data skew and improves performance when a few keys dominate the stream.
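The two-stage idea behind LocalKeyBy can be sketched without any Flink code. In the simulation below, each upstream partition first collapses its records into one partial count per key, and only those partials are merged globally. The function names and the use of a count aggregate are assumptions for illustration; the real operator works on streaming state rather than in-memory lists.

```python
from collections import Counter
from typing import Iterable, List


def local_pre_aggregate(partition: Iterable[str]) -> Counter:
    """Stage 1: collapse one partition's records into a partial count per key."""
    return Counter(partition)


def global_aggregate(partials: Iterable[Counter]) -> Counter:
    """Stage 2: merge the partial counts that arrive after the keyed shuffle."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return total


# Two upstream partitions with a heavily skewed key, "hot".
partitions: List[List[str]] = [
    ["hot"] * 5 + ["a"],
    ["hot"] * 4 + ["b"],
]

partials = [local_pre_aggregate(p) for p in partitions]
records_without_local_stage = sum(len(p) for p in partitions)
records_with_local_stage = sum(len(p) for p in partials)

print(records_without_local_stage)  # 11 records would cross the shuffle
print(records_with_local_stage)     # 4 partial records cross it instead
print(global_aggregate(partials)["hot"])  # 9
```

The hot key still lands on a single downstream task, but that task now receives one partial per upstream partition instead of every raw record, which is where the skew relief comes from.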

Added watermark idle detection on downstream transform operators to prevent pipeline stalls when upstream partitions become empty.
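Why an empty upstream partition stalls a pipeline, and how idle detection avoids it, can be shown with a small model. An operator with several inputs normally advances its watermark to the minimum across all inputs, so one silent input pins the minimum. The sketch below ignores inputs that have been silent longer than a timeout. The class name, the timeout parameter, and the explicit `now` argument are assumptions for illustration, not the Oceanus or Flink implementation.

```python
from typing import List


class IdleAwareWatermarkTracker:
    """Hypothetical per-operator watermark tracker that skips idle inputs."""

    def __init__(self, num_channels: int, idle_timeout: float,
                 start_time: float = 0.0) -> None:
        self.watermarks: List[float] = [float("-inf")] * num_channels
        self.last_active: List[float] = [start_time] * num_channels
        self.idle_timeout = idle_timeout
        self.emitted = float("-inf")  # output watermark, never moves backwards

    def on_watermark(self, channel: int, watermark: float, now: float) -> None:
        # Record the input's progress and mark it as recently active.
        self.watermarks[channel] = max(self.watermarks[channel], watermark)
        self.last_active[channel] = now

    def current_watermark(self, now: float) -> float:
        # Only inputs seen within the timeout take part in the minimum.
        active = [wm for wm, seen in zip(self.watermarks, self.last_active)
                  if now - seen <= self.idle_timeout]
        candidate = min(active) if active else self.emitted
        self.emitted = max(self.emitted, candidate)
        return self.emitted


tracker = IdleAwareWatermarkTracker(num_channels=2, idle_timeout=10.0)
tracker.on_watermark(0, 100, now=0.0)
tracker.on_watermark(1, 50, now=0.0)
print(tracker.current_watermark(now=1.0))   # 50: both inputs are active

tracker.on_watermark(0, 200, now=20.0)      # channel 1 has gone silent
print(tracker.current_watermark(now=20.0))  # 200: idle channel 1 is ignored
```

Keeping the emitted watermark monotonic matters when an idle input wakes up with an older watermark: the operator keeps its current value instead of moving backwards.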

Separated framework and user logs by customizing class‑loader behavior and injecting per‑job log configurations, including a UI tab that lists all log files for easier inspection.

Many of these enhancements have been contributed back to the Flink community, and Tencent continues to seek collaborators for tackling trillion‑scale data challenges.

The presentation concludes with an invitation to explore Oceanus further via QR codes and to discuss ideas offline.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
