Tencent Oceanus: Evolution, Productization, and Optimizations of Real‑Time Stream Computing with Flink
This article recounts Tencent's journey from adopting Flink to building the Oceanus platform. It covers the platform's architecture and product features, along with a series of deep extensions aimed at supporting trillion‑scale real‑time data processing: a Web UI redesign, JobManager failover, checkpoint failure handling, enhanced windows, LocalKeyBy, watermark idle detection, and log isolation.
Tencent's real‑time computing team provides a high‑performance, stable, and easy‑to‑use streaming data service, handling peaks of 2.1 × 10⁸ events per second, daily volumes of 1.7 × 10¹³ events, and 20 trillion processing operations per day. To meet these demands, Tencent selected Apache Flink as the next‑generation stream engine, heavily customized the community version, and built Oceanus, an end‑to‑end visual real‑time computing platform that integrates development, testing, deployment, and operations.
The talk covers four topics: the evolution of Flink usage at Tencent, the productization of the Oceanus platform, cloud‑based streaming services, and the deep extensions and optimizations Tencent made to the community version of Flink.
Flink adoption timeline: In 2017 Tencent evaluated Flink as a replacement for Storm because Storm lacked state support, fault tolerance, a window API, and exactly‑once guarantees. After a pilot that year, Tencent began productizing Flink in 2018, building Oceanus on YARN and later offering the platform to Tencent Cloud and external private‑cloud customers. By 2019 the platform supported a wide range of Tencent products (WeChat, Pay, QQ, Music, Games, etc.) and processed peaks of 2.1 × 10⁸ events per second, with a daily volume of about 17 trillion messages and roughly 20 trillion computations per day.
Oceanus platform overview: Oceanus runs on a customized Flink engine called TDFLINK, integrates with other big‑data components, and supports three ways of building applications: canvas, SQL, and JAR. It provides full lifecycle management (configuration, testing, deployment) and domain‑specific services such as ETL, monitoring, and recommendation. The UI shows application lists, status, versioning, and detailed metric dashboards, and canvas‑based apps let users construct transform operators and window logic by drag and drop.
Key extensions and optimizations:
Rebuilt the Flink Web UI to expose critical metrics and added a “Threads” tab for TaskManager debugging.
Implemented hot‑standby JobManager failover to avoid full job restarts during leader switches.
Redesigned checkpoint failure handling with a new CheckpointFailureManager and failure counters, giving the coordinator full control over failure tolerance.
Created Enhanced Window and Incremental Window features that tolerate arbitrary event delays and support multi‑trigger aggregation within large windows, exposed via custom SQL keywords.
Developed LocalKeyBy, a two‑stage key‑by that pre‑aggregates records locally before they are shuffled by key, which reduces the load that hot keys place on downstream subtasks and improves performance under heavy data skew.
Added watermark idle detection on downstream transform operators to prevent pipeline stalls when upstream partitions become empty.
Separated framework and user logs by customizing class‑loader behavior and injecting per‑job log configurations, including a UI tab that lists all log files for easier inspection.
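The checkpoint failure handling described above can be illustrated with a small counter. This is a minimal sketch, not Tencent's or Flink's code: the class name `ToyCheckpointFailureTracker`, the `tolerable_failures` parameter, and the rule "fail the job once consecutive failures exceed the tolerated number, reset on success" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyCheckpointFailureTracker:
    """Hypothetical coordinator-side tracker; not a real Flink class."""
    tolerable_failures: int            # assumed policy: consecutive failures tolerated
    fail_job: Callable[[str], None]    # callback invoked once the budget is exceeded
    consecutive_failures: int = 0

    def on_checkpoint_failed(self, checkpoint_id: int, reason: str) -> None:
        # Every failure is reported to one place, so one component decides.
        self.consecutive_failures += 1
        if self.consecutive_failures > self.tolerable_failures:
            self.fail_job(f"checkpoint {checkpoint_id} failed: {reason}")

    def on_checkpoint_completed(self, checkpoint_id: int) -> None:
        # A successful checkpoint resets the consecutive-failure counter.
        self.consecutive_failures = 0


job_failures: List[str] = []
tracker = ToyCheckpointFailureTracker(tolerable_failures=2,
                                      fail_job=job_failures.append)

tracker.on_checkpoint_failed(1, "timeout")
tracker.on_checkpoint_failed(2, "timeout")
tracker.on_checkpoint_completed(3)           # counter goes back to zero
tracker.on_checkpoint_failed(4, "declined")
tracker.on_checkpoint_failed(5, "declined")
tracker.on_checkpoint_failed(6, "declined")  # third in a row exceeds the budget
```

In this toy run only the failure of checkpoint 6 reaches `fail_job`, because the success of checkpoint 3 cleared the two earlier timeouts. Keeping the counter in a single coordinator-side object is what lets one component decide how much failure to tolerate.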
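The two-stage idea behind LocalKeyBy can be sketched outside Flink as follows. This is a simplified simulation under stated assumptions, not the Oceanus implementation: the function names, the count-based flush (`batch_size`), and the round-robin distribution across four upstream subtasks are all hypothetical, and a real operator would also need to checkpoint its local buffer.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def local_pre_aggregate(records: Iterable[str],
                        batch_size: int) -> List[Tuple[str, int]]:
    """Stage 1: count keys locally, emitting partial counts every batch_size records."""
    partials: List[Tuple[str, int]] = []
    buffer: Counter = Counter()
    seen = 0
    for key in records:
        buffer[key] += 1
        seen += 1
        if seen == batch_size:
            partials.extend(buffer.items())
            buffer.clear()
            seen = 0
    partials.extend(buffer.items())  # flush whatever is left at end of input
    return partials


def global_aggregate(partials: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Stage 2: merge partial counts after the key-based shuffle."""
    totals: Dict[str, int] = defaultdict(int)
    for key, count in partials:
        totals[key] += count
    return dict(totals)


# A skewed input: one hot key dominates the stream.
records = ["hot"] * 900 + [f"k{i % 10}" for i in range(100)]

# Assumed setup: records are spread round-robin over 4 upstream subtasks.
subtasks: List[List[str]] = [records[i::4] for i in range(4)]

shuffled: List[Tuple[str, int]] = []
for subtask_records in subtasks:
    shuffled.extend(local_pre_aggregate(subtask_records, batch_size=50))

totals = global_aggregate(shuffled)
print(len(records), "input records ->", len(shuffled), "shuffled partials")
```

The final totals match a direct count of the input because counting is associative and commutative, while the subtask that owns the hot key receives a handful of partial counts instead of every raw record. The same trick works for any aggregate that can be merged from partial results, such as sum, min, or max.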
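Watermark idle detection on a downstream operator can be illustrated with a toy tracker. This is a sketch under assumptions, not Flink's API or Tencent's implementation: the class `ToyIdleAwareWatermarkTracker`, the `idle_timeout_ms` parameter, and the rule "ignore an input channel that has been silent longer than the timeout" are hypothetical names and policies.

```python
from typing import Dict, Iterable, List

MIN_WATERMARK = -(2 ** 63)  # sentinel meaning "no watermark seen yet"


class ToyIdleAwareWatermarkTracker:
    """Combines per-channel watermarks, skipping channels that look idle."""

    def __init__(self, channels: Iterable[str], idle_timeout_ms: int) -> None:
        self.idle_timeout_ms = idle_timeout_ms
        self.watermarks: Dict[str, int] = {c: MIN_WATERMARK for c in channels}
        self.last_activity_ms: Dict[str, int] = {c: 0 for c in self.watermarks}
        self.emitted = MIN_WATERMARK

    def on_watermark(self, channel: str, watermark: int, now_ms: int) -> None:
        self.watermarks[channel] = max(self.watermarks[channel], watermark)
        self.last_activity_ms[channel] = now_ms

    def advance(self, now_ms: int) -> int:
        # Only channels heard from within the timeout take part in the minimum.
        active = [wm for c, wm in self.watermarks.items()
                  if now_ms - self.last_activity_ms[c] < self.idle_timeout_ms]
        if active:
            # The emitted watermark must never move backwards, even when an
            # idle channel wakes up with an older watermark.
            self.emitted = max(self.emitted, min(active))
        return self.emitted


tracker = ToyIdleAwareWatermarkTracker(["p0", "p1"], idle_timeout_ms=1000)
observed: List[int] = []

tracker.on_watermark("p0", 100, now_ms=0)
tracker.on_watermark("p1", 90, now_ms=0)
observed.append(tracker.advance(now_ms=0))      # both active: min(100, 90)

tracker.on_watermark("p0", 600, now_ms=500)
observed.append(tracker.advance(now_ms=500))    # p1 silent but not yet idle

tracker.on_watermark("p0", 1300, now_ms=1200)
observed.append(tracker.advance(now_ms=1200))   # p1 idle: only p0 counts

tracker.on_watermark("p1", 1000, now_ms=1500)
observed.append(tracker.advance(now_ms=1500))   # p1 is back, no regression
```

Without the idle check, the combined watermark would stay pinned at p1's last value for as long as that partition is empty, and downstream windows would never fire. A real implementation also has to decide how to treat records from a reactivated channel that arrive behind the already emitted watermark; the sketch leaves that out.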
Many of these enhancements have been contributed back to the Flink community, and Tencent continues to seek collaborators for tackling trillion‑scale data challenges.
The presentation concludes with an invitation to explore Oceanus further via QR codes and to discuss ideas offline.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
