How Douyu Built a Scalable Real‑Time Flink Platform on Kubernetes
Douyu’s journey from early Spark and Storm streaming to a Kubernetes‑native Flink platform illustrates the architectural design, challenges, and solutions for large‑scale real‑time computing, data warehousing, and future scalability in a high‑traffic live‑streaming environment.
Background
Douyu, founded in 2014, needed near‑real‑time data processing. Around 2018 it introduced Spark Streaming and Storm for 5‑minute and hourly scenarios, but these technologies struggled as requirements grew.
In 2019 Douyu adopted Flink using a Flink‑jar development approach, which proved costly and complex.
By late 2019 and early 2020, a Kubernetes‑based Flink real‑time computing platform was built, supporting both SQL and JAR job development, internally named the “Xuanwu Computing Platform”.
After launch, the platform supported advertising, dashboards, recommendation, system monitoring, risk control, data analysis, and real‑time tagging. By Q3 2021 it served over 100 users, 2000+ vCores, 500+ jobs, processing more than a trillion records daily.
Real‑time Platform Construction
Before Xuanwu, Flink‑jar development faced four pain points: high development threshold, high deployment cost, lack of monitoring/alerting, and no job version management.
The new platform is built on a K8s cluster and consists of four layers:
Platform layer: metadata management, job management, operations, demo cases, monitoring dashboard, scheduling, and alerting.
Service layer: Flink job service and Flink gateway service, providing SQL validation, debugging, job run/stop, and log query.
Scheduling layer: uses K8s container images to run multiple Flink versions, includes SQL mapping for connector differences, and offers full job state tracking.
K8s cluster layer: basic runtime environment.
The platform offers capabilities such as SQL‑based job development, online debugging, syntax checking, multi‑version jobs, metadata management, configuration masking, cluster management, and parameter tuning.
Challenges and Solutions
First challenge: deploying Flink on K8s required separate JM and TM instance groups. A standalone K8s deployment created two instance groups (JM and TM) with the same HA cluster ID to bind them, ensuring resource isolation by launching a dedicated Flink cluster per job.
To align K8s pod resources with Flink‑conf settings, the Flink image entrypoint was modified to pull job definitions and replace memory size in the flink‑conf file; later native K8s parameters solved this.
Second challenge: monitoring job status. Each job is abstracted as a message in a Zookeeper‑based queue with states Accept, Running, Failed, Cancel, and Finish, each handled by a dedicated thread pool. For Running jobs, the thread pool queries the Flink UI via Nginx Ingress to obtain the job state.
Third challenge: reading Hive tables and using Hive‑UDFs. FlinkSQL submission was split into three parts: job assembly, context initialization, and execution. Job assembly supports SDK GET (service call) and FILE GET (local SQL file). Context initialization sets Hive‑like parameters and injects a Catalog, registering it in Flink.
Real‑time monitoring and debugging were added through a Flink Gateway Server that re‑packages Flink cluster APIs, offering syntax check, submission, status check, stop, and mock execution. SQL mock rewrites sources to data generators and sinks to console, enabling safe online debugging.
Metrics are reported via a custom Metrics Reporter to a Kafka cluster, consumed by Flink tasks, aggregated, and pushed to Prometheus; Grafana visualizes dashboards covering resource, stability, Kafka, and CPU/memory monitoring.
Real‑time Data Warehouse Exploration
The first real‑time warehouse used Kafka as the middle layer: DB and LOG data were written to Kafka via Canal and logging services as the ODS layer. Flink cleaned and enriched data to produce the DWD layer, which was then aggregated into the DWS layer and finally written to HBase, MySQL, Elasticsearch, Redis, ClickHouse, etc.
Issues identified: limited Kafka retention time, separate offline and real‑time storage, difficulty of ad‑hoc queries, and poor data back‑tracking support.
The second solution introduced Iceberg as the intermediate store, injecting Iceberg metadata via Catalog. While solving some Kafka problems, Iceberg added latency because data visibility depends on checkpoint‑based commits, making it unsuitable for low‑latency scenarios.
A custom metadata service was built to maintain catalog information and enable parallel use of both solutions, with ongoing exploration of more convenient real‑time warehouse models.
Future Outlook
Douyu envisions three directions: dynamic scaling of Flink jobs for automatic resource adjustment; simplifying real‑time warehouse development to lower the entry barrier and promote large‑scale adoption; and establishing a comprehensive real‑time data quality monitoring system for verification and traceability.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
