Practical Experience of Apache Flink at ByteDance: Architecture, Optimizations, and Future Directions
This article presents ByteDance's real‑world use of Apache Flink, covering the platform's overall architecture, SQL extensions, custom connectors, UI‑driven SQL platform, performance optimizations such as window mini‑batch and custom windows, dimension‑table enhancements, checkpoint recovery improvements, stream‑batch integration, and upcoming roadmap items.
Li Benchao, an Apache Flink committer at ByteDance, presents a case study of Flink usage at the company, organized into four sections: overall introduction, practice optimizations, stream‑batch integration, and future work.
Overall introduction – After Blink was open‑sourced in Dec 2018, ByteDance built an internal SQL platform on the Flink master branch and released a Blink‑planner‑based streaming SQL platform in Oct 2019. The team extended Flink 1.9 with DDL support (CREATE TABLE/VIEW/FUNCTION, ADD RESOURCE) and watermark definitions, and added numerous connectors (RocketMQ source; RocketMQ, ClickHouse, Doris, LogHouse, Redis, Abase, Bytable, ByteSQL, RPC, Print, Metrics sinks) together with formats (PB, Binlog, Bytes).
The platform also provides a visual SQL editor, parser, debugger, custom UDF/connector support, version control, and job management.
Practice optimizations
1. Window performance: introduced window mini‑batch to reduce state access overhead (≈20‑30% CPU saving) and added custom window types and window offsets to support business scenarios such as hourly UV/GMV per live‑streamer.
    -- my_window is a custom window type that implements a business-specific partitioning of time
    SELECT
      room_id,
      COUNT(DISTINCT user_id)
    FROM MySource
    GROUP BY
      room_id,
      my_window(ts, INTERVAL '1' HOUR)
2. Dimension‑table optimizations: implemented a delayed join that caches unmatched rows and retries them, added key‑by support for dimension tables to improve cache locality, and explored broadcast dimension tables and mini‑batch dimension operators to reduce external I/O.
3. Join optimizations: analyzed the limitations of Interval Join, Regular Join, and Temporal Table Function, and applied fixes for out‑of‑order joins, retract amplification, and watermark stagnation.
4. Checkpoint recovery enhancements: addressed recovery failures caused by operator ID changes, state schema mismatches, and DAG modifications. The team added compatibility interfaces for serializers, migration paths for incompatible state, and special handling for aggregation code generation when new metrics are added.
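The window offsets mentioned in item 1, as used in the `TUMBLE(ts, INTERVAL '7' DAY, INTERVAL '3' DAY)` query that follows, can be made concrete with a few lines of arithmetic. The sketch below is plain Python rather than Flink code. It assumes the offset follows the same arithmetic as `TimeWindow.getWindowStartWithOffset` in Flink's DataStream API; the source does not say whether ByteDance's SQL extension behaves exactly this way, and the helper names `window_start`, `to_ms`, and `from_ms` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
DAY_MS = 24 * 60 * 60 * 1000


def to_ms(dt):
    """Convert an aware datetime to integer milliseconds since the Unix epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_ms(ms):
    """Convert milliseconds since the Unix epoch back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


def window_start(ts_ms, size_ms, offset_ms=0):
    """Start of the tumbling window that contains ts_ms.

    Assumed to mirror Flink's TimeWindow.getWindowStartWithOffset.
    Python's % is non-negative for a positive modulus, so no extra
    handling is needed for timestamps smaller than the offset.
    """
    return ts_ms - (ts_ms - offset_ms) % size_ms


ts = to_ms(datetime(2020, 7, 15, 12, 0, tzinfo=timezone.utc))  # a Wednesday
plain = from_ms(window_start(ts, 7 * DAY_MS))
shifted = from_ms(window_start(ts, 7 * DAY_MS, 3 * DAY_MS))
print("no offset:    ", plain.isoformat())
print("3-day offset: ", shifted.isoformat())
```

For this timestamp the plain 7-day window starts on Thursday 2020-07-09, because the Unix epoch fell on a Thursday, while the 3-day offset moves the start to Sunday 2020-07-12. Without an offset parameter, weekly windows could not be aligned to any other weekday.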
    -- 7-day tumbling window shifted by a 3-day offset (the window offset extension from item 1)
    SELECT
      room_id,
      COUNT(DISTINCT user_id)
    FROM MySource
    GROUP BY
      room_id,
      TUMBLE(ts, INTERVAL '7' DAY, INTERVAL '3' DAY)
Stream‑batch integration – Recognizing that most business logic can be expressed in SQL, ByteDance built a unified Flink‑based architecture in which streaming jobs write to MQ for downstream streaming consumption, while batch jobs read from Flink‑produced Hive tables, eliminating data‑source and engine heterogeneity.
Benefits include a single SQL dialect for both modes, shared UDFs, reduced learning and maintenance costs, and unified optimizer improvements.
Future work and roadmap
Planned enhancements cover retract amplification mitigation, window local‑global support, fast event‑time emission, broadcast dimension tables, broader mini‑batch support (dimension tables, TopN, joins), full Hive‑SQL compatibility, and business extensions such as pushing streaming SQL coverage to 80%, delivering a stream‑batch product, and standardizing real‑time data warehouses.
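Mini-batching, which the roadmap extends from windows to dimension tables, TopN, and joins, generally works by buffering a small batch of records and touching state once per key per batch instead of once per record. The sketch below is a plain-Python illustration of that effect, not Flink code and not ByteDance's implementation: `CountingState`, `count_per_record`, and `count_mini_batch` are hypothetical names, and the dict-backed state exists only to count reads and writes.

```python
from collections import defaultdict


class CountingState:
    """Dict-backed stand-in for keyed state that counts every read and write."""

    def __init__(self):
        self.data = {}
        self.accesses = 0

    def get(self, key):
        self.accesses += 1
        return self.data.get(key, 0)

    def put(self, key, value):
        self.accesses += 1
        self.data[key] = value


def count_per_record(records, state):
    """Baseline: one state read and one state write for every record."""
    for key in records:
        state.put(key, state.get(key) + 1)


def count_mini_batch(records, state, batch_size):
    """Pre-aggregate each batch in memory, then touch state once per distinct key."""
    for start in range(0, len(records), batch_size):
        buffer = defaultdict(int)
        for key in records[start:start + batch_size]:
            buffer[key] += 1
        for key, delta in buffer.items():
            state.put(key, state.get(key) + delta)


records = ["room_1"] * 6 + ["room_2"] * 2

baseline = CountingState()
count_per_record(records, baseline)

batched = CountingState()
count_mini_batch(records, batched, batch_size=4)

print("per-record accesses:", baseline.accesses)
print("mini-batch accesses:", batched.accesses)
```

With eight records over two keys and a batch size of 4, the per-record path makes 16 state accesses and the mini-batch path makes 6, with identical final counts. The saving grows with how often a key repeats inside a batch, and the cost is added latency because records are held until the batch closes.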
The presentation concludes with thanks and a call for readers to like, share, and follow the DataFunTalk community.
