
Practical Experiences and Lessons Learned in Building a Flink‑Based Real‑Time Computing Platform at Tongcheng‑Elong

This article details the design, implementation, and optimization of a Flink‑based real‑time computing platform at Tongcheng‑Elong, covering the evolution from Storm to Flink, support for FlinkSQL and FlinkStream, metric collection, logging, data lineage, savepoint management, and numerous stability fixes contributed back to the open‑source community.

Tongcheng Travel Technology Center

In early 2015, the team built a user‑behavior tracking system on Storm; in 2018 it migrated to Flink to gain exactly‑once semantics, built‑in back‑pressure handling, and support for large‑scale clusters. The new platform now runs nearly a thousand real‑time jobs across the company's business lines.

The platform supports two task types: FlinkSQL for users comfortable with SQL, and FlinkStream for developers writing Java/Scala code. The FlinkSQL submission module, built on Apache Calcite, works in four steps: it parses the SQL, validates it, generates a Flink JobGraph, submits the job via YARN, and then monitors it through the Flink REST API.
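To make the four steps concrete, here is a sketch of the kind of FlinkSQL job such a module would parse, validate, and translate into a JobGraph. All table names, fields, and connector properties below are illustrative, not taken from the platform itself:

```sql
-- Hypothetical source table backed by a message queue
CREATE TABLE order_events (
  order_id   STRING,
  city       STRING,
  amount     DOUBLE,
  event_time TIMESTAMP(3)
) WITH ('connector' = 'kafka', 'topic' = 'orders');

-- Hypothetical target table written to an external store
CREATE TABLE city_gmv (
  city STRING,
  gmv  DOUBLE
) WITH ('connector' = 'elasticsearch');

-- The query the module parses with Calcite and compiles to a JobGraph
INSERT INTO city_gmv
SELECT city, SUM(amount) AS gmv
FROM order_events
GROUP BY city;
```

In this flow, the user only writes the statements above; the submission module handles compilation, YARN submission, and REST‑based monitoring behind the scenes.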

The team extended FlinkSQL extensively, adding DDL for source and target tables, custom scalar and table functions, and side‑table (dimension) joins, all wrapped in a user‑friendly API that hides Flink's low‑level details.
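The extensions above might look roughly like the following. The function class, table names, and join syntax are illustrative (recent Flink versions express dimension‑table lookups as temporal joins; the platform's own 2018‑era syntax may well have differed):

```sql
-- Registering a hypothetical custom scalar function
CREATE FUNCTION parse_ua AS 'com.example.udf.ParseUserAgent';

-- A side-table (dimension) join: each streaming record is enriched
-- against the latest version of a slowly changing lookup table
SELECT o.order_id, u.user_level
FROM order_events AS o
JOIN user_profile FOR SYSTEM_TIME AS OF o.proc_time AS u
  ON o.user_id = u.user_id;
```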

For metrics collection the team replaced Flink’s pull‑based Prometheus integration with a Pushgateway to accommodate YARN‑deployed clusters, and built a monitoring UI that visualizes key indicators such as operator back‑pressure.
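A push‑based setup like this is configured through Flink's `PrometheusPushGatewayReporter`. A minimal sketch of the relevant `flink-conf.yaml` entries follows; the host and port are placeholders:

```yaml
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: pushgateway.example.com
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: flink-metrics
metrics.reporter.promgateway.randomJobNameSuffix: true   # avoid collisions between jobs
metrics.reporter.promgateway.deleteOnShutdown: true      # clean up metrics when the job stops
```

Pushing fits YARN deployments because TaskManager endpoints come and go with container allocation, so Prometheus has no stable address to scrape.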

A real‑time log aggregation system was introduced, forwarding logs from all TaskManagers to Elasticsearch for searchable, non‑intrusive log retrieval.
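One non‑intrusive way to achieve this is at the logging‑framework level, so user code needs no changes. The sketch below assumes (as an illustration, not from the article) that logs are shipped through Kafka before being indexed into Elasticsearch, using the standard `KafkaLog4jAppender`; broker and topic names are placeholders:

```properties
log4j.rootLogger=INFO, file, kafka
log4j.appender.kafka=org.apache.kafka.log4jappender.KafkaLog4jAppender
log4j.appender.kafka.brokerList=kafka.example.com:9092
log4j.appender.kafka.topic=flink-taskmanager-logs
log4j.appender.kafka.layout=org.apache.log4j.PatternLayout
log4j.appender.kafka.layout.ConversionPattern=%d %p %c - %m%n
```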

Data lineage support was added by instrumenting the Flink client to extract StreamNode information during job submission, enabling downstream lineage tracking.
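A pseudocode sketch of such a hook, assuming the instrumented client walks the StreamGraph before it is translated into a JobGraph (`lineageCollector` is a hypothetical component, not from the article):

```
// Inside the instrumented Flink client, at submission time
StreamGraph graph = env.getStreamGraph();
for (StreamNode node : graph.getStreamNodes()) {
    // Operator names reveal sources and sinks,
    // e.g. "Source: Kafka" or "Sink: Kudu"
    lineageCollector.record(jobName, node.getId(), node.getOperatorName());
}
```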

Savepoint handling was improved by exposing external trigger APIs, allowing smooth job upgrades without state loss.
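In recent Flink versions, such external triggers map onto the JobManager's REST API. A sketch of the request shapes (the target directory is a placeholder):

```
POST /jobs/:jobid/savepoints              # trigger a savepoint
{ "target-directory": "hdfs:///flink/savepoints", "cancel-job": false }

GET  /jobs/:jobid/savepoints/:triggerid   # poll until the savepoint completes
```

Once the savepoint completes, the upgraded job can be resumed from it (e.g. `flink run -s <savepoint-path> ...`), preserving operator state across the upgrade.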

Stability enhancements addressed several critical issues: an “empty‑run” bug where YARN reported RUNNING while all TaskManagers had exited, connector bugs for RocketMQ, HDFS, Kudu, and Elasticsearch, and a Zookeeper network‑partition problem that caused massive task restarts. The team contributed patches (FLINK‑9187, FLINK‑11887, FLINK‑12246, FLINK‑12219, FLINK‑12247, FLINK‑10052, etc.) back to the Apache Flink community.

The article concludes that a robust big‑data infrastructure requires continuous platform development, close collaboration with open‑source communities, and proactive stability engineering to support rapidly growing real‑time analytics workloads.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Big Data · Flink · Streaming · Data Lineage · Stability · Real‑Time Computing
Written by

Tongcheng Travel Technology Center

Pursue excellence, start again with Tongcheng! More technical insights to help you along your journey and make development enjoyable.
