
From Client‑Side to Server‑Side: How NetEase Built StreamflySQL on Flink SQL

This article traces how NetEase Games evolved its real-time StreamflySQL platform from a client-side Flink SQL implementation to a server-side architecture built on SQL Gateway, and covers the motivations, design choices, challenges, and the resulting improvements in latency, debugging, and SQL coverage.

ITPUB

01 Overview of StreamflySQL Evolution

NetEase Games' real-time computation platform, Streamfly (a name borrowed from the movie How to Train Your Dragon), grew out of Lambda, a subsystem of the offline job platform Omega. Lambda initially supported Storm and Spark Streaming and later migrated to Flink. In 2019 the Lambda subsystem was split out to form Streamfly, and later that year the first Flink SQL version, StreamflySQL v1, was launched using a template-jar approach.

To understand the differences between the two versions, it helps to first review the basic Flink SQL workflow. The SQL a user submits is parsed into a logical plan, optimized into a physical plan, and turned by code generation into DataStream API transformations. The StreamGraphGenerator then builds a StreamGraph from those transformations, which is converted into a JobGraph for submission to a Flink cluster. This whole process runs inside a TableEnvironment, which lives either in the Flink client or in the JobManager depending on the deployment mode: in Application mode it runs on the JobManager, while in Per-Job and Session modes it runs in the client.
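As a rough illustration of the workflow above, the sketch below lists the compilation stages in order and records where the TableEnvironment (and therefore SQL compilation) runs for each deployment mode. The names `COMPILE_STAGES`, `COMPILE_SITE`, and `compile_site` are hypothetical and exist only for this example; they are not part of Flink or StreamflySQL.

```python
# Illustrative model only; none of these names come from Flink or StreamflySQL.
COMPILE_STAGES = [
    "parse SQL into a logical plan",
    "optimize into a physical plan",
    "generate DataStream API transformations",
    "build the JobGraph",
]

# Where the program's main() (and thus the TableEnvironment) executes.
COMPILE_SITE = {
    "application": "JobManager",
    "per-job": "client",
    "session": "client",
}


def compile_site(mode: str) -> str:
    """Return where SQL compilation happens for a given deployment mode."""
    try:
        return COMPILE_SITE[mode.lower()]
    except KeyError:
        raise ValueError(f"unknown deployment mode: {mode!r}") from None
```

Seen this way, "client-side" versus "server-side" compilation is mostly a question of which process hosts the TableEnvironment, not of what the stages are.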

02 StreamflySQL v1 – Template‑Jar Architecture (Client‑Side Compilation)

StreamflySQL v1 adopted client‑side SQL compilation for three main reasons:

Platform integration: the Lambda scheduler is written in Go and launches jobs for various frameworks by generating shell scripts dynamically, so it cannot call Flink's native Java API directly.

Loose coupling: at the time (Flink 1.9) the client API was complex and undergoing refactoring, so the team avoided a hard dependency.

Practical experience: extensive internal use of the template‑jar + configuration‑center pattern made it a natural choice.

The overall architecture generated a Lambda job that packaged the user’s SQL and configuration into a template jar, then invoked flink run to launch a Flink client, create a TableEnvironment, and submit the resulting JobGraph.
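The submission step can be sketched as a small command builder. This is a minimal sketch under stated assumptions, not NetEase's actual script generator: the function name, the jar path, the entry class, and the `--job-config-key` program argument are all hypothetical. Only `flink run`, `-d` (detached), and `-c` (entry class) are standard Flink CLI options.

```python
import shlex


def build_flink_run_command(template_jar: str, main_class: str, job_config_key: str) -> str:
    """Assemble a `flink run` invocation for a template jar (hypothetical layout)."""
    argv = [
        "flink", "run",
        "-d",              # detached: return once the job has been submitted
        "-c", main_class,  # entry class inside the template jar
        template_jar,
        # Everything after the jar is handed to the program's main().
        # This argument name is an assumption made for illustration.
        "--job-config-key", job_config_key,
    ]
    # shlex.join quotes any argument that contains shell-sensitive characters.
    return shlex.join(argv)
```

Every submission built this way starts a fresh client JVM that has to create its own TableEnvironment and plan the SQL before anything reaches the cluster, which is where much of v1's startup cost comes from.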

Key pain points of v1:

Slow response time: initializing the TableEnvironment, planning the SQL, and launching a per-job cluster took about 40 seconds per job.

Difficult debugging: debugging a query meant executing the full job with constrained resources, and results became available only after the job completed.

Limited SQL support: only single-statement INSERT (DML) was supported; DQL, DDL, and DCL were unavailable.

03 StreamflySQL v2 – Server‑Side Compilation with SQL Gateway

To address v1's shortcomings, StreamflySQL v2 switched to a server-side compilation model built on Ververica's open-source SQL Gateway. The new architecture embeds SQL Gateway in a Spring Boot service, deploys jobs to a Session Cluster, and separates resource initialization from job execution.

Key improvements:

Response time dropped from about 40 seconds to about 10 seconds.

Debugging now streams results via a socket‑based temporary table, eliminating the need to wait for job termination.

SQL support expanded to DML, DQL, and DDL.
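To make the difference in SQL coverage concrete, the sketch below classifies a statement by its leading keyword and checks it against what each version accepts. It is a deliberate simplification: a real gateway relies on a full SQL parser, v1's "single-statement INSERT" restriction is approximated here as "DML only", and all names in the block are hypothetical.

```python
import re

# Leading keyword -> statement category. Keyword matching is a simplification;
# it does not handle comments, WITH clauses, or multi-statement scripts.
_CATEGORY_BY_KEYWORD = {
    "INSERT": "DML",
    "SELECT": "DQL",
    "CREATE": "DDL",
    "DROP": "DDL",
    "ALTER": "DDL",
    "GRANT": "DCL",
    "REVOKE": "DCL",
}

# Approximation of the coverage described in the article.
V1_SUPPORTED = {"DML"}
V2_SUPPORTED = {"DML", "DQL", "DDL"}


def classify(statement: str) -> str:
    """Return the category (DML, DQL, DDL, DCL) of a single SQL statement."""
    match = re.match(r"\s*([A-Za-z]+)", statement)
    if match is None:
        raise ValueError("statement has no leading keyword")
    keyword = match.group(1).upper()
    if keyword not in _CATEGORY_BY_KEYWORD:
        raise ValueError(f"unrecognized statement keyword: {keyword}")
    return _CATEGORY_BY_KEYWORD[keyword]


def is_supported(statement: str, supported: set) -> bool:
    """Check whether a statement's category is in the supported set."""
    return classify(statement) in supported
```

For example, `is_supported("SELECT * FROM orders", V1_SUPPORTED)` is false while the same call with `V2_SUPPORTED` is true, which mirrors v2's addition of query support for debugging.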

Challenges encountered include:

Metadata persistence: SQL Gateway keeps metadata only in memory, so NetEase integrated the gateway into a Spring Boot service and persisted the metadata in a database.

Multi-tenant isolation: the platform relies on Lambda's queue-based Session Clusters for resource isolation and on Hadoop ProxyUser for Kerberos-based authentication.

Horizontal scalability: service instances are stateless and can be scaled out easily, while session affinity keeps each user's experience consistent across requests.

Job state management: monitoring threads track job status using optimistic locking, and stop-with-savepoint support was added so that a job's state is saved when it is stopped.
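The optimistic-locking idea mentioned for job state tracking can be sketched with a version column. This is a minimal sketch assuming a relational store; the table layout, the function names, and the use of SQLite are assumptions for illustration and do not describe NetEase's actual schema.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory job table with a version column (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job ("
        "id TEXT PRIMARY KEY, status TEXT NOT NULL, version INTEGER NOT NULL)"
    )
    return conn


def register_job(conn: sqlite3.Connection, job_id: str, status: str = "RUNNING") -> None:
    """Insert a new job row starting at version 0."""
    conn.execute(
        "INSERT INTO job (id, status, version) VALUES (?, ?, 0)", (job_id, status)
    )
    conn.commit()


def read_job(conn: sqlite3.Connection, job_id: str):
    """Return (status, version) for a job, or None if it does not exist."""
    return conn.execute(
        "SELECT status, version FROM job WHERE id = ?", (job_id,)
    ).fetchone()


def update_status(conn: sqlite3.Connection, job_id: str, new_status: str,
                  expected_version: int) -> bool:
    """Update the status only if nobody else changed the row since it was read."""
    cur = conn.execute(
        "UPDATE job SET status = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_status, job_id, expected_version),
    )
    conn.commit()
    # rowcount is 0 when the version no longer matches, i.e. the write lost the race.
    return cur.rowcount == 1
```

If two monitoring threads (possibly on different service instances) read the same row and both try to write, only the first update matches the expected version. The second gets `False` and can re-read the row instead of overwriting a newer status, all without holding a database lock between the read and the write.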

04 Future Work

Planned enhancements include:

State migration analysis to assess compatibility of Savepoints after SQL changes.

Fine‑grained resource management for SQL jobs, potentially extending DataStream‑level resource controls to the SQL API.

Contributing back improvements to the Flink community, especially around FLIP‑91 SQL Gateway.

Overall, the transition from a client‑side, template‑jar based platform to a server‑side, SQL‑Gateway powered architecture dramatically improved latency, debugging experience, and SQL feature coverage while laying a foundation for multi‑tenant, scalable real‑time analytics.

Written by

ITPUB
