How NetEase Game Built StreamflySQL: From Client‑Side to Server‑Side Flink SQL
This article recounts NetEase Game's evolution of its real‑time computation platform Streamfly, detailing the transition from a client‑side Flink SQL solution (StreamflySQL v1) to a server‑side architecture using SQL Gateway (StreamflySQL v2), the challenges faced, and future work.
Abstract: This article is compiled from NetEase Game senior engineer Lin Xiaobo's talk at Flink Forward Asia 2021 on platform construction. It covers the development history of NetEase Game Flink SQL, StreamflySQL v1 based on template JAR, StreamflySQL v2 based on SQL Gateway, and future work.
1. NetEase Game Flink SQL Development History
NetEase Game's real‑time computation platform is called Streamfly, a play on Stormfly, the dragon in "How to Train Your Dragon". The name changed from Stormfly to Streamfly when the platform migrated from Storm to Flink.
Streamfly originated from Lambda, a subsystem of the offline job platform Omega. Lambda initially supported Storm and Spark Streaming before switching to Flink only. In 2019, Lambda was split out to build the Streamfly platform, and at the end of 2019 the first Flink SQL platform, StreamflySQL v1, was launched. This version used a template JAR to provide basic Flink SQL functionality, but the user experience was poor, which led to a complete rebuild in early 2021 as StreamflySQL v2, based on SQL Gateway.
Understanding the differences between the two versions requires a review of Flink SQL's basic workflow.
When a user submits SQL, it is parsed into a logical plan, which the planner's optimizer turns into a physical plan. The physical plan is then code‑generated into DataStream API transformations. Finally, the StreamGraphGenerator converts these transformations into a StreamGraph, which is translated into the JobGraph that is submitted to the Flink cluster.
These steps occur inside the TableEnvironment, which may run on the Flink Client or JobManager depending on the deployment mode. Flink supports three cluster deployment modes: Application, Per‑Job, and Session. In Application mode the TableEnvironment runs on the JobManager; in the other two modes it runs on the Client. In all modes the TableEnvironment is one‑time use and exits after submitting the JobGraph.
To better reuse TableEnvironment and provide stateful operations, some projects run TableEnvironment in a separate server‑side process (Server‑side SQL compilation) while others keep it client‑side (Client‑side SQL compilation).
Client‑side compilation performs parsing, translation, and optimization on the client, using template JARs or Flink's SQL Client. It is easy to set up and low‑cost but suffers from poor performance and limited advanced features.
Server‑side compilation moves these steps to an independent server process, similar to traditional databases. It offers better extensibility and performance but currently lacks mature open‑source solutions, requiring deeper knowledge of Flink's internal APIs.
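The reuse argument behind server‑side compilation can be illustrated with a minimal sketch. It assumes a long‑lived server process that keeps one environment per session; `StubTableEnvironment` and `SessionRegistry` are hypothetical stand‑ins written for this example, not Flink or SQL Gateway classes.

```python
import uuid


class StubTableEnvironment:
    """Stand-in for a long-lived TableEnvironment (hypothetical, not a Flink class)."""

    def __init__(self):
        self.tables = {}  # table name -> DDL text registered in this session

    def register_table(self, name, ddl):
        self.tables[name] = ddl

    def list_tables(self):
        return sorted(self.tables)


class SessionRegistry:
    """Keeps one environment per session so later requests reuse earlier state."""

    def __init__(self):
        self._sessions = {}

    def open_session(self):
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = StubTableEnvironment()
        return session_id

    def environment(self, session_id):
        # Raises KeyError for unknown or closed sessions.
        return self._sessions[session_id]

    def close_session(self, session_id):
        self._sessions.pop(session_id, None)


registry = SessionRegistry()
sid = registry.open_session()
# First request: register a table in the session.
registry.environment(sid).register_table("orders", "CREATE TABLE orders (...)")
# Second request: the same environment still knows the table.
print(registry.environment(sid).list_tables())
```

With client‑side compilation, each submission would start from a fresh environment, so the table registered by the first request would not be visible to the second.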
Editor’s note: The Apache Flink community is developing the SQL Gateway component, which will provide native Flink SQL service capabilities and HiveServer2 compatibility, planned for release in version 1.16. See FLIP‑91 and FLIP‑223 for more details.
StreamflySQL v1 used client‑side compilation, while v2 adopts server‑side compilation. The following sections describe each version.
2. StreamflySQL v1 Based on Template JAR
StreamflySQL v1 chose client‑side compilation for three reasons:
Platform integration: The Lambda scheduler is written in Go and invokes each framework through dynamically generated shell scripts. This is flexible, but it rules out calling Flink’s native Java API directly.
Loose coupling: At the time Flink 1.9’s client API was complex and undergoing refactoring, so the team avoided depending on it.
Practical experience: NetEase Game had extensive experience with the template‑JAR + configuration‑center pattern, making it a natural choice for v1.
In the v1 architecture, a StreamflySQL backend was added on top of the Lambda platform. It combines the user’s SQL and configuration with a common template JAR to generate a Lambda job.
Job submission flow:
User submits SQL and runtime configuration via the front‑end editor.
StreamflySQL backend creates a Lambda job and returns a configuration ID.
Lambda launches the job by executing a Flink CLI run command.
The Flink CLI starts a Flink client that loads the template JAR’s main function, reads the SQL and configuration, and initializes the TableEnvironment.
TableEnvironment reads necessary metadata (databases, tables) from the catalog. NetEase Game uses separate metadata services rather than a unified catalog.
TableEnvironment compiles the JobGraph and deploys the job using a Per‑Job cluster.
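The launch step in the flow above can be sketched as a small command builder. This is an illustration in Python, not the Go scheduler's actual code. `-d` (detached) and `-c` (entry class) are standard `flink run` options; the JAR path, the class name, and the `--config-id` program argument are hypothetical. The `-m yarn-cluster` target assumes Per‑Job on YARN with an older Flink CLI, which the source does not state.

```python
import shlex


def build_flink_run_command(template_jar, main_class, config_id, target_args=()):
    """Assemble the argv for a `flink run` call that launches the template JAR.

    `--config-id` is a hypothetical program argument that the template JAR's
    main function would use to fetch the SQL and runtime configuration.
    """
    argv = ["flink", "run", "-d"]
    argv += list(target_args)  # deployment-target flags, Flink-version dependent
    argv += ["-c", main_class, template_jar]
    argv += ["--config-id", str(config_id)]
    return argv


argv = build_flink_run_command(
    template_jar="/opt/streamfly/sql-template.jar",  # hypothetical path
    main_class="com.example.SqlTemplateMain",        # hypothetical class
    config_id=42,
    target_args=["-m", "yarn-cluster"],  # assumption: Per-Job on YARN
)
print(shlex.join(argv))
```

Passing only a configuration ID keeps the command line short and lets the same template JAR serve every job, which matches the template‑JAR plus configuration‑center pattern described earlier.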
v1 achieved a zero‑to‑one Flink SQL platform but had several pain points:
Slow response: Starting a Flink SQL job took at least 40 seconds due to lazy initialization of TableEnvironment and Flink clusters.
Difficult debugging: Debugging required replacing the sink with a PrintSink, limiting resources, and waiting for the job to finish (often >10 minutes) before results were returned.
Only single‑statement DML support: v1 only supported INSERT statements; SELECT, DDL, and DCL were not supported.
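The single‑statement restriction can be made concrete with a rough sketch of a v1‑style acceptance check. The function names and the keyword table are assumptions for illustration, and the statement splitter is deliberately naive; a real implementation would rely on Flink's SQL parser.

```python
STATEMENT_KINDS = {
    "INSERT": "DML",
    "SELECT": "DQL",
    "CREATE": "DDL",
    "DROP": "DDL",
    "ALTER": "DDL",
    "GRANT": "DCL",
    "REVOKE": "DCL",
}


def split_statements(script):
    # Naive split on ';': breaks on semicolons inside string literals.
    return [s.strip() for s in script.split(";") if s.strip()]


def classify(statement):
    """Classify a statement by its leading keyword."""
    keyword = statement.split(None, 1)[0].upper()
    return STATEMENT_KINDS.get(keyword, "UNKNOWN")


def v1_accepts(script):
    """v1-style rule: exactly one statement, and it must be an INSERT."""
    statements = split_statements(script)
    if len(statements) != 1:
        return False
    return statements[0].split(None, 1)[0].upper() == "INSERT"


print(v1_accepts("INSERT INTO sink SELECT * FROM src;"))
print(v1_accepts("CREATE TABLE t (id INT); INSERT INTO sink SELECT * FROM t;"))
print(classify("select * from src"))
```

Under this rule a script that first declares a table and then inserts from it is rejected, which is the kind of workflow the expanded statement support in v2 targets.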
3. StreamflySQL v2 Based on SQL Gateway
To address v1’s shortcomings, StreamflySQL v2 adopts a server‑side SQL compilation architecture using the open‑source SQL Gateway (Ververica). The backend embeds SQL Gateway into a SpringBoot application, resulting in a more complex but higher‑performance system.
Key improvements:
Response time reduced from ~1 minute to under 10 seconds.
Debug preview no longer waits for job completion; results are streamed back via a socket using SQL Gateway’s temporary table feature.
SQL support expanded to DML, DQL, and DDL.
Challenges encountered:
Metadata persistence: SQL Gateway stores metadata only in memory; after integration with SpringBoot, metadata is persisted to a database, covering session catalogs, functions, tables, and jobs.
Multi‑tenant support: Resources are isolated by launching separate Session Clusters per tenant, and authentication is handled via Hadoop ProxyUser with Kerberos delegation tokens.
Horizontal scaling: Stateless StreamflySQL instances can be scaled horizontally; session affinity routing ensures a user’s requests hit the same instance for continuity.
Job state management: Added monitoring threads with optimistic locking to track job status, and implemented stop‑with‑savepoint and retained checkpoint handling to manage Flink state.
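The optimistic locking mentioned for job state management can be sketched as a version‑checked update. This is a minimal in‑memory model under stated assumptions: `JobStore` stands in for a database table with a version column, and the class, method, and status names are hypothetical rather than StreamflySQL's actual schema.

```python
import threading
from dataclasses import dataclass


@dataclass
class JobRecord:
    job_id: str
    status: str
    version: int = 0


class JobStore:
    """In-memory stand-in for a job table with a version column."""

    def __init__(self):
        self._rows = {}
        self._mutex = threading.Lock()  # emulates the atomicity of one SQL UPDATE

    def insert(self, job_id, status):
        with self._mutex:
            self._rows[job_id] = JobRecord(job_id, status)

    def read(self, job_id):
        with self._mutex:
            row = self._rows[job_id]
            return JobRecord(row.job_id, row.status, row.version)  # snapshot copy

    def update_if_version(self, job_id, new_status, expected_version):
        """Like: UPDATE job SET status=?, version=version+1 WHERE id=? AND version=?"""
        with self._mutex:
            row = self._rows[job_id]
            if row.version != expected_version:
                return False  # another writer updated the row first
            row.status = new_status
            row.version += 1
            return True


store = JobStore()
store.insert("job-1", "RUNNING")

# Two monitors read the same snapshot, then both try to write.
seen_by_a = store.read("job-1")
seen_by_b = store.read("job-1")
print(store.update_if_version("job-1", "FINISHED", seen_by_a.version))  # True
print(store.update_if_version("job-1", "FAILED", seen_by_b.version))    # False
```

The losing writer is expected to re‑read the row and decide whether its update still applies, so several monitoring threads or instances can watch the same job without holding a long‑lived lock.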
Future work includes solving state migration when SQL changes, fine‑grained resource management beyond session‑level configuration, and contributing improvements back to the Flink community, especially FLIP‑91.
