How Youzan Scaled Real‑Time Analytics with Flink: Architecture, Pitfalls, and Lessons
This article walks through Youzan's real‑time platform architecture, explains why Flink was chosen over Spark Structured Streaming, details practical challenges such as container over‑provisioning and monitoring overhead, shares solutions for Spring integration and async caching, and outlines future directions for SQL‑based streaming and scheduler improvements.
1. Introduction
The article is organized into five parts: an overview of Youzan's real‑time platform architecture, the rationale for selecting Flink, detailed Flink practice at Youzan, the move toward SQL‑based real‑time computing with a UI, and future outlooks.
2. Youzan Real‑Time Platform Architecture
The platform consists of several key components:
Message middleware: Kafka and NSQ (a Go‑based system with a Java client that guarantees at‑least‑once delivery).
Computation engines: legacy Storm jobs, Spark Streaming (micro‑batch, high latency, limited state support), and the newer Flink engine.
Storage layer: MySQL, HBase, Elasticsearch, and ZanKV (a Redis‑compatible distributed KV store developed in‑house).
Real‑time OLAP: Druid for multidimensional analytics.
Platform services: cluster, project, task management and alert monitoring.
The architecture sets the stage for evaluating Flink.
3. Why Flink Over Spark Structured Streaming
Performance comparison highlights:
Latency: Flink, as a true stream engine, delivers lower latency than Spark’s micro‑batch model.
Throughput: Spark may be slightly ahead in some scenarios, but Flink’s RocksDB‑backed state management excels for large intermediate states.
State management: Flink supports in‑memory and RocksDB state, enabling incremental checkpoints that only persist changed SST files, reducing OOM risk.
SQL support: Flink supports complex aggregations, distinct, flexible windows, and both insert‑only and upsert semantics, whereas Spark Structured Streaming lacks many of these features.
API flexibility: Flink allows seamless conversion between Table API and DataStream API, while Spark forces users into DStream or RDD for many operations.
4. Flink Practice at Youzan
4.1 Issue FLINK‑9567 – Excessive Containers
During deployment on YARN (single‑job mode), a task that requested five containers unexpectedly expanded to over 100 containers, most idle. The article explains YARN integration components (TaskExecutor, SlotManager, Flink ResourceManager, AMRMClient) and the pending‑slot/ pending‑container counters.
When containers are pre‑empted, the ResourceManager immediately requests new containers, leading to over‑provisioning. Two mitigation attempts were made: (1) suppress immediate container requests after failures, risking stuck slot requests; (2) check pending slots before requesting new containers, avoiding unnecessary allocations.
4.2 Issue – Monitoring Overhead
Enabling latency monitoring for a simple job (2 sources, 2 operators, parallelism 2) generated a quadratic amount of metric data (N = p² × sources × operators). At parallelism 10 the JobMaster received >24 000 records, consuming ~1.6 GB and causing Full GC and OOM.
Solutions include:
Flink‑10243: finer‑grained latency metrics (single‑mode) reducing data volume.
Flink‑10246: dedicated low‑priority ActorSystem for metric queries (back‑ported to 1.5.5/1.6.2).
Flink‑10252: ongoing work to limit metric message size.
4.3 Integration with Spring
Two common pitfalls:
Starting a Spring context on the client side and trying to use beans inside TaskManager JVMs – impossible because they run in separate processes.
Creating a Spring context in each RichFunction’s open method, leading to multiple contexts per JVM and conflicts with OperatorChain and CoLocationGroup optimizations.
The resolution is to wrap the Spring context in a singleton, ensuring a single instance per JVM and accessing beans via that singleton in the operator’s open method.
4.4 Async and Cache Challenges
When using RichAsyncFunction, keyed state cannot be shared across different keys, causing cache writes to land in the wrong key bucket. The fix is to implement a custom operator that directs all records to a common key, uses MapState for caching, and implements AsyncFunction in an AbstractRichFunction subclass.
5. SQL‑Based Real‑Time Computing and UI
Initially, SDKs were used to build SQL jobs, which proved unfriendly for end users. The current approach registers data sources and tables (schemas, metadata) in the platform, generates JSON schemas from sample messages, and executes SQL on Flink. Unsupported SQL features are extended via UDFs, and the system tracks dependencies between real‑time jobs for end‑to‑end monitoring.
6. Future Directions
Planned explorations include comparing Flink’s batch/ML modules with Spark, investigating Flink‑Hive integration (FLINK‑10566), and improving scheduler and resource management. Proposed changes separate the scheduler from the execution graph, enable plug‑in schedulers, and support auto‑scaling (issues FLINK‑10404, FLINK‑10429).
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
