Big Data 19 min read

How Youzan Scaled Real‑Time Analytics with Flink: Architecture, Pitfalls, and Lessons

This article walks through Youzan's real‑time platform architecture, explains why Flink was chosen over Spark Structured Streaming, details practical challenges such as container over‑provisioning and monitoring overhead, shares solutions for Spring integration and async caching, and outlines future directions for SQL‑based streaming and scheduler improvements.

Youzan Coder
Youzan Coder
Youzan Coder
How Youzan Scaled Real‑Time Analytics with Flink: Architecture, Pitfalls, and Lessons

1. Introduction

The article is organized into five parts: an overview of Youzan's real‑time platform architecture, the rationale for selecting Flink, detailed Flink practice at Youzan, the move toward SQL‑based real‑time computing with a UI, and future outlooks.

2. Youzan Real‑Time Platform Architecture

The platform consists of several key components:

Message middleware: Kafka and NSQ (a Go‑based system with a Java client that guarantees at‑least‑once delivery).

Computation engines: legacy Storm jobs, Spark Streaming (micro‑batch, high latency, limited state support), and the newer Flink engine.

Storage layer: MySQL, HBase, Elasticsearch, and ZanKV (a Redis‑compatible distributed KV store developed in‑house).

Real‑time OLAP: Druid for multidimensional analytics.

Platform services: cluster, project, task management and alert monitoring.

The architecture sets the stage for evaluating Flink.

3. Why Flink Over Spark Structured Streaming

Performance comparison highlights:

Latency: Flink, as a true stream engine, delivers lower latency than Spark’s micro‑batch model.

Throughput: Spark may be slightly ahead in some scenarios, but Flink’s RocksDB‑backed state management excels for large intermediate states.

State management: Flink supports in‑memory and RocksDB state, enabling incremental checkpoints that only persist changed SST files, reducing OOM risk.

SQL support: Flink supports complex aggregations, distinct, flexible windows, and both insert‑only and upsert semantics, whereas Spark Structured Streaming lacks many of these features.

API flexibility: Flink allows seamless conversion between Table API and DataStream API, while Spark forces users into DStream or RDD for many operations.

4. Flink Practice at Youzan

4.1 Issue FLINK‑9567 – Excessive Containers

During deployment on YARN (single‑job mode), a task that requested five containers unexpectedly expanded to over 100 containers, most idle. The article explains YARN integration components (TaskExecutor, SlotManager, Flink ResourceManager, AMRMClient) and the pending‑slot/ pending‑container counters.

When containers are pre‑empted, the ResourceManager immediately requests new containers, leading to over‑provisioning. Two mitigation attempts were made: (1) suppress immediate container requests after failures, risking stuck slot requests; (2) check pending slots before requesting new containers, avoiding unnecessary allocations.

4.2 Issue – Monitoring Overhead

Enabling latency monitoring for a simple job (2 sources, 2 operators, parallelism 2) generated a quadratic amount of metric data (N = p² × sources × operators). At parallelism 10 the JobMaster received >24 000 records, consuming ~1.6 GB and causing Full GC and OOM.

Solutions include:

Flink‑10243: finer‑grained latency metrics (single‑mode) reducing data volume.

Flink‑10246: dedicated low‑priority ActorSystem for metric queries (back‑ported to 1.5.5/1.6.2).

Flink‑10252: ongoing work to limit metric message size.

4.3 Integration with Spring

Two common pitfalls:

Starting a Spring context on the client side and trying to use beans inside TaskManager JVMs – impossible because they run in separate processes.

Creating a Spring context in each RichFunction’s open method, leading to multiple contexts per JVM and conflicts with OperatorChain and CoLocationGroup optimizations.

The resolution is to wrap the Spring context in a singleton, ensuring a single instance per JVM and accessing beans via that singleton in the operator’s open method.

4.4 Async and Cache Challenges

When using RichAsyncFunction, keyed state cannot be shared across different keys, causing cache writes to land in the wrong key bucket. The fix is to implement a custom operator that directs all records to a common key, uses MapState for caching, and implements AsyncFunction in an AbstractRichFunction subclass.

5. SQL‑Based Real‑Time Computing and UI

Initially, SDKs were used to build SQL jobs, which proved unfriendly for end users. The current approach registers data sources and tables (schemas, metadata) in the platform, generates JSON schemas from sample messages, and executes SQL on Flink. Unsupported SQL features are extended via UDFs, and the system tracks dependencies between real‑time jobs for end‑to‑end monitoring.

6. Future Directions

Planned explorations include comparing Flink’s batch/ML modules with Spark, investigating Flink‑Hive integration (FLINK‑10566), and improving scheduler and resource management. Proposed changes separate the scheduler from the execution graph, enable plug‑in schedulers, and support auto‑scaling (issues FLINK‑10404, FLINK‑10429).

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

monitoringBig DataFlinksqlReal-time StreamingYARN
Youzan Coder
Written by

Youzan Coder

Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.