
How We Scaled Fatigue Event Processing to 45K TPS with Apache Flink

By iteratively redesigning the fatigue‑event detection pipeline and leveraging Apache Flink’s stateful stream processing, the team reduced network overhead, cut resource usage to a third, and achieved a stable 45,000 TPS throughput on six containers with 20 GB memory, while outlining three optimization phases and practical lessons.

G7 EasyFlow Tech Circle

Introduction

This article describes a complex fatigue‑event business and how, after multiple iterations, it was made to run smoothly on the Apache Flink distributed stream‑processing engine, reaching a processing capacity of 45,000 TPS with six containers, 24 task slots, and 20 GB of memory.

What Is a Fatigue Event?

Fatigue driving is a major cause of road accidents. It occurs when a driver operates a vehicle for several hours or covers a long distance without rest. In a mature IoT freight‑transport scenario, fatigue calculation must handle driver changes, vehicle stops, offline vehicles, vehicle sharing, and diverse client‑specific logic.

Traditional Fatigue‑Event Calculation

The original architecture involved many components, causing repeated network overhead, high operational complexity, and limited throughput (around 10K TPS). Adding new fatigue‑event requirements impacted the entire platform, and the risk of data loss was high.

Using Flink, the architecture was simplified.

The new design reduced network overhead, eliminated many external dependencies, and leveraged Flink’s built‑in fault‑tolerance, achieving a peak TPS of 45 K while using only one‑third of the original resources.

About Flink

Flink is an Apache open‑source stream‑processing engine that supports both unbounded streams and bounded batch processing. It provides stateful computation, allowing results to depend on previous events, which is essential for complex aggregations and correlations.
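To make "results depend on previous events" concrete, here is a toy illustration of per‑key state, accumulating continuous driving minutes per driver. A plain HashMap stands in for what would be a keyed, checkpointed ValueState in Flink; the class and method names are illustrative, not from the production job.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of per-key state: accumulate driving minutes per driver.
// In Flink this map would be a keyed ValueState<Long>, checkpointed for fault tolerance.
public class StatefulSketch {
    private final Map<String, Long> drivingMinutes = new HashMap<>();

    // Folds the new event into the driver's state and returns the running total.
    public long onLocationEvent(String driverId, long minutesSinceLastFix) {
        long total = drivingMinutes.getOrDefault(driverId, 0L) + minutesSinceLastFix;
        drivingMinutes.put(driverId, total);
        return total;
    }
}
```

The point is that each incoming location event is meaningless on its own; only the accumulated history per driver tells you whether a fatigue threshold has been crossed.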

Continuous Optimization

The fatigue‑event pipeline underwent three optimization stages.

Stage One

All business logic was packed into a single sliding window (LocationWindow), leading to Kafka backlog and insufficient consumption capacity.

Stage Two

Performance bottlenecks were identified in external‑service calls and tightly coupled business logic. The pipeline was split, but three problems remained: database pressure causing back‑pressure, limited concurrency at the Pad‑Location stage, and still‑complex FatigueEvent logic.

Stage Three

Three solutions were applied:

Introduce Guava asynchronous cache and a hash‑based thread pool to improve external‑service latency and increase concurrency.

Refactor FatigueEvent logic using either multiple Flink operators or a responsibility‑chain pattern; the latter avoids extra state overhead.

Offload batch database writes and voice notifications to a Spring Boot service, reducing Flink I/O and handling idempotency.
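The first solution (asynchronous cache plus hash‑based thread pool) can be sketched with JDK primitives in place of Guava's asynchronous LoadingCache. The key idea is that hashing each vehicle to a fixed single‑thread executor bounds concurrency while preserving per‑vehicle ordering; all names below are illustrative.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Sketch: cache slow external-service lookups and route each key to a fixed
// worker thread so events for the same vehicle stay ordered.
public class HashedAsyncLookup {
    private final ConcurrentHashMap<String, CompletableFuture<String>> cache = new ConcurrentHashMap<>();
    private final ExecutorService[] workers;
    private final Function<String, String> remoteCall; // the (slow) external service

    public HashedAsyncLookup(int threads, Function<String, String> remoteCall) {
        this.workers = new ExecutorService[threads];
        for (int i = 0; i < threads; i++) workers[i] = Executors.newSingleThreadExecutor();
        this.remoteCall = remoteCall;
    }

    // The same key always hashes to the same single-thread executor.
    private ExecutorService pick(String key) {
        return workers[Math.floorMod(key.hashCode(), workers.length)];
    }

    public CompletableFuture<String> lookup(String vehicleId) {
        return cache.computeIfAbsent(vehicleId,
                k -> CompletableFuture.supplyAsync(() -> remoteCall.apply(k), pick(k)));
    }

    public void shutdown() {
        for (ExecutorService w : workers) w.shutdown();
    }
}
```

With Guava, `computeIfAbsent` plus the future map would collapse into a `LoadingCache` with an expiry policy; the hash‑pinned executors are the same in either version.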
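The second solution, the responsibility‑chain refactor of FatigueEvent, might look like the following sketch. Each handler either decides the outcome or defers to the next link, so rules can be added without extra Flink operators (and thus without extra state). The handlers and the 4‑hour threshold are illustrative, not the production rule set.

```java
import java.util.List;
import java.util.Optional;

// Responsibility-chain sketch for the FatigueEvent logic: each handler either
// produces a verdict or defers (Optional.empty) to the next handler in the chain.
public class FatigueChain {
    public record LocationEvent(String driverId, long continuousMinutes, boolean resting) {}

    public interface Handler {
        Optional<String> handle(LocationEvent e); // empty = defer to next handler
    }

    private final List<Handler> chain = List.of(
            e -> e.resting() ? Optional.of("RESET") : Optional.empty(),                    // rest clears fatigue
            e -> e.continuousMinutes() >= 240 ? Optional.of("FATIGUE") : Optional.empty(), // illustrative 4h rule
            e -> Optional.of("OK")                                                         // default terminal link
    );

    public String evaluate(LocationEvent e) {
        for (Handler h : chain) {
            Optional<String> r = h.handle(e);
            if (r.isPresent()) return r.get();
        }
        return "OK";
    }
}
```

Compared with splitting each rule into its own Flink operator, the chain keeps everything inside one keyed operator, so no additional state descriptors or network shuffles are introduced.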

Key Operators and Components

Kafka source: reads data from Kafka via Flink's Kafka connector.

String to location: converts JSON strings to location objects and aggregates sub‑device data.

Pad location: handles external‑service calls and filters abnormal data.

FatigueEvent: core fatigue calculation; filters irrelevant data.

NoSignEvent: handles concurrent unsigned events.

Window: 60‑second time window for ordering and handling out‑of‑order GPS data.

Batch persistence program: Spring Boot batch job for database writes and voice dispatch.
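Wired together, the operators above might look like the following DataStream fragment. This is a schematic sketch against the Flink 1.x API, not the production job: the source, sink, operator classes, tumbling‑window choice, and 10‑second out‑of‑orderness bound are all assumptions for illustration.

```java
// Schematic job graph (Flink 1.x DataStream API assumed; names are illustrative).
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromSource(kafkaSource, WatermarkStrategy.noWatermarks(), "kafka-source")
   .map(new StringToLocation())                 // JSON -> Location, aggregate sub-devices
   .map(new PadLocation())                      // external-service enrichment + filtering
   .assignTimestampsAndWatermarks(
       WatermarkStrategy.<Location>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                        .withTimestampAssigner((loc, ts) -> loc.getGpsTime()))
   .keyBy(Location::getVehicleId)               // required before keyed state and windows
   .window(TumblingEventTimeWindows.of(Time.seconds(60)))  // 60 s window re-orders GPS fixes
   .process(new FatigueEventFunction())         // core fatigue calculation on keyed state
   .sinkTo(batchPersistenceSink);               // hand off to the batch persistence service

env.execute("fatigue-event-pipeline");
```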

Practical Recommendations

Catch and handle exceptions within Flink jobs to prevent automatic restarts.

ValueState cannot be shared across operators; use external stores like Redis or pass data downstream.

KeyBy is required for ValueState, which adds network traffic; minimize rebalance operations.

When using Dubbo, set the check property to false to avoid deployment errors.

For high parallelism, consider AtLeastOnce checkpointing to avoid back‑pressure caused by slow subtasks.

Be aware of Flink’s Kryo serialization limits; custom serializers may be needed for complex objects.

Avoid third‑party data structures (e.g., Guava EvictingQueue) in ValueState due to serialization issues.
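To make the ValueState and KeyBy points concrete: keyed state is declared inside the operator and is scoped to the current key established by keyBy(), which is why it cannot be read from another operator. A schematic Flink 1.x fragment (types such as Location and FatigueEvent, and the 4‑hour threshold, are placeholders):

```java
// Schematic: ValueState lives inside a keyed operator, scoped to the current
// key set by keyBy(); another operator cannot read it directly.
public class DrivingTimeFunction extends KeyedProcessFunction<String, Location, FatigueEvent> {
    private transient ValueState<Long> minutesState;

    @Override
    public void open(Configuration parameters) {
        minutesState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("driving-minutes", Long.class));
    }

    @Override
    public void processElement(Location loc, Context ctx, Collector<FatigueEvent> out) throws Exception {
        long minutes = (minutesState.value() == null ? 0L : minutesState.value())
                + loc.getMinutesSinceLastFix();
        minutesState.update(minutes);
        if (minutes >= 240) { // illustrative 4-hour threshold
            out.collect(new FatigueEvent(ctx.getCurrentKey(), minutes));
        }
    }
}
```

Note that Long serializes with Flink's built‑in serializers; as the recommendations above warn, holding a third‑party structure in this state would fall back to Kryo and can break on restore.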

Future Improvements

Improve observability of ValueState (monitoring and inspection) and integrate Flink deployment with continuous‑integration pipelines.

Conclusion

Flink excels at stateful stream processing but is not a universal solution; heavy external I/O should be decoupled from the main stream. Ensure downstream idempotency to handle possible duplicate records. For non‑state‑heavy scenarios, Flink SQL can reduce code size.

Additional resources:

Apache Flink fundamentals: Checkpoint, State, Time, Window.

Related blogs: Flink Tech Evolution, Checkpoint, Watermark, State, Window.
