How OPPO’s ESA DataFlow Handles Billions of Events Daily with High Performance
OPPO's ESA DataFlow is a self‑developed high‑performance data‑flow framework that processes over a trillion events per day, offering flexible routing, scalable sources and sinks, persistent mmap‑based channels, built‑in monitoring, and easy extensibility for diverse data‑collection scenarios.
Background
OPPO Internet services generate hundreds of billions of trace‑chain requests per day. The data must be classified, aggregated, filtered, stored and processed in parallel for alerting and persistence. Open‑source Flume could not satisfy OPPO’s requirements for visualization, monitoring, performance and channel isolation.
Basic Concepts
ESA DataFlow is a high‑performance data‑flow framework developed by OPPO Internet. A single node can process more than 100 billion records per day. Compared with Flume, DataFlow provides:
Flexible message routing – routing rules can direct events to different channels.
High throughput – single‑node throughput exceeds 100 billion events daily; the framework is extensible with built‑in Sources, Channels and Sinks and supports custom serialization.
Integrated monitoring – a management console shows real‑time statistics for input rate, buffer size and consumption rate.
Core Components
DataEvent
The basic unit transferred end‑to‑end. It consists of a header map and a body list.
private Map<String, String> headers = new HashMap<>();
private List<T> body = new ArrayList<>();Source
A Source receives data from an external system (e.g., HTTP) and routes the events into a Channel. Custom sources are created by extending SourceBase.
Channel
A Channel buffers DataEvents until all attached Sinks have consumed them. By default it uses Kryo for serialization, but other serializers (e.g., Protobuf) can be plugged in. Custom channels extend ChannelBase.
Sink
A Sink pulls events from a Channel and writes them to a destination such as Elasticsearch or RocksDB. Each Sink is bound to a single Channel and is implemented by extending SinkBase.
Framework Evolution
Initial version followed the Flume three‑layer model (Source‑Channel‑Sink) with built‑in HttpSource and MemoryChannel .
HttpSource
Implemented with Netty as a high‑performance HTTP server. Under 500 threads (4 CPU 8 GB, 99 % CPU load) and 1 KB request size, the average throughput reaches 130 k TPS.
MemoryChannel
Uses Java BlockingQueue. It provides low latency but large queue sizes can cause OOM and data loss on process restart.
FileQueueChannel
Introduced to improve reliability. It stores events in a file‑based queue using mmap, guaranteeing local persistence across crashes and restarts. Power‑off failures may lose page‑cache data, but mmap dramatically improves I/O efficiency and overall channel throughput.
Parallel Processing Enhancements
FileQueueChannel originally enforced a one‑to‑one Channel‑Sink relationship, limiting flexibility. The framework was extended to allow multiple Sinks to consume the same data in parallel, each with its own thread pool and independent offset tracking. This prevents a failure in one Sink (e.g., Elasticsearch) from blocking other Sinks (e.g., AlarmSink).
Data Shaping
SDKs report variable‑size and heterogeneous messages. By classifying and reshaping these messages inside the Channel, downstream components receive fixed‑size records, which significantly improves write performance.
Message Routing
Routing rules direct events to specific channels based on header values. Example: events with logType=jvm are routed to xxxChannel.
<route-rules>
<rule>
<expression>header.logType="jvm"</expression>
<targetChannel>xxxChannel</targetChannel>
</rule>
</route-rules>Visualization & Monitoring
DataFlow embeds an SQLite database that stores per‑minute metrics for the last seven days and provides a simple web UI. A Prometheus metrics endpoint is also exposed for integration with external monitoring systems.
Application in OPPO
ESA DataFlow is deployed for trace‑chain and log collection. A typical production cluster consists of about ten nodes handling roughly one trillion events per day.
Conclusion
ESA DataFlow enables developers to build high‑throughput data‑collection services quickly. Built‑in components such as HttpSource and FileQueueChannel ensure performance and scalability, while the lightweight architecture and web console facilitate extension and real‑time traffic analysis.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
