Big Data 11 min read

How OPPO’s ESA DataFlow Handles Billions of Events Daily with High Performance

OPPO's ESA DataFlow is a self‑developed high‑performance data‑flow framework that processes over a trillion events per day, offering flexible routing, scalable sources and sinks, persistent mmap‑based channels, built‑in monitoring, and easy extensibility for diverse data‑collection scenarios.

dbaplus Community

Mar 10, 2020

How OPPO’s ESA DataFlow Handles Billions of Events Daily with High Performance

Background

OPPO Internet services generate hundreds of billions of trace‑chain requests per day. The data must be classified, aggregated, filtered, stored and processed in parallel for alerting and persistence. Open‑source Flume could not satisfy OPPO’s requirements for visualization, monitoring, performance and channel isolation.

Basic Concepts

ESA DataFlow is a high‑performance data‑flow framework developed by OPPO Internet. A single node can process more than 100 billion records per day. Compared with Flume, DataFlow provides:

Flexible message routing – routing rules can direct events to different channels.

High throughput – single‑node throughput exceeds 100 billion events daily; the framework is extensible with built‑in Sources, Channels and Sinks and supports custom serialization.

Integrated monitoring – a management console shows real‑time statistics for input rate, buffer size and consumption rate.

Core Components

DataEvent

The basic unit transferred end‑to‑end. It consists of a header map and a body list.

private Map<String, String> headers = new HashMap<>();
private List<T> body = new ArrayList<>();

Source

A Source receives data from an external system (e.g., HTTP) and routes the events into a Channel. Custom sources are created by extending SourceBase.

Channel

A Channel buffers DataEvents until all attached Sinks have consumed them. By default it uses Kryo for serialization, but other serializers (e.g., Protobuf) can be plugged in. Custom channels extend ChannelBase.

Sink

A Sink pulls events from a Channel and writes them to a destination such as Elasticsearch or RocksDB. Each Sink is bound to a single Channel and is implemented by extending SinkBase.

Framework Evolution

Initial version followed the Flume three‑layer model (Source‑Channel‑Sink) with built‑in HttpSource and MemoryChannel .

HttpSource

Implemented with Netty as a high‑performance HTTP server. Under 500 threads (4 CPU 8 GB, 99 % CPU load) and 1 KB request size, the average throughput reaches 130 k TPS.

MemoryChannel

Uses Java BlockingQueue. It provides low latency but large queue sizes can cause OOM and data loss on process restart.

FileQueueChannel

Introduced to improve reliability. It stores events in a file‑based queue using mmap, guaranteeing local persistence across crashes and restarts. Power‑off failures may lose page‑cache data, but mmap dramatically improves I/O efficiency and overall channel throughput.

Parallel Processing Enhancements

FileQueueChannel originally enforced a one‑to‑one Channel‑Sink relationship, limiting flexibility. The framework was extended to allow multiple Sinks to consume the same data in parallel, each with its own thread pool and independent offset tracking. This prevents a failure in one Sink (e.g., Elasticsearch) from blocking other Sinks (e.g., AlarmSink).

Data Shaping

SDKs report variable‑size and heterogeneous messages. By classifying and reshaping these messages inside the Channel, downstream components receive fixed‑size records, which significantly improves write performance.

Message Routing

Routing rules direct events to specific channels based on header values. Example: events with logType=jvm are routed to xxxChannel.

<route-rules>
  <rule>
    <expression>header.logType="jvm"</expression>
    <targetChannel>xxxChannel</targetChannel>
  </rule>
</route-rules>

Visualization & Monitoring

DataFlow embeds an SQLite database that stores per‑minute metrics for the last seven days and provides a simple web UI. A Prometheus metrics endpoint is also exposed for integration with external monitoring systems.

Application in OPPO

ESA DataFlow is deployed for trace‑chain and log collection. A typical production cluster consists of about ten nodes handling roughly one trillion events per day.

Conclusion

ESA DataFlow enables developers to build high‑throughput data‑collection services quickly. Built‑in components such as HttpSource and FileQueueChannel ensure performance and scalability, while the lightweight architecture and web console facilitate extension and real‑time traffic analysis.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

stream processing high performance data ingestion OPPO ESA DataFlow

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.