
Design and Implementation of Vivo's Bees Log Collection Agent

Vivo’s Bees‑agent is a custom, lightweight log‑collection service that discovers rotating files via inotify, uniquely identifies them with inode and hash signatures, supports real‑time and offline ingestion to Kafka and HDFS, offers checkpoint‑resume, resource isolation, rich metrics, and a centralized management platform, outperforming open‑source collectors in latency, memory usage, and scalability.

vivo Internet Technology

In the construction of enterprise big‑data systems, data collection is the first and most critical step. Existing open‑source collectors often cannot meet the scale and governance requirements of large enterprises, leading many to develop custom agents. This article shares Vivo's experience designing and implementing the Bees log‑collection service, focusing on the bees‑agent component.

Overview

The log‑collection pipeline typically consists of a log‑collection Agent (e.g., Flume, Logstash, Scribe, Filebeat, Fluentd, or custom solutions), a transport/storage layer (Kafka, HDFS), and a management platform. Bees is Vivo's self‑developed solution that integrates these parts.

Key Features & Capabilities

Real‑time and offline log file collection.

Non‑intrusive, file‑based collection.

Customizable filtering for large logs.

Configurable matching, formatting, and rate‑limiting.

Second‑level latency for real‑time collection.

Checkpoint‑resume capability to avoid data loss during upgrades or crashes.

Centralized visual management platform.

Rich monitoring metrics and alerts.

Low resource consumption (CPU, memory, disk, network).

Design Principles

Simplicity and elegance.

Robustness and stability.

Key Design Details

1. Log file discovery and listening

Static file lists are insufficient for rotating logs. Bees‑agent uses directory paths combined with wildcard patterns (e.g., /data/sample/logs/access.*.log) to match new files. For efficient change detection it relies primarily on Linux inotify events, with a polling fallback. The core Java implementation of directory watching is:

/**
 * Subscribe to file or directory change events
 */
public synchronized BeesWatchKey watchDir(File dir, WatchEvent.Kind<?>... watchEvents) throws IOException {
    // The argument must be an existing directory, not a regular file.
    if (!dir.exists() || dir.isFile()) {
        throw new IllegalArgumentException("watchDir requires an existing directory, param: " + dir);
    }
    Path path = dir.toPath().toAbsolutePath();
    BeesWatchKey beesWatchKey = registeredDirs.get(path);
    if (beesWatchKey == null) {
        beesWatchKey = new BeesWatchKey(subscriber, dir, this, watchEvents);
        registeredDirs.put(path, beesWatchKey);
        logger.info("successfully watch dir: {}", dir);
    }
    return beesWatchKey;
}

public synchronized BeesWatchKey watchDir(File dir) throws IOException {
    WatchEvent.Kind<?>[] events = {
        StandardWatchEventKinds.ENTRY_CREATE,
        StandardWatchEventKinds.ENTRY_DELETE,
        StandardWatchEventKinds.ENTRY_MODIFY
    };
    return watchDir(dir, events);
}
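The wildcard side of discovery can be illustrated with Java NIO's glob `PathMatcher`. This is a minimal sketch of the matching idea only (the article does not show bees-agent's actual matcher); the class and method names here are hypothetical:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Sketch: decide whether a newly created file name matches a configured
// wildcard pattern such as "access.*.log".
public class WildcardMatchSketch {
    public static boolean matches(String pattern, String fileName) {
        // "glob:" patterns treat '*' as "any characters within one path segment"
        PathMatcher matcher = FileSystems.getDefault()
                .getPathMatcher("glob:" + pattern);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        System.out.println(matches("access.*.log", "access.2024-01-01.log")); // true
        System.out.println(matches("access.*.log", "error.2024-01-01.log"));  // false
    }
}
```

In a real agent, a check like this would run on each inotify ENTRY_CREATE event to decide whether the new file belongs to a collection task.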

2. Unique identification of log files

File names can be reused, so Bees‑agent uses the inode number combined with a file signature (SHA‑256 hash of the first 128 bytes) to uniquely identify a file. Example commands and code:

ls -i access.log
62651787 access.log

public static String signFile(File file) throws IOException {
    String sign = null;
    try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
        if (raf.length() >= SIGN_SIZE) {
            byte[] head = new byte[SIGN_SIZE];
            raf.seek(0);
            raf.readFully(head); // readFully guarantees all SIGN_SIZE bytes are read
            sign = Hashing.sha256().hashBytes(head).toString();
        }
    }
    return sign;
}
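The `ls -i` example above reads the inode from the shell; the same value can be obtained from Java. This is a sketch of an assumed approach (not code from the article), using the "unix" file-attribute view available on Linux and macOS JDKs:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: read a file's inode number via the "unix" attribute view.
// Throws UnsupportedOperationException on platforms without that view.
public class InodeSketch {
    public static long inodeOf(Path path) throws Exception {
        return (Long) Files.getAttribute(path, "unix:ino");
    }
}
```

Combining this inode with the content signature gives the stable identity that survives renames during log rotation.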

3. Log content reading

To read newly appended lines without re‑reading the whole file, Bees‑agent uses RandomAccessFile with a movable pointer:

RandomAccessFile raf = new RandomAccessFile(file, "r");
byte[] buffer;

private void readFile() throws IOException {
    long remaining = raf.length() - raf.getFilePointer();
    if (remaining < BUFFER_SIZE) {
        buffer = new byte[(int) remaining];
    } else {
        buffer = new byte[BUFFER_SIZE];
    }
    // read() advances the file pointer, so the next call resumes where this one stopped
    raf.read(buffer, 0, buffer.length);
}

4. Checkpoint‑resume

During normal operation the current read position is persisted to a local JSON file every few seconds. After a crash or restart, the agent loads this checkpoint and seeks to the saved offset, so collection resumes where it left off without losing data.

[
    {
        "file": "/home/sample/logs/bees-agent.log",
        "inode": 2235528,
        "pos": 621,
        "sign": "cb8730c1d4a71adc4e5b48931db528e30a5b5c1e99a900ee13e1fe5f935664f1"
    }
]
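The save-and-restore cycle can be sketched as follows. Note this is a simplified illustration: the real agent persists JSON as shown above, while this dependency-free sketch stores the same fields with `java.util.Properties`:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Sketch: persist and restore a read-position checkpoint.
public class CheckpointSketch {
    public static void save(Path ckpt, String file, long inode, long pos, String sign)
            throws IOException {
        Properties p = new Properties();
        p.setProperty("file", file);
        p.setProperty("inode", Long.toString(inode));
        p.setProperty("pos", Long.toString(pos));
        p.setProperty("sign", sign);
        try (OutputStream out = Files.newOutputStream(ckpt)) {
            p.store(out, "checkpoint sketch");
        }
    }

    public static long loadPos(Path ckpt) throws IOException {
        Properties p = new Properties();
        try (InputStream in = Files.newInputStream(ckpt)) {
            p.load(in);
        }
        // On restart the agent would raf.seek(pos) after verifying inode + sign
        return Long.parseLong(p.getProperty("pos", "0"));
    }
}
```

On restart, the inode and signature are checked first: if they no longer match, the file was rotated or replaced and the saved offset must be discarded rather than reused.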

5. Real‑time data sending

Collected logs are sent to Kafka via a Netty‑based bees‑bus component, which aggregates traffic, reduces the number of direct connections to Kafka brokers, and provides buffering for burst scenarios.

6. Offline collection

For batch analytics, the agent can write logs to HDFS using FSDataOutputStream. Rate‑limiting is applied to avoid network and disk spikes during peak collection windows.

7. Resource control

CPU isolation via Linux taskset to bind the agent to specific cores.

Memory limits set through JVM heap options (default 512 MB, minimum 64 MB).

Disk I/O throttling based on configurable max rates (e.g., 3 MB/s).

Network bandwidth monitoring and adaptive rate adjustment.
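A disk or network rate cap like the 3 MB/s example above is commonly implemented with a token bucket. The article does not show bees-agent's limiter, so the following is only a minimal sketch of the technique:

```java
// Sketch of a token-bucket rate limiter: tokens represent bytes, refilled
// at the sustained rate; a read/write proceeds only if enough tokens remain.
public class TokenBucketSketch {
    private final long capacity;      // burst size in bytes
    private final long refillPerSec;  // sustained rate in bytes/sec
    private double tokens;
    private long lastRefillNanos;

    public TokenBucketSketch(long capacity, long refillPerSec) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity; // start with a full bucket
        this.lastRefillNanos = System.nanoTime();
    }

    /** Returns true if `bytes` may be transferred now, consuming tokens. */
    public synchronized boolean tryAcquire(long bytes) {
        long now = System.nanoTime();
        double refilled = (now - lastRefillNanos) / 1e9 * refillPerSec;
        tokens = Math.min(capacity, tokens + refilled);
        lastRefillNanos = now;
        if (tokens >= bytes) {
            tokens -= bytes;
            return true;
        }
        return false; // caller backs off and retries later
    }
}
```

When `tryAcquire` returns false, the collection thread sleeps briefly instead of reading, which smooths I/O to the configured ceiling.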

8. Self‑monitoring and platform management

The agent also collects its own log4j output, sending it downstream to Kafka → Elasticsearch → Kibana for visualization. A centralized management platform provides version view, heartbeat, task dispatch, start/stop control, and rate‑limit configuration via simple HTTP heartbeats.

Comparison with Open‑Source Agents

Much lower memory footprint (no Flume‑style Channel component; the JVM can run with as little as 64 MB).

Sub‑second latency using inotify vs. polling.

Accurate file identification with inode + signature.

Thread‑level isolation per topic.

Graceful shutdown via platform commands.

Rich metrics (throughput, JVM stats, GC counts).

Extensible filtering, formatting, and platform integration.

Conclusion

The Bees‑agent design addresses the full lifecycle of log collection—discovery, unique identification, real‑time and offline ingestion, resource management, and centralized control—demonstrating a robust, scalable solution for enterprise‑level big‑data pipelines.

Tags: Java, Big Data, Kafka, Agent Design, log collection, HDFS, inotify