
Design and Implementation of Vivo's Bees Log Collection Agent

This article presents the design principles, core techniques, and practical solutions of Vivo's self‑developed Bees log collection agent, covering file discovery, unique identification, real‑time and offline ingestion, checkpointing, resource control, platform management, and a comparison with open‑source alternatives.


In the construction of enterprise big‑data systems, data acquisition is the first and most critical step; existing open‑source collectors often cannot meet large‑scale, governed ingestion needs, prompting many companies to develop custom agents. This article shares Vivo's experience designing the Bees log collection agent.

Overview

Data acquisition gathers logs from various sources (application, server, database, IoT) into big‑data storage. Log files are the most common source, and the article outlines a typical architecture consisting of a log‑collection agent, transport/storage (Kafka, HDFS), and a management platform.

Key Features & Capabilities

Real‑time and offline log collection

Non‑intrusive file monitoring

Custom filtering, matching, formatting, and rate‑limiting

Second‑level latency, breakpoint‑resume, and visual task management

Rich monitoring metrics and low resource consumption

Design Principles

Simplicity and elegance

Robustness and stability

Key Design Details

4.1 Log file discovery & listening

Static file lists are insufficient for rotating logs; Bees uses directory paths with wildcard or regex patterns. For efficient change detection, it combines Linux inotify with a fallback polling mechanism. The Java implementation uses WatchService:

/**
 * Subscribe to change events on a file or directory.
 */
public synchronized BeesWatchKey watchDir(File dir, WatchEvent.Kind<?>... watchEvents) throws IOException {
    if (!dir.exists() || dir.isFile()) {
        throw new IllegalArgumentException("watchDir requires an existing directory, param: " + dir);
    }
    Path path = dir.toPath().toAbsolutePath();
    BeesWatchKey beesWatchKey = registeredDirs.get(path);
    if (beesWatchKey == null) {
        beesWatchKey = new BeesWatchKey(subscriber, dir, this, watchEvents);
        registeredDirs.put(path, beesWatchKey);
        logger.info("successfully watch dir: {}", dir);
    }
    return beesWatchKey;
}

public synchronized BeesWatchKey watchDir(File dir) throws IOException {
    WatchEvent.Kind<?>[] events = {
            StandardWatchEventKinds.ENTRY_CREATE,
            StandardWatchEventKinds.ENTRY_DELETE,
            StandardWatchEventKinds.ENTRY_MODIFY
    };
    return watchDir(dir, events);
}
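The wildcard patterns mentioned above must be translated into regular expressions before newly discovered file names can be matched. A minimal sketch of that translation (class and method names are illustrative, not Bees internals):

```java
import java.util.regex.Pattern;

/** Translates shell-style wildcards (*, ?) into an anchored regex for file matching. */
public class WildcardMatcher {

    /** Convert a wildcard pattern such as "access-*.log" into an anchored regex. */
    public static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder("^");
        for (char c : wildcard.toCharArray()) {
            switch (c) {
                case '*': regex.append(".*"); break;   // any run of characters
                case '?': regex.append('.'); break;    // any single character
                default:  regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.append('$').toString());
    }

    public static boolean matches(String wildcard, String fileName) {
        return compile(wildcard).matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("access-*.log", "access-2024-01-01.log")); // true
        System.out.println(matches("access-*.log", "error.log"));            // false
    }
}
```

Each directory event from WatchService would then be filtered through a matcher like this before a new collection task is created for the file.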

4.2 Unique file identification

File names are unreliable for uniqueness; Bees combines the inode number with a SHA-256 hash of the file's leading bytes (SIGN_SIZE, 128 bytes in this example) to form a stable identifier. Example of obtaining the inode:

ls -i access.log
62651787 access.log
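The same inode can be read from Java through the `unix` file-attribute view, which is available on Linux and macOS JDKs. A hypothetical helper, not the Bees API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Reads a file's inode number via the POSIX "unix" attribute view. */
public class InodeReader {

    public static long inodeOf(Path file) throws IOException {
        // "unix:ino" works on Linux/macOS; it throws on filesystems without inodes
        return (Long) Files.getAttribute(file, "unix:ino");
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bees-demo", ".log");
        System.out.println("inode = " + inodeOf(tmp));
        Files.delete(tmp);
    }
}
```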

Signature generation:

public static String signFile(File file) throws IOException {
    String sign = null;
    try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
        if (raf.length() >= SIGN_SIZE) {
            byte[] tbyte = new byte[SIGN_SIZE];
            raf.seek(0);
            raf.readFully(tbyte);  // read() may return fewer bytes; readFully fills the array
            sign = Hashing.sha256().hashBytes(tbyte).toString();
        }
    }
    return sign;
}

4.3 Log content reading

To read continuously appended logs, Bees uses RandomAccessFile with a movable pointer, enabling efficient tail‑reading and checkpointing:

RandomAccessFile raf = new RandomAccessFile(file, "r");
byte[] buffer;

private void readFile() throws IOException {
    long remaining = raf.length() - raf.getFilePointer();
    buffer = new byte[(int) Math.min(remaining, BUFFER_SIZE)];
    raf.readFully(buffer);  // advances the file pointer by buffer.length
}
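Putting the movable pointer to use, a tail read that resumes from a saved offset might look like the following simplified sketch (not the production reader; names are illustrative):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Reads bytes appended after a known offset; the caller persists the new offset. */
public class TailReader {
    private static final int BUFFER_SIZE = 8192;

    public static String readFrom(String path, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long available = raf.length() - offset;
            if (available <= 0) {
                return "";                       // nothing appended yet
            }
            raf.seek(offset);                    // jump to the checkpointed position
            byte[] buffer = new byte[(int) Math.min(available, BUFFER_SIZE)];
            raf.readFully(buffer);
            return new String(buffer, StandardCharsets.UTF_8);
        }
    }
}
```

The next offset to checkpoint is simply the old offset plus the number of bytes returned.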

4.4 Checkpoint & resume

Agents record the current file pointer after each successful Kafka transmission and persist it to a local JSON file (example below) every few seconds. On restart, the agent loads the checkpoint and seeks to the saved position, ensuring no data loss (at-least-once delivery).

[
    {
        "file": "/home/sample/logs/bees-agent.log",
        "inode": 2235528,
        "pos": 621,
        "sign": "cb8730c1d4a71adc4e5b48931db528e30a5b5c1e99a900ee13e1fe5f935664f1"
    }
]
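A minimal persistence sketch of the same idea, using `Properties` instead of JSON to stay dependency-free (field names mirror the example above; this is not the Bees implementation):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Persists one file's read position so the agent can resume after a restart. */
public class Checkpoint {

    public static void save(Path store, String file, long inode, long pos, String sign)
            throws IOException {
        Properties p = new Properties();
        p.setProperty("file", file);
        p.setProperty("inode", Long.toString(inode));
        p.setProperty("pos", Long.toString(pos));
        p.setProperty("sign", sign);
        try (Writer w = Files.newBufferedWriter(store)) {
            p.store(w, "bees checkpoint");
        }
    }

    /** Returns the saved position, or 0 when no checkpoint exists yet. */
    public static long loadPos(Path store) throws IOException {
        if (!Files.exists(store)) {
            return 0L;
        }
        Properties p = new Properties();
        try (Reader r = Files.newBufferedReader(store)) {
            p.load(r);
        }
        return Long.parseLong(p.getProperty("pos", "0"));
    }
}
```

On startup the agent would call `loadPos`, then seek the `RandomAccessFile` to that offset before reading.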

4.5 Real‑time data sending

Data is sent directly to Kafka via a Netty‑based RPC client, with an intermediate bees‑bus component that aggregates traffic, provides cross‑region failover, and buffers spikes.
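The buffering role of bees-bus can be approximated by a size- and age-bounded batch that is flushed to the downstream sender. A toy sketch under that assumption (the `Consumer` stands in for a Kafka producer wrapper; none of these names are Bees internals):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Accumulates log lines and flushes when the batch is full or too old. */
public class BatchBuffer {
    private final int maxLines;
    private final long maxAgeMillis;
    private final Consumer<List<String>> sender;  // e.g. wraps a Kafka producer
    private final List<String> batch = new ArrayList<>();
    private long firstAppendAt;

    public BatchBuffer(int maxLines, long maxAgeMillis, Consumer<List<String>> sender) {
        this.maxLines = maxLines;
        this.maxAgeMillis = maxAgeMillis;
        this.sender = sender;
    }

    public synchronized void append(String line) {
        if (batch.isEmpty()) {
            firstAppendAt = System.currentTimeMillis();
        }
        batch.add(line);
        if (batch.size() >= maxLines
                || System.currentTimeMillis() - firstAppendAt >= maxAgeMillis) {
            flush();
        }
    }

    public synchronized void flush() {
        if (!batch.isEmpty()) {
            sender.accept(new ArrayList<>(batch));  // hand a copy downstream
            batch.clear();
        }
    }
}
```

Batching by size and age is what lets an aggregator absorb traffic spikes instead of passing every line through individually.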

4.6 Offline collection

For batch analytics, Bees writes completed hourly logs to HDFS using FSDataOutputStream. A rate-limiting mechanism smooths the burst at the top of each hour to protect network bandwidth and disk I/O.
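The hourly upload amounts to copying each completed file into an hour-partitioned destination path. In the sketch below a local `OutputStream` stands in for Hadoop's `FSDataOutputStream`, and the `yyyyMMdd/HH` layout is an assumption for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Copies a completed hourly log into an hour-partitioned destination directory. */
public class HourlyUploader {
    private static final DateTimeFormatter HOUR_DIR =
            DateTimeFormatter.ofPattern("yyyyMMdd/HH");

    public static Path upload(Path source, Path destRoot, LocalDateTime hour)
            throws IOException {
        Path destDir = destRoot.resolve(hour.format(HOUR_DIR));
        Files.createDirectories(destDir);
        Path dest = destDir.resolve(source.getFileName());
        // On HDFS this would be fs.create(path), which returns an FSDataOutputStream
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = Files.newOutputStream(dest)) {
            in.transferTo(out);
        }
        return dest;
    }
}
```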

4.7 Log cleanup strategy

A shell script deletes logs older than six hours. Even after unlinking, the agent can continue reading an open file handle until the data is fully consumed.
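In production the cleanup is a shell script; the same rule expressed in Java might look like this (the six-hour threshold matches the text, everything else is illustrative):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

/** Deletes regular files whose last modification is older than the given age. */
public class LogCleaner {

    public static int deleteOlderThan(Path dir, Duration maxAge) throws IOException {
        Instant cutoff = Instant.now().minus(maxAge);
        int deleted = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)
                        && Files.getLastModifiedTime(f).toInstant().isBefore(cutoff)) {
                    Files.delete(f);  // a reader holding the fd can still drain the file
                    deleted++;
                }
            }
        }
        return deleted;
    }
}
```

Because deletion only unlinks the directory entry, an agent that still holds the file descriptor keeps reading until it has consumed the remaining data.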

4.8 Resource consumption & control

CPU isolation via taskset

JVM heap limits (default 512 MB, minimum 64 MB)

Disk I/O throttling (e.g., 3‑5 MB/s) and monitoring

Network bandwidth monitoring and adaptive rate control
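The I/O throttling above can be modeled as a token bucket that refills at the configured byte rate, with reads spending tokens before touching the disk. A minimal sketch, not the production limiter:

```java
/** A simple token bucket: tokens refill at bytesPerSecond, reads spend them. */
public class ByteRateLimiter {
    private final long bytesPerSecond;
    private double availableBytes;
    private long lastRefillNanos;

    public ByteRateLimiter(long bytesPerSecond) {
        this.bytesPerSecond = bytesPerSecond;
        this.availableBytes = bytesPerSecond;  // start with one second's budget
        this.lastRefillNanos = System.nanoTime();
    }

    /** Blocks until {@code bytes} tokens are available, then consumes them. */
    public synchronized void acquire(long bytes) throws InterruptedException {
        refill();
        while (availableBytes < bytes) {
            long deficit = (long) (bytes - availableBytes);
            long sleepMillis = Math.max(1, deficit * 1000 / bytesPerSecond);
            Thread.sleep(sleepMillis);
            refill();
        }
        availableBytes -= bytes;
    }

    private void refill() {
        long now = System.nanoTime();
        availableBytes = Math.min(bytesPerSecond,
                availableBytes + (now - lastRefillNanos) / 1e9 * bytesPerSecond);
        lastRefillNanos = now;
    }
}
```

Calling `acquire(buffer.length)` before each disk read caps sustained throughput at the configured rate (e.g. 3-5 MB/s) while still allowing short bursts up to one second's budget.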

4.9 Self‑monitoring

Agents also collect their own logs, sending them through the same pipeline to Kafka → Elasticsearch → Kibana for unified visibility.

4.10 Platform management

A centralized UI provides version view, heartbeat, task dispatch, start/stop, and rate‑limit control via periodic HTTP heartbeats.

Comparison with Open‑Source Agents

Much lower memory footprint (no Channel, JVM as low as 64 MB)

Sub‑second latency using inotify vs. polling in Flume

Accurate file identity via inode + signature

Thread‑level isolation per topic

Graceful shutdown without data loss

Richer metrics (rate, progress, JVM stats, GC count)

Extensible filtering, formatting, and platform integration

Conclusion

The Bees agent has been in production since 2019, serving tens of thousands of instances and ingesting petabytes of logs daily. Its design—covering discovery, unique identification, real‑time/offline ingestion, checkpointing, resource control, and platform management—demonstrates a robust, scalable solution for enterprise‑level log collection.

Tags: Java, Big Data, Resource Management, Kafka, Agent Design, log collection, inotify
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
