Big Data 26 min read

Designing a Scalable Log Collection Agent: Lessons from Vivo’s Bees‑Agent

This article details the end‑to‑end design of Vivo’s custom log‑collection agent, covering file discovery with inotify, unique file identification using inode and content hash, real‑time reading via RandomAccessFile, checkpointing, Kafka integration, offline HDFS ingestion, resource throttling, and platform‑wide management, while comparing it with open‑source alternatives.

Architect
Architect
Architect
Designing a Scalable Log Collection Agent: Lessons from Vivo’s Bees‑Agent

Problem Statement

Enterprise‑scale log collection must handle millions of rotating log files, guarantee sub‑second latency, avoid data loss on crashes or upgrades, and consume minimal resources. Open‑source agents (Flume, Logstash, Filebeat, Fluentd) rely on fixed file lists, polling, or filename‑only identification, which leads to high memory footprints, latency >1 s, duplicate collection, and difficulty managing thousands of tasks.

Design Goals

Simple and elegant architecture

Robustness and stability under high concurrency

Real‑time and batch collection modes

Non‑intrusive file harvesting

Configurable filtering, formatting, and rate‑limiting

Checkpoint‑based resume without data loss

Centralized visual task management

Rich metrics for monitoring and alerting

Key Technical Components

1. File Discovery & Listening

Log files are generated with timestamps (e.g., /data/sample/logs/access.2021110820.log). A directory path plus glob pattern (e.g., access.*.log) is used to match new files. Discovery combines:

Linux inotify via java.nio.file.WatchService for event‑driven notifications (CREATE, DELETE, MODIFY).

A short‑interval poll (fallback) to handle missed events.

Core registration code:

public synchronized BeesWatchKey watchDir(File dir, WatchEvent.Kind<?>... watchEvents) throws IOException {
    if (!dir.exists() && dir.isFile()) {
        throw new IllegalArgumentException("watchDir requires an exist directory, param: " + dir);
    }
    Path path = dir.toPath().toAbsolutePath();
    BeesWatchKey key = registeredDirs.get(path);
    if (key == null) {
        key = new BeesWatchKey(subscriber, dir, this, watchEvents);
        registeredDirs.put(path, key);
        logger.info("successfully watch dir: {}", dir);
    }
    return key;
}

public synchronized BeesWatchKey watchDir(File dir) throws IOException {
    WatchEvent.Kind<?>[] events = {
        StandardWatchEventKinds.ENTRY_CREATE,
        StandardWatchEventKinds.ENTRY_DELETE,
        StandardWatchEventKinds.ENTRY_MODIFY
    };
    return watchDir(dir, events);
}

2. Unique File Identification

Filename alone is insufficient because rotated logs reuse names (e.g., access.log). The agent first reads the inode number ( ls -i) which remains stable across moves/renames. To guard against inode reuse after deletion, a 128‑byte SHA‑256 hash of the file header is appended, forming a composite identifier.

ls -i access.log
62651787 access.log
public static String signFile(File file) throws IOException {
    String path = file.getAbsolutePath();
    RandomAccessFile raf = new RandomAccessFile(path, "r");
    if (raf.length() >= SIGN_SIZE) {
        byte[] head = new byte[SIGN_SIZE];
        raf.seek(0);
        raf.read(head);
        return Hashing.sha256().hashBytes(head).toString();
    }
    return null;
}

3. Incremental Log Reading

After a file is opened, RandomAccessFile seeks to the last known offset and reads only the delta. When the end of file is reached, the thread blocks until new data arrives.

RandomAccessFile raf = new RandomAccessFile(file, "r");
byte[] buffer;
private void readFile() throws IOException {
    long remaining = raf.length() - raf.getFilePointer();
    if (remaining < BUFFER_SIZE) {
        buffer = new byte[(int) remaining];
    } else {
        buffer = new byte[BUFFER_SIZE];
    }
    raf.read(buffer, 0, buffer.length);
}

4. Checkpointing & Resume

After each successful batch is sent to Kafka, the current file pointer is stored in memory. A background task (default every 3 seconds ) persists the checkpoint to a local JSON file, enabling seamless resume after crashes, OOM restarts, or agent upgrades.

[
    {
        "file": "/home/sample/logs/bees-agent.log",
        "inode": 2235528,
        "pos": 621,
        "sign": "cb8730c1d4a71adc4e5b48931db528e30a5b5c1e99a900ee13e1fe5f935664f1"
    }
]

5. Real‑time Data Sending

Instead of a direct Kafka client, agents push records to an intermediate component called bees‑bus . The bus aggregates streams, reduces the number of TCP connections to Kafka brokers, provides cross‑region failover, and buffers spikes to avoid broker back‑pressure. Communication uses Netty RPC (similar to Flume’s NettyAvroRpcClient).

6. Offline (Batch) Collection

For analytics pipelines, the agent can push complete log files to HDFS using FSDataOutputStream. A scheduled hourly job picks the previous hour’s log (e.g., access.2021110820.log) and writes it to a configured HDFS directory.

7. Log Cleanup Strategy

A shell script runs on each host, deleting files older than 6 hours . Because the agent keeps the file handle open, the underlying inode remains readable even after the directory entry is unlinked, allowing the agent to finish reading the remaining data.

8. Resource Control

CPU : Bind the Java process to specific cores with taskset (or Linux cset) to avoid interference with business workloads.

Memory : JVM heap defaults to 512 MB ; the minimum viable setting is 64 MB .

Disk I/O : Throttle read rate (e.g., 3 MB/s or 5 MB/s ) based on observed baseline.

Network : Agents report bandwidth usage to the bus; the bus can instruct agents to lower their sending rate when cross‑region thresholds are approached.

9. Self‑Monitoring & Platform Management

Agent logs are collected by a dedicated task and sent through the same pipeline (Kafka → Elasticsearch → Kibana). A lightweight HTTP heartbeat (default every 5 minutes ) reports health, current tasks, and receives configuration updates (start/stop, rate limits, etc.). The management platform provides centralized view, version tracking, and remote control of tens of thousands of agents.

Comparison with Open‑Source Agents

Memory footprint : Bees‑agent eliminates Flume’s channel, allowing a JVM heap as low as 64 MB (Flume typically requires > 256 MB).

Latency : Inotify‑driven events keep end‑to‑end delay under 1 second , versus Flume’s polling‑based approach (often > 5 seconds).

File uniqueness : Inode + hash avoids duplicate collection; Flume’s filename‑only strategy can miss or double‑collect files.

Resource isolation : Separate threads per topic prevent interference; Flume shares a single channel.

Graceful shutdown : Platform‑issued stop command lets the agent exit without data loss; Flume may lose in‑flight data.

Metrics richness : Bees‑agent exports collection rate, progress, JVM stats, GC counts, and per‑topic metrics.

Customization : Built‑in keyword filtering, formatting, and centralized management.

Conclusion

The Bees‑agent design demonstrates a production‑grade log‑collection service that meets sub‑second latency, low resource consumption, robust checkpointing, and seamless integration with both streaming (Kafka) and batch (HDFS) pipelines. Core techniques—glob‑based discovery, inotify + fallback polling, inode + hash identification, RandomAccessFile incremental reads, JSON checkpoint persistence, and a bus‑mediated data path—provide a reusable blueprint for large‑scale enterprises facing similar ingestion challenges.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataResource ManagementKafkaAgent Designlog collectioninotifyoffline ingestion
Architect
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.