Designing a Scalable Log Collection Agent: Lessons from Vivo’s Bees‑Agent
This article details the end‑to‑end design of Vivo’s custom log‑collection agent, covering file discovery with inotify, unique file identification using inode and content hash, real‑time reading via RandomAccessFile, checkpointing, Kafka integration, offline HDFS ingestion, resource throttling, and platform‑wide management, while comparing it with open‑source alternatives.
Problem Statement
Enterprise‑scale log collection must handle millions of rotating log files, guarantee sub‑second latency, avoid data loss on crashes or upgrades, and consume minimal resources. Open‑source agents (Flume, Logstash, Filebeat, Fluentd) rely on fixed file lists, polling, or filename‑only identification, which leads to high memory footprints, latency >1 s, duplicate collection, and difficulty managing thousands of tasks.
Design Goals
Simple and elegant architecture
Robustness and stability under high concurrency
Real‑time and batch collection modes
Non‑intrusive file harvesting
Configurable filtering, formatting, and rate‑limiting
Checkpoint‑based resume without data loss
Centralized visual task management
Rich metrics for monitoring and alerting
Key Technical Components
1. File Discovery & Listening
Log files are generated with timestamps (e.g., /data/sample/logs/access.2021110820.log). A directory path plus glob pattern (e.g., access.*.log) is used to match new files. Discovery combines:
Linux inotify via java.nio.file.WatchService for event‑driven notifications (CREATE, DELETE, MODIFY).
A short‑interval poll (fallback) to handle missed events.
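The combination of glob matching, event-driven notification, and fallback polling can be sketched as follows. This is a minimal illustration, not Bees‑agent's code: the class GlobWatcher and its methods matches, scan, and pollCreated are hypothetical names. Only the java.nio.file calls are real APIs.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not Bees-agent code): one watched directory plus a glob filter.
class GlobWatcher implements AutoCloseable {
    private final Path dir;
    private final PathMatcher matcher;
    private final WatchService watcher;

    GlobWatcher(Path dir, String glob) throws IOException {
        this.dir = dir;
        this.matcher = dir.getFileSystem().getPathMatcher("glob:" + glob);
        this.watcher = dir.getFileSystem().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);
    }

    // True if a bare file name such as "access.2021110820.log" matches the glob.
    boolean matches(String fileName) {
        return matcher.matches(dir.getFileSystem().getPath(fileName));
    }

    // Fallback poll: list the directory to catch files whose events were missed.
    List<Path> scan() throws IOException {
        List<Path> found = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) {
                if (Files.isRegularFile(p) && matcher.matches(p.getFileName())) {
                    found.add(p);
                }
            }
        }
        return found;
    }

    // Event-driven path: wait up to timeoutMillis and return newly created matches.
    List<Path> pollCreated(long timeoutMillis) throws InterruptedException {
        List<Path> created = new ArrayList<>();
        WatchKey key = watcher.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (key == null) {
            return created; // no events within the timeout
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                Path name = (Path) event.context();
                if (matcher.matches(name)) {
                    created.add(dir.resolve(name));
                }
            }
            // OVERFLOW means events were dropped; the next scan() picks up the gap.
        }
        key.reset(); // re-arm the key so later events are delivered
        return created;
    }

    @Override
    public void close() throws IOException {
        watcher.close();
    }
}
```

A caller would typically loop on pollCreated and run scan on a timer, merging both results into the set of files to collect.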
Core registration code:
public synchronized BeesWatchKey watchDir(File dir, WatchEvent.Kind<?>... watchEvents) throws IOException {
    // Reject paths that do not exist or that point at a regular file.
    if (!dir.exists() || dir.isFile()) {
        throw new IllegalArgumentException("watchDir requires an existing directory, param: " + dir);
    }
    Path path = dir.toPath().toAbsolutePath();
    // Register each directory only once; later calls reuse the existing key.
    BeesWatchKey key = registeredDirs.get(path);
    if (key == null) {
        key = new BeesWatchKey(subscriber, dir, this, watchEvents);
        registeredDirs.put(path, key);
        logger.info("successfully watch dir: {}", dir);
    }
    return key;
}
// Convenience overload: watch for create, delete, and modify events.
public synchronized BeesWatchKey watchDir(File dir) throws IOException {
    WatchEvent.Kind<?>[] events = {
        StandardWatchEventKinds.ENTRY_CREATE,
        StandardWatchEventKinds.ENTRY_DELETE,
        StandardWatchEventKinds.ENTRY_MODIFY
    };
    return watchDir(dir, events);
}
2. Unique File Identification
Filename alone is insufficient because rotated logs reuse names (e.g., access.log). The agent first reads the inode number (ls -i), which remains stable across moves and renames within the same file system. To guard against inode reuse after deletion, a SHA‑256 hash of the first 128 bytes of the file is appended, forming a composite identifier.
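A minimal sketch of how such a composite identifier could be built with the JDK alone. The class FileIdentity, its method names, and the "inode:hash" key format are assumptions made for illustration. The "unix:ino" attribute is only available on POSIX file systems, and readNBytes requires Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not Bees-agent code): inode plus header hash as a file key.
class FileIdentity {
    static final int SIGN_SIZE = 128; // header length in bytes, as stated in the text

    // Inode number; stable across renames within one file system (POSIX only).
    static long inode(Path file) throws IOException {
        return ((Number) Files.getAttribute(file, "unix:ino")).longValue();
    }

    // Hex SHA-256 of the first SIGN_SIZE bytes, or null if the file is still too short.
    static String headerHash(Path file) throws IOException, NoSuchAlgorithmException {
        byte[] head = new byte[SIGN_SIZE];
        try (InputStream in = Files.newInputStream(file)) {
            if (in.readNBytes(head, 0, SIGN_SIZE) < SIGN_SIZE) {
                return null;
            }
        }
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(head);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Assumed key format "inode:hash"; null until the header can be signed.
    static String compositeId(Path file) throws IOException, NoSuchAlgorithmException {
        String hash = headerHash(file);
        return hash == null ? null : inode(file) + ":" + hash;
    }
}
```

Because neither the inode nor the header changes on rename, a file rotated from access.log to access.log.1 keeps the same key, while a new file that happens to receive a recycled inode gets a different key as soon as its first 128 bytes differ.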
ls -i access.log
62651787 access.log
// Hashing is Guava's com.google.common.hash.Hashing.
public static String signFile(File file) throws IOException {
    // try-with-resources closes the handle even if reading fails.
    try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
        if (raf.length() < SIGN_SIZE) {
            return null; // file is still too short to sign
        }
        byte[] head = new byte[SIGN_SIZE];
        raf.seek(0);
        raf.readFully(head); // read() alone may return fewer bytes than requested
        return Hashing.sha256().hashBytes(head).toString();
    }
}
3. Incremental Log Reading
After a file is opened, RandomAccessFile seeks to the last known offset and reads only the delta. When the end of file is reached, the thread blocks until new data arrives.
RandomAccessFile raf = new RandomAccessFile(file, "r");
byte[] buffer;
private void readFile() throws IOException {
    long remaining = raf.length() - raf.getFilePointer();
    if (remaining <= 0) {
        buffer = new byte[0]; // nothing new yet (or the file was truncated)
        return;
    }
    // Read at most BUFFER_SIZE bytes per call.
    buffer = new byte[(int) Math.min(remaining, BUFFER_SIZE)];
    raf.readFully(buffer); // read() alone may return fewer bytes than requested
}
4. Checkpointing & Resume
After each successful batch is sent to Kafka, the current file pointer is stored in memory. A background task (default every 3 seconds) persists the checkpoint to a local JSON file, enabling seamless resume after crashes, OOM restarts, or agent upgrades.
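The split between in-memory updates and periodic persistence might look like the sketch below. CheckpointStore and its methods are hypothetical, entries are keyed by file path for simplicity, and the write-to-temp-then-rename step is an assumed design choice rather than something the article describes. JSON is assembled by hand to avoid a library dependency, with only minimal escaping.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical checkpoint store (not Bees-agent code).
class CheckpointStore {
    // One entry per collected file, mirroring the fields of the checkpoint JSON.
    static final class Entry {
        final String file; final long inode; final long pos; final String sign;
        Entry(String file, long inode, long pos, String sign) {
            this.file = file; this.inode = inode; this.pos = pos; this.sign = sign;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentSkipListMap<>();
    private final Path target;

    CheckpointStore(Path target) { this.target = target; }

    // Called after a batch is acknowledged: memory only, no disk I/O.
    void update(String file, long inode, long pos, String sign) {
        entries.put(file, new Entry(file, inode, pos, sign));
    }

    // Minimal escaping for backslashes and quotes in paths.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    String toJson() {
        StringBuilder sb = new StringBuilder("[");
        boolean first = true;
        for (Entry e : entries.values()) {
            if (!first) sb.append(",");
            first = false;
            sb.append("{\"file\":\"").append(escape(e.file))
              .append("\",\"inode\":").append(e.inode)
              .append(",\"pos\":").append(e.pos)
              .append(",\"sign\":\"").append(escape(e.sign)).append("\"}");
        }
        return sb.append("]").toString();
    }

    // Write a temp file, then rename it over the target, so a crash mid-write
    // does not leave a truncated checkpoint at the target path.
    // Replacing an existing target with ATOMIC_MOVE is platform-specific; it works on Linux.
    void flush() throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, toJson().getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    // Persist every periodSeconds (the text mentions a 3-second default).
    void start(ScheduledExecutorService scheduler, long periodSeconds) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                flush();
            } catch (IOException e) {
                // Keep the previous checkpoint on disk and retry on the next tick.
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }
}
```

With this layout, a crash loses at most the offsets advanced since the last flush, which leads to a short re-read on restart rather than a gap.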
Example checkpoint file:
[
  {
    "file": "/home/sample/logs/bees-agent.log",
    "inode": 2235528,
    "pos": 621,
    "sign": "cb8730c1d4a71adc4e5b48931db528e30a5b5c1e99a900ee13e1fe5f935664f1"
  }
]
5. Real‑time Data Sending
Instead of using a Kafka client directly, agents push records to an intermediate component called bees‑bus. The bus aggregates streams, reduces the number of TCP connections to Kafka brokers, provides cross‑region failover, and buffers spikes to avoid broker back‑pressure. Communication uses Netty RPC (similar to Flume’s NettyAvroRpcClient).
6. Offline (Batch) Collection
For analytics pipelines, the agent can push complete log files to HDFS using FSDataOutputStream. A scheduled hourly job picks the previous hour’s log (e.g., access.2021110820.log) and writes it to a configured HDFS directory.
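Selecting the previous hour's file comes down to formatting a timestamp into the file name. The sketch below shows that step only. HourlyFileNamer and previousHourFile are hypothetical names, and the yyyyMMddHH pattern is inferred from the example file name above.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper (not Bees-agent code): names the log file of the previous hour.
class HourlyFileNamer {
    // Pattern inferred from names like access.2021110820.log.
    private static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("yyyyMMddHH");

    // Builds "<prefix>.<previous hour>.<suffix>" relative to the given time.
    static String previousHourFile(String prefix, String suffix, LocalDateTime now) {
        return prefix + "." + HOUR.format(now.minusHours(1)) + "." + suffix;
    }
}
```

A job running at 21:05 on 2021‑11‑08 would therefore pick access.2021110820.log. Subtracting the hour on a LocalDateTime, rather than on the formatted string, keeps day and month rollovers correct.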
7. Log Cleanup Strategy
A shell script runs on each host and deletes files older than 6 hours. Because the agent keeps the file handle open, the file's data remains readable even after the directory entry is unlinked, so the agent can finish reading the remaining data.
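The open-handle behaviour can be demonstrated with a few lines of Java. This sketch assumes POSIX unlink semantics (Linux), and UnlinkedReadDemo is a hypothetical name. On platforms that refuse to delete open files, the delete call would fail instead.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo (not Bees-agent code): read a file after its path is removed.
class UnlinkedReadDemo {
    // Opens the file, deletes its directory entry, then reads it through the open handle.
    static String readAfterDelete(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            Files.delete(file); // what the cleanup script does: unlink the path
            byte[] data = new byte[(int) raf.length()];
            raf.readFully(data); // data stays reachable until the handle is closed
            return new String(data, StandardCharsets.UTF_8);
        }
    }
}
```

The disk space is released only when the last handle is closed, so an agent that stalls on a deleted file also delays space reclamation.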
8. Resource Control
CPU: Bind the Java process to specific cores with taskset (or Linux cset) to avoid interference with business workloads.
Memory: JVM heap defaults to 512 MB; the minimum viable setting is 64 MB.
Disk I/O: Throttle read rate (e.g., 3 MB/s or 5 MB/s) based on observed baseline.
Network: Agents report bandwidth usage to the bus; the bus can instruct agents to lower their sending rate when cross‑region thresholds are approached.
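Read-rate throttling can be approximated by pacing each read against a byte budget. The sketch below is one possible approach, not the agent's actual limiter: ReadThrottle and delayNanos are hypothetical names, and timestamps are passed in explicitly so the arithmetic is easy to verify.

```java
// Hypothetical read-rate limiter (not Bees-agent code).
class ReadThrottle {
    private final long bytesPerSecond;
    private long nextFreeNanos; // earliest time at which the next read is within budget

    ReadThrottle(long bytesPerSecond, long startNanos) {
        this.bytesPerSecond = bytesPerSecond;
        this.nextFreeNanos = startNanos;
    }

    // Call after reading `bytes` at time `nowNanos`; returns how long to pause
    // before the next read. Idle time is not banked, so no burst follows a quiet period.
    // Assumes `bytes` is a single buffer's worth, so the multiplication cannot overflow.
    long delayNanos(long bytes, long nowNanos) {
        long cost = bytes * 1_000_000_000L / bytesPerSecond;
        nextFreeNanos = Math.max(nextFreeNanos, nowNanos) + cost;
        return nextFreeNanos - nowNanos;
    }
}
```

In a read loop, the caller would pass System.nanoTime() and sleep for the returned duration, for example with TimeUnit.NANOSECONDS.sleep. At a 3 MB/s limit, a 3 MB read yields a pause of about one second.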
9. Self‑Monitoring & Platform Management
Agent logs are collected by a dedicated task and sent through the same pipeline (Kafka → Elasticsearch → Kibana). A lightweight HTTP heartbeat (default every 5 minutes) reports health and current tasks, and receives configuration updates (start/stop, rate limits, etc.). The management platform provides a centralized view, version tracking, and remote control of tens of thousands of agents.
Comparison with Open‑Source Agents
Memory footprint: Bees‑agent eliminates Flume’s channel, allowing a JVM heap as low as 64 MB (Flume typically requires > 256 MB).
Latency: inotify‑driven events keep end‑to‑end delay under 1 second, versus Flume’s polling‑based approach (often > 5 seconds).
File uniqueness: Inode + hash avoids duplicate collection; Flume’s filename‑only strategy can miss or double‑collect files.
Resource isolation: Separate threads per topic prevent interference; Flume shares a single channel.
Graceful shutdown: Platform‑issued stop command lets the agent exit without data loss; Flume may lose in‑flight data.
Metrics richness: Bees‑agent exports collection rate, progress, JVM stats, GC counts, and per‑topic metrics.
Customization: Built‑in keyword filtering, formatting, and centralized management.
Conclusion
The Bees‑agent design demonstrates a production‑grade log‑collection service that delivers sub‑second latency, low resource consumption, robust checkpointing, and integration with both streaming (Kafka) and batch (HDFS) pipelines. Its core techniques (glob‑based discovery, inotify with fallback polling, inode + hash identification, RandomAccessFile incremental reads, JSON checkpoint persistence, and a bus‑mediated data path) provide a reusable blueprint for large‑scale enterprises facing similar ingestion challenges.