Solving Edge Observability: How LoongCollector Ensures Reliable Data Collection
This article explains the three major challenges of collecting observability data on edge devices—unstable networks, reliable delivery, and bandwidth limits—and shows how LoongCollector’s persistent‑asynchronous architecture, smart back‑pressure, and configurable flow control provide a low‑resource, high‑reliability solution with real‑world performance results.
Background
With the rapid growth of cloud computing and IoT, many business scenarios push computation and data collection to the edge: smart‑manufacturing lines, in‑vehicle systems, retail terminals, and smart homes. These devices generate valuable logs, metrics, and traces that are essential for operations, fault diagnosis, and user‑experience optimization.
Three Major Challenges for Edge Data Collection
Challenge 1: Unstable Network Environment
Weak network: Mobile signal fluctuations, unstable Wi‑Fi, and high latency across regions cause low bandwidth and high packet loss.
Power supply not guaranteed: Many devices rely on batteries or may experience sudden power loss.
Severe resource constraints: Edge devices have limited CPU, memory, storage, and network bandwidth.
Challenge 2: Reliable Delivery of Observability Data
Data loss risk: Network interruptions, power outages, or process crashes can discard data.
Order guarantee: Time‑series data (metrics, traces) must be delivered in the order in which it was collected.
Challenge 3: Bandwidth Limitation
High traffic cost: 4G/5G data fees are far higher than data‑center dedicated lines.
Bandwidth competition: Collection traffic competes with business traffic for limited bandwidth.
Upload rate limits: Some networks impose strict upload caps.
LoongCollector Overview
LoongCollector is an open‑source, high‑performance, highly reliable observability data collector from Alibaba Cloud. It has been proven in Alibaba Cloud’s internal deployment of millions of instances and is specially optimized for edge scenarios.
Core Capabilities
Host monitoring: Real‑time collection of more than 100 system metrics covering CPU, memory, disk, and network.
Prometheus protocol: Full compatibility with the Prometheus ecosystem, supporting all Prometheus‑compatible applications.
Log collection: Efficient text‑log ingestion with multiple formats and parsers.
Ultra‑Low Resource Consumption
LoongCollector is heavily optimized for devices with scarce resources, allowing more collection tasks on the same hardware or stable operation on extremely constrained devices.
Solution Architecture: Persistence + Asynchronous Sending + Intelligent Retry
Data is first written to local files (persistence); a dedicated sender thread then reads those files and transmits the data (asynchronous sending). This decouples collection from network state, so data that has already been persisted survives power cuts and process crashes and can be sent once the network recovers.
Local Persistence
All metric data is stored in files. A fine‑grained checkpoint records the read offset of each file, so after a crash or power loss the collector resumes from the exact point without data loss.
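The checkpoint idea can be illustrated with a minimal Python sketch. It is not LoongCollector's implementation: the function names, the JSON checkpoint format, and the one-record-per-line layout are all assumptions made for illustration. The reader resumes from the stored offset and only consumes complete lines; the offset is committed atomically so a crash cannot leave a half-written checkpoint.

```python
import json
import os
import tempfile


def read_new_lines(data_path, checkpoint_path):
    """Return complete lines appended since the last checkpoint, plus the new offset."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]
    with open(data_path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    # Stop at the last newline so a half-written record is left for the next call.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8").split("\n")[:-1]
    return lines, offset + end


def commit_checkpoint(checkpoint_path, offset):
    """Persist the offset atomically: write a temp file, fsync it, then rename."""
    directory = os.path.dirname(os.path.abspath(checkpoint_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"offset": offset}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, checkpoint_path)
```

In this sketch the caller would invoke commit_checkpoint only after the lines have been sent successfully. A crash between sending and committing then causes a resend rather than a loss, which is an at-least-once trade-off chosen here for illustration.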
Asynchronous Consumption
The sender thread reads persisted files in order, guaranteeing that data is sent in the same chronological order it was collected. File rotation and sequence numbers ensure correct ordering across multiple files.
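A small Python sketch of ordered reading across rotated files follows. The file naming scheme (metric.<sequence>.log, with a higher number meaning a newer file) is hypothetical and chosen only for illustration; the article does not specify how LoongCollector names or sequences its files. The point shown is that files are ordered by numeric sequence, not by name, before their records are replayed.

```python
import os
import re
import tempfile

# Hypothetical naming scheme: metric.<sequence>.log, higher sequence = newer file.
SEQ_PATTERN = re.compile(r"^metric\.(\d+)\.log$")


def files_in_order(directory):
    """Return rotated files sorted by numeric sequence number, oldest first."""
    found = []
    for name in os.listdir(directory):
        match = SEQ_PATTERN.match(name)
        if match:
            found.append((int(match.group(1)), os.path.join(directory, name)))
    # Sort numerically: a plain string sort would place 10 before 2.
    return [path for _, path in sorted(found)]


def records_in_order(directory):
    """Yield every record file by file, preserving the order it was written in."""
    for path in files_in_order(directory):
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")
```

A single consumer iterating records_in_order sees records in write order, which is why a single sender thread per file set is the simplest way to keep the ordering guarantee.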
Smart Back‑Pressure and Flow Control
Queue back‑pressure: When the send queue reaches a threshold, file reading is paused to prevent memory explosion.
Traffic limiting: The max_bytes_per_sec parameter caps the outbound bandwidth, protecting business traffic.
Adaptive concurrency: Inspired by TCP congestion control, LoongCollector dynamically adjusts the number of concurrent senders based on network conditions, providing fast response, quick convergence, and automatic recovery.
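Two of the mechanisms above can be sketched in Python. The article does not describe LoongCollector's actual algorithms, so both classes are stand-ins: a token bucket is one common way to enforce a bytes-per-second cap such as max_bytes_per_sec, and additive-increase/multiplicative-decrease (AIMD) is the classic TCP-style rule for adjusting a concurrency limit. All class names, method names, and default bounds are assumptions.

```python
class TokenBucket:
    """Caps throughput at `rate` bytes per second, with bursts up to `capacity` bytes."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def try_consume(self, nbytes, now):
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # not enough budget yet; the caller should wait and retry


class AimdConcurrency:
    """Additive increase on success, multiplicative decrease on failure."""

    def __init__(self, minimum=1, maximum=8):
        self.minimum = minimum
        self.maximum = maximum
        self.limit = minimum

    def on_success(self):
        # Probe for more capacity one sender at a time.
        self.limit = min(self.maximum, self.limit + 1)

    def on_failure(self):
        # Back off quickly when the network degrades.
        self.limit = max(self.minimum, self.limit // 2)
```

With rate and capacity both set to 1048576, the bucket allows at most 1 MiB in any instant and refills at 1 MiB per second. The time source is passed in explicitly so the behaviour is easy to test; a real sender would pass time.monotonic().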
Configuration Examples
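A typical edge deployment combines a global agent configuration with two collection pipelines, one with a host‑monitor input and one with a Prometheus input, each flushed to a local file. If the sizes are in bytes and the timeouts in seconds, as the values suggest, max_bytes_per_sec of 1048576 is 1 MiB/s, the two timeouts of 604800 are 7 days, and the MaxFileSize values of 104857600 and 524288000 are 100 MiB and 500 MiB. With MaxFiles set to 10, the two flushers can therefore occupy up to roughly 1,000 MiB and 5,000 MiB of local storage.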
Global agent configuration:

{
    "discard_old_data": false,
    "config_server_lost_connection_timeout": 604800,
    "force_quit_read_timeout": 604800,
    "max_bytes_per_sec": 1048576,
    "cpu_usage_limit": 0.4,
    "mem_usage_limit": 384,
    "working_ip": "192.168.0.1"
}

Host‑monitoring pipeline:

enable: true
inputs:
  - Type: input_host_monitor
    Interval: 15
flushers:
  - Type: flusher_file
    MaxFileSize: 104857600
    MaxFiles: 10
    FilePath: /usr/local/ilogtail/metrics/host.log

Prometheus scrape pipeline:

enable: true
inputs:
  - Type: input_prometheus
    ScrapeConfig:
      job_name: node
      host_only_mode: true
      scrape_interval: 15s
      scrape_timeout: 10s
      static_configs:
        - targets: ["localhost:12345"]
flushers:
  - Type: flusher_file
    MaxFileSize: 524288000
    MaxFiles: 10
    FilePath: /usr/local/ilogtail/metrics/metric.log

Performance Test Results
On a representative edge device, LoongCollector exhibits minimal resource usage while staying within the configured bandwidth limit.
CPU: average 0.02 cores, peak 0.028 cores.
Memory: average 31.5 MB, peak 35 MB.
Network (after compression): average 1.07 KB/s, peak 1.10 KB/s (raw data before back‑pressure was ~13 KB/s).
Conclusion and Outlook
LoongCollector effectively tackles edge‑observability challenges by guaranteeing reliable data delivery, providing local persistence, decoupling collection from sending, and implementing intelligent back‑pressure and flow control. Nevertheless, further improvements are planned:
Simplify pipeline configuration by integrating persistence directly into a single pipeline.
Add support for Alibaba Cloud STS temporary credentials to avoid AccessKey leakage.
Explore more aggressive compression algorithms to further reduce traffic costs.
Alibaba Cloud Observability
Driving continuous progress in observability technology!
