
Solving Edge Observability: How LoongCollector Ensures Reliable Data Collection

This article explains the three major challenges of collecting observability data on edge devices—unstable networks, reliable delivery, and bandwidth limits—and shows how LoongCollector’s persistent‑asynchronous architecture, smart back‑pressure, and configurable flow control provide a low‑resource, high‑reliability solution with real‑world performance results.

Alibaba Cloud Observability

Background

With the rapid growth of cloud computing and IoT, many business scenarios push computation and data collection to the edge, such as smart‑manufacturing lines, in‑vehicle systems, retail terminals, and smart homes. These devices generate valuable logs, metrics, and traces that are essential for operation, fault diagnosis, and user‑experience optimization.

Three Major Challenges for Edge Data Collection

Challenge 1: Unstable Network Environment

Weak network: Mobile signal fluctuations, unstable Wi‑Fi, and high latency across regions cause low bandwidth and high packet loss.

Power supply not guaranteed: Many devices rely on batteries or may experience sudden power loss.

Severe resource constraints: Edge devices have limited CPU, memory, storage, and network bandwidth.

Challenge 2: Reliable Delivery of Observability Data

Data loss risk: Network interruptions, power outages, or process crashes can discard data.

Order guarantee: Time‑series data (metrics, traces) must preserve the collection order.

Challenge 3: Bandwidth Limitation

High traffic cost: 4G/5G data fees are far higher than data‑center dedicated lines.

Bandwidth competition: Collection traffic competes with business traffic for limited bandwidth.

Upload rate limits: Some networks impose strict upload caps.

LoongCollector Overview

LoongCollector is an open‑source, high‑performance, highly reliable observability data collector from Alibaba Cloud. It has been proven in Alibaba Cloud’s internal deployment of millions of instances and is specially optimized for edge scenarios.

Core Capabilities

Host monitoring: Real-time collection of more than 100 system metrics, including CPU, memory, disk, and network.

Prometheus protocol: Full compatibility with the Prometheus ecosystem, supporting all Prometheus‑compatible applications.

Log collection: Efficient text‑log ingestion with multiple formats and parsers.

Ultra‑Low Resource Consumption

LoongCollector is heavily optimized for devices with scarce resources, allowing more collection tasks on the same hardware or stable operation on extremely constrained devices.

LoongCollector architecture diagram

Solution Architecture: Persistence + Asynchronous Sending + Intelligent Retry

Data is first written to local files (persistence); a dedicated sender thread then reads those files and transmits the data (asynchronous sending). Because collection no longer depends on network state, data already written to disk survives power cuts and process crashes.

Local Persistence

All metric data is stored in files. A fine‑grained checkpoint records the read offset of each file, so after a crash or power loss the collector resumes from the exact point without data loss.
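
The checkpoint idea can be illustrated with a minimal sketch. This is not LoongCollector's implementation: the function names, the JSON checkpoint layout, and the line-oriented file format are all assumptions made for illustration. The reader resumes from the last committed offset and commits a new offset only after a line has been handed to the sender.

```python
import json
import os
import tempfile


def load_offset(checkpoint_path):
    """Return the last committed read offset, or 0 if no checkpoint exists."""
    try:
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            return json.load(f)["offset"]
    except FileNotFoundError:
        return 0


def save_offset(checkpoint_path, offset):
    """Persist the offset atomically: write a temp file, fsync, then rename."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump({"offset": offset}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, checkpoint_path)


def read_new_lines(data_path, checkpoint_path, send):
    """Send every complete line after the checkpoint; commit after each send."""
    offset = load_offset(checkpoint_path)
    sent = 0
    with open(data_path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file or a partially written line: retry later
            send(line)
            offset = f.tell()
            save_offset(checkpoint_path, offset)
            sent += 1
    return sent


def demo():
    """Simulate a restart: the second call must not re-read m1 and m2."""
    with tempfile.TemporaryDirectory() as d:
        data = os.path.join(d, "host.log")
        ckpt = os.path.join(d, "host.ckpt")
        received = []
        with open(data, "wb") as f:
            f.write(b"m1\nm2\n")
        read_new_lines(data, ckpt, received.append)
        with open(data, "ab") as f:
            f.write(b"m3\n")
        read_new_lines(data, ckpt, received.append)
        return received
```

Committing the offset after the send gives at-least-once behavior in this sketch: a crash between `send` and `save_offset` would cause one line to be sent again on restart, which is usually preferable to losing it.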

Asynchronous Consumption

The sender thread reads persisted files in order, guaranteeing that data is sent in the same chronological order it was collected. File rotation and sequence numbers ensure correct ordering across multiple files.
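
As a rough illustration of ordering across rotated files, the sketch below sorts file names by a numeric sequence suffix. The naming scheme (`<base>.<seq>`, with smaller numbers being older) is a hypothetical convention chosen for this example; the article does not specify how LoongCollector names rotated files.

```python
import re

# Hypothetical naming scheme: "<base>.<seq>", e.g. "metric.log.3".
_SEQ_RE = re.compile(r"^(?P<base>.+)\.(?P<seq>\d+)$")


def order_rotated_files(names, base):
    """Return the rotated files belonging to `base`, oldest first."""
    matched = []
    for name in names:
        m = _SEQ_RE.match(name)
        if m and m.group("base") == base:
            matched.append((int(m.group("seq")), name))
    # Sort numerically so that "metric.log.10" comes after "metric.log.9".
    return [name for _, name in sorted(matched)]
```

Sorting on the parsed integer rather than the raw string matters here, because a plain string sort would place sequence 10 before sequence 2 and break the chronological order.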

Smart Back‑Pressure and Flow Control

Queue back-pressure: When the send queue reaches a threshold, file reading is paused so that memory usage stays bounded.

Traffic limiting: The max_bytes_per_sec parameter caps the outbound bandwidth, protecting business traffic.

Adaptive concurrency: Inspired by TCP congestion control, LoongCollector dynamically adjusts the number of concurrent senders based on network conditions, providing fast response, quick convergence, and automatic recovery.
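
The traffic-limiting and adaptive-concurrency mechanisms above can be sketched as follows. Both classes are illustrative assumptions, not LoongCollector code: the article does not say how max_bytes_per_sec is enforced, so a token bucket is used here as one common option, and the concurrency controller uses additive increase and multiplicative decrease (AIMD), the classic scheme from TCP congestion control. Time is passed in explicitly so the behavior is deterministic; a real caller would pass `time.monotonic()`.

```python
class TokenBucket:
    """Byte-rate limiter: `rate` bytes per second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start with a full bucket
        self.last = now

    def try_consume(self, nbytes, now):
        """Refill for the elapsed time, then spend tokens if enough remain."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # caller should wait and retry instead of sending


class AimdConcurrency:
    """Additive-increase / multiplicative-decrease cap on concurrent senders."""

    def __init__(self, minimum=1, maximum=16):
        self.minimum = minimum
        self.maximum = maximum
        self.limit = minimum  # start conservatively

    def on_success(self):
        """A send succeeded: probe for more capacity one step at a time."""
        self.limit = min(self.maximum, self.limit + 1)

    def on_failure(self):
        """A send failed or timed out: halve the limit to back off quickly."""
        self.limit = max(self.minimum, self.limit // 2)
```

Halving on failure sheds load quickly when the network degrades, while the one-step increase lets the limit climb back once sends succeed again.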

Back‑pressure and flow control diagram

Configuration Examples

A typical edge deployment includes a host‑monitor input and a Prometheus input, each flushed to a local file.

Global agent settings (JSON):

{
  "discard_old_data": false,
  "config_server_lost_connection_timeout": 604800,
  "force_quit_read_timeout": 604800,
  "max_bytes_per_sec": 1048576,
  "cpu_usage_limit": 0.4,
  "mem_usage_limit": 384,
  "working_ip": "192.168.0.1"
}

Pipeline 1, host monitoring (YAML):

enable: true
inputs:
  - Type: input_host_monitor
    Interval: 15
flushers:
  - Type: flusher_file
    MaxFileSize: 104857600
    MaxFiles: 10
    FilePath: /usr/local/ilogtail/metrics/host.log

Pipeline 2, Prometheus scraping (YAML):

enable: true
inputs:
  - Type: input_prometheus
    ScrapeConfig:
      job_name: node
      host_only_mode: true
      scrape_interval: 15s
      scrape_timeout: 10s
      static_configs:
        - targets: ["localhost:12345"]
flushers:
  - Type: flusher_file
    MaxFileSize: 524288000
    MaxFiles: 10
    FilePath: /usr/local/ilogtail/metrics/metric.log
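
To make the raw numbers in these settings easier to read, the short calculation below converts them into friendlier units. It rests on assumptions the article does not state explicitly: that the two timeouts are in seconds, that max_bytes_per_sec and MaxFileSize are in bytes, and that MaxFileSize multiplied by MaxFiles is a reasonable upper bound on the disk space each flusher uses.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MIB = 1024 * 1024


def config_summary():
    """Convert the example configuration values into readable units."""
    host_budget_mib = 104857600 * 10 / MIB   # host.log: MaxFileSize * MaxFiles
    prom_budget_mib = 524288000 * 10 / MIB   # metric.log: MaxFileSize * MaxFiles
    return {
        "timeout_days": 604800 / SECONDS_PER_DAY,
        "rate_limit_mib_per_s": 1048576 / MIB,
        "host_disk_budget_mib": host_budget_mib,
        "prometheus_disk_budget_mib": prom_budget_mib,
        "total_disk_budget_mib": host_budget_mib + prom_budget_mib,
    }
```

Under those assumptions, the timeouts work out to 7 days, the bandwidth cap to 1 MiB/s, and the combined on-disk buffer to roughly 6000 MiB, which is worth checking against the free storage on the target device.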

Performance Test Results

On a representative edge device, LoongCollector exhibits minimal resource usage while staying within the configured bandwidth limit.

CPU: average 0.02 core, peak 0.028 core.

Memory: average 31.5 MB, peak 35 MB.

Network (after compression): average 1.07 KB/s, peak 1.10 KB/s (raw data before compression was ~13 KB/s).
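
As a quick worked example of what these network figures imply, the function below derives the reduction factor and a projected 30-day volume from the roughly 13 KB/s raw rate and the 1.07 KB/s average rate actually sent. The 30-day window and the assumption of a constant rate are illustrative choices, not measurements from the article.

```python
def traffic_summary(raw_kb_s=13.0, sent_kb_s=1.07, days=30):
    """Derive reduction and projected volume from the measured average rates."""
    seconds = days * 24 * 60 * 60
    return {
        "reduction_factor": raw_kb_s / sent_kb_s,    # about 12x smaller
        "saved_fraction": 1 - sent_kb_s / raw_kb_s,  # about 92% of traffic saved
        "sent_kb": sent_kb_s * seconds,              # volume actually uploaded
        "raw_kb": raw_kb_s * seconds,                # volume without reduction
    }
```

At these rates the device would upload about 2.8 million KB over 30 days instead of about 33.7 million KB, which is the kind of difference that matters on metered 4G/5G links.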

CPU and memory usage chart
Network traffic before and after compression

Conclusion and Outlook

LoongCollector addresses the challenges of edge observability by persisting data locally, decoupling collection from sending, and applying intelligent back-pressure and flow control, which together make data delivery reliable. Further improvements are planned:

Simplify pipeline configuration by integrating persistence directly into a single pipeline.

Add support for Alibaba Cloud STS temporary credentials to avoid AccessKey leakage.

Explore more aggressive compression algorithms to further reduce traffic costs.

Future work illustration