
How LoongCollector Delivers 10× Throughput and 80% Resource Savings in Cloud‑Native Observability

LoongCollector, the open-source cloud-native collector behind Alibaba Cloud's Simple Log Service, delivers roughly ten-fold higher throughput and up to 80% lower CPU and memory usage through near-linear scaling, zero-copy processing, lock-free event pools, and adaptive concurrency, while guaranteeing enterprise-grade reliability for petabyte-scale log and metric ingestion.


Background

Alibaba Cloud has operated large-scale cloud services for over a decade. To handle the explosion of observability data from AI workloads, container orchestration, and multi-region deployments, the in-house collector evolved from iLogtail into LoongCollector. The goal is a collector that can ingest petabyte-level data per day with high throughput, low resource consumption, and strong reliability.

Performance Benchmark

A reproducible benchmark (see the benchmark e2e README in the LoongCollector GitHub repository) was executed on an Alibaba Cloud ECS g7 instance (32 vCPU, 64 GB RAM, Ubuntu 20.04, ESSD PL3 1500 GiB). LoongCollector achieved a maximum throughput of 546 MB/s for single-line logs, far exceeding Fluent Bit (36 MB/s), Vector (38 MB/s), and Filebeat (9 MB/s). Under a steady 10 MB/s workload, CPU usage was only 3.40% (versus 12.29% for Fluent Bit) and memory consumption dropped by up to 80%. The collector scales near-linearly with CPU cores and maintains stable latency at peak load.

Log Type       LoongCollector   Fluent Bit   Vector    Filebeat
Single line    546 MB/s         36 MB/s      38 MB/s   9 MB/s
Multi line     238 MB/s         24 MB/s      22 MB/s   6 MB/s
Regex parse    68 MB/s          19 MB/s      12 MB/s   not supported

Key Architectural Optimizations

1. Zero‑Copy Memory Handling

Traditional collectors copy log strings multiple times during parsing, causing CPU overhead and memory fragmentation. LoongCollector stores the raw log once in a shared SourceBuffer and uses string_view to reference slices. This eliminates per‑field copies (four copies per event become zero), reduces CPU usage by ~15 % and cuts memory allocation by ~80 %.
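The idea can be sketched in a few lines of C++: the raw line lives once in a shared buffer and each parsed field is a `std::string_view` slice into it. The type names (`SourceBuffer`, `LogEvent`) mirror the article but this is an illustrative sketch, not LoongCollector's actual code.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <string_view>
#include <vector>

// The raw log line is stored exactly once.
struct SourceBuffer {
    std::string raw;
};

struct LogEvent {
    std::shared_ptr<SourceBuffer> buffer;  // keeps the backing memory alive
    std::vector<std::string_view> fields;  // zero-copy slices into buffer->raw
};

// Split on a delimiter without copying any field bytes.
LogEvent ParseDelimited(std::shared_ptr<SourceBuffer> buf, char delim) {
    LogEvent ev{std::move(buf), {}};
    std::string_view rest(ev.buffer->raw);
    while (!rest.empty()) {
        size_t pos = rest.find(delim);
        ev.fields.push_back(rest.substr(0, pos));  // a view, not a copy
        if (pos == std::string_view::npos) break;
        rest.remove_prefix(pos + 1);
    }
    return ev;
}
```

Because every field is a view, parsing allocates nothing beyond the field vector, which is what eliminates the per-field copies the paragraph describes.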

2. Lock‑Free Event Pools

Instead of allocating a new PipelineEvent for each log entry, LoongCollector reuses objects via a thread‑aware lock‑free pool. Each processing thread has its own pool for direct reuse; a double‑buffer pool handles cross‑thread transfers, reducing synchronization overhead and object allocation by ~90 %.

┌──────────────────┐
│ Processor Thread │─────[Lock‑free Pool]─────Direct Reuse
└──────────────────┘

┌────────────────┐   ┌─────────────────┐
│ Input Thread   │──▶│ Processor Thread│
└────────────────┘   └─────────────────┘
          │               │
          └──[Double Buffer Pool]──┘
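The pool in the diagram can be sketched as follows. The real implementation is lock-free; in this simplified sketch the thread-local freelist needs no synchronization at all, and the cross-thread "double buffer" is reduced to a mutex-guarded swap for brevity. All names here are illustrative.

```cpp
#include <mutex>
#include <string>
#include <vector>

struct PipelineEvent {
    std::string content;
    void Reset() { content.clear(); }
};

class EventPool {
public:
    PipelineEvent* Acquire() {
        if (local_.empty()) Refill();
        if (local_.empty()) return new PipelineEvent();  // pool exhausted: allocate
        PipelineEvent* ev = local_.back();
        local_.pop_back();
        return ev;
    }

    // Same-thread release: direct reuse, no synchronization needed.
    void Release(PipelineEvent* ev) {
        ev->Reset();
        local_.push_back(ev);
    }

    // Cross-thread release goes into the second buffer.
    void ReleaseFromOtherThread(PipelineEvent* ev) {
        ev->Reset();
        std::lock_guard<std::mutex> g(mu_);
        foreign_.push_back(ev);
    }

private:
    // Double-buffer style: take the whole foreign batch in one swap,
    // so the owner thread synchronizes once per batch, not per event.
    void Refill() {
        std::lock_guard<std::mutex> g(mu_);
        local_.swap(foreign_);
    }

    std::vector<PipelineEvent*> local_;    // owned by the processor thread
    std::vector<PipelineEvent*> foreign_;  // filled by other threads
    std::mutex mu_;
};
```

The batch swap is what makes the cross-thread path cheap: contention is amortized over the whole buffer rather than paid per event.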

3. Zero‑Copy Serialization

Most collectors build an intermediate protobuf object before serialization, incurring extra copies. LoongCollector writes directly in protobuf wire format from the PipelineEventGroup, bypassing intermediate objects. This reduces serialization CPU cost by 54 % and memory copies by 67 %.
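What "writing directly in wire format" means can be shown with a minimal encoder: a varint writer plus a length-delimited field writer is enough to emit bytes straight from event data, with no intermediate message object. The field numbers below are hypothetical; the real serializer covers the full PipelineEventGroup schema.

```cpp
#include <cstdint>
#include <string>
#include <string_view>

// Append v to out as a protobuf base-128 varint.
void PutVarint(std::string& out, uint64_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<char>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<char>(v));
}

// Append a length-delimited field: tag = (field_number << 3) | wire type 2.
// The value is a string_view, so the event's bytes flow straight to the
// output buffer without an intermediate message object.
void PutStringField(std::string& out, uint32_t field, std::string_view value) {
    PutVarint(out, (static_cast<uint64_t>(field) << 3) | 2);
    PutVarint(out, value.size());
    out.append(value.data(), value.size());
}
```

Combined with the `string_view` fields from the zero-copy parser, each field's bytes move exactly once: from the source buffer into the outgoing payload.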

4. Reliability Architecture

LoongCollector introduces per‑pipeline bounded queues with a high‑low watermark feedback system. When a queue exceeds the high watermark, upstream producers receive back‑pressure; when the size falls below the low watermark, flow resumes. This prevents a slow pipeline from blocking others. A priority‑aware round‑robin scheduler guarantees that high‑priority pipelines are always served first, while lower‑priority pipelines share remaining capacity fairly.

High‑Low Watermark Feedback System
 ┌─ Queue State Management ───┐   ┌─ Feedback Mechanism ───┐
 │ Normal (size < low):       │   │ Upstream checks before │
 │ accept all data            │   │ every write            │
 └────────────────────────────┘   └────────────────────────┘
          │
          ▼ (size ≥ high)  stop accepting non‑urgent data
          ▼ (size ≤ low)   resume accepting data
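The watermark state machine above fits in a small class. This is a single-threaded sketch for clarity; the real queues are per-pipeline and concurrent.

```cpp
#include <cstddef>
#include <deque>
#include <string>

class BoundedQueue {
public:
    BoundedQueue(size_t low, size_t high) : low_(low), high_(high) {}

    // Upstream producers call this before writing (the feedback mechanism).
    bool CanPush() const { return accepting_; }

    bool Push(std::string item) {
        if (!accepting_) return false;               // back-pressure to producer
        q_.push_back(std::move(item));
        if (q_.size() >= high_) accepting_ = false;  // crossed the high watermark
        return true;
    }

    bool Pop(std::string& out) {
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop_front();
        if (q_.size() <= low_) accepting_ = true;    // fell to the low watermark
        return true;
    }

private:
    std::deque<std::string> q_;
    size_t low_, high_;
    bool accepting_ = true;
};
```

The gap between the two watermarks is deliberate: flow only resumes once the queue has drained well below the point where it stopped, which prevents the producer from oscillating on and off at the boundary.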

Priority‑aware round‑robin example:

High Priority
 ──► Pipeline1 (always first)

Medium Priority (round‑robin)
 ──► Pipeline2 → Pipeline3 → Pipeline4 → ...

Low Priority (round‑robin)
 ──► Pipeline5 → Pipeline6 → ...
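The selection rule in the example can be sketched as a scan over priority classes ordered high to low: the highest class with pending pipelines always wins, and within a class a cursor rotates for fairness. Names and structure here are illustrative, not LoongCollector's scheduler API.

```cpp
#include <string>
#include <vector>

struct PriorityClass {
    std::vector<std::string> pipelines;  // pipelines with pending data
    size_t next = 0;                     // round-robin cursor within this class
};

// Classes are ordered high -> low; returns the pipeline to serve next,
// or "" if nothing is pending anywhere.
std::string PickNext(std::vector<PriorityClass>& classes) {
    for (auto& c : classes) {
        if (c.pipelines.empty()) continue;  // fall through to lower priority
        std::string chosen = c.pipelines[c.next % c.pipelines.size()];
        c.next = (c.next + 1) % c.pipelines.size();
        return chosen;
    }
    return "";
}
```

High-priority pipelines are thus served unconditionally whenever they have data, while lower classes only receive the capacity left over, shared round-robin.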

5. Adaptive Concurrency Limiter (AIMD)

Each destination endpoint has an AIMD‑style concurrency limiter. Failure‑rate thresholds adjust concurrency multiplicatively: 0‑10 % → maintain, 10‑40 % → multiply by 0.8 (slow decrease), >40 % → multiply by 0.5 (fast decrease). Successful windows trigger additive increase (+1) until the maximum is restored, providing rapid “stop‑bleeding” and graceful recovery.

Concurrency Limiter
 ├─ No fallback (0‑10 %): keep max concurrency
 ├─ Slow fallback (10‑40 %): concurrency ×0.8
 └─ Fast fallback (40‑100 %): concurrency ×0.5
Additive increase on 100 % success → +1 per window
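The AIMD rules above translate directly into a small limiter. This sketch uses the article's thresholds; window accounting and the per-endpoint wiring are omitted.

```cpp
#include <algorithm>
#include <cstddef>

class ConcurrencyLimiter {
public:
    explicit ConcurrencyLimiter(size_t max) : max_(max), cur_(max) {}

    // Called once per observed window with that window's failure rate.
    void OnWindow(double failure_rate) {
        if (failure_rate > 0.40) {
            // Fast fallback: halve concurrency to stop the bleeding.
            cur_ = std::max<size_t>(1, static_cast<size_t>(cur_ * 0.5));
        } else if (failure_rate > 0.10) {
            // Slow fallback: multiplicative decrease by 0.8.
            cur_ = std::max<size_t>(1, static_cast<size_t>(cur_ * 0.8));
        } else if (failure_rate == 0.0) {
            // Fully successful window: additive increase toward the maximum.
            cur_ = std::min(max_, cur_ + 1);
        }
        // 0-10% with some failures: hold the current limit.
    }

    size_t limit() const { return cur_; }

private:
    size_t max_;
    size_t cur_;
};
```

The asymmetry is the point: decreases are multiplicative so an unhealthy destination is backed off within a few windows, while increases are additive so recovery probes capacity gently.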

Production Validation

LoongCollector is a core component of Alibaba Cloud Simple Log Service (SLS). It runs on millions of production instances, ingesting hundreds of petabytes of logs, metrics and traces per day across more than 50 regions. Real‑world stress tests show:

Scalability: deployed successfully in clusters with more than 1,000,000 instances; a single node can run 2,000+ concurrent pipelines with millisecond latency.

Network Resilience: zero‑data‑loss guarantees with up to 6‑hour buffering; adaptive concurrency limits isolate failures to the affected destination.

Chaos Engineering: the system survives random pipeline failures, 10× traffic spikes, and CPU/Memory/IO saturation near 90 % without service degradation.

Technical Comparison with Open‑Source Collectors

The benchmark compares LoongCollector with Fluent Bit, Vector and Filebeat on the same hardware. For a 10 MB/s single‑line workload:

Metric              LoongCollector          Fluent Bit   Vector     Filebeat
CPU usage           3.40 %                  12.29 %      35.80 %    83.24 %
Memory              29.01 MB                46.84 MB     83.24 MB   insufficient performance
Serialization CPU   5.8 %                   12.5 %       12.5 %     N/A
Memory copies       1× (zero intermediate)  3×           4×         N/A

Key Takeaways

≈10× higher maximum throughput enables the same hardware to process far more data.

≈80 % reduction in CPU and memory usage translates directly into cost savings.

Near‑linear scaling simplifies capacity planning.

Reliability mechanisms (watermark feedback, priority scheduling, AIMD) ensure zero data loss even under network congestion.

Native multi‑protocol support (logs, metrics, traces, eBPF) is achieved without sacrificing performance.

References

LoongCollector benchmark scripts: https://github.com/alibaba/loongcollector/blob/main/test/benchmark/e2e/README.md

Alibaba Cloud Simple Log Service product page: https://www.alibabacloud.com/en/product/log-service
Written by Alibaba Cloud Observability