
How iLogtail Achieves Million‑Scale Observability with SRE Best Practices

This article explains how iLogtail, Alibaba Cloud's high‑performance observability agent, tackles reliability challenges at million‑scale deployments through a comprehensive SRE workflow that spans design, development, testing, gray‑release, operations, and continuous customer support, all while leveraging cloud‑native tools and automation.

Alibaba Cloud Observability

Based on a public live broadcast by iLogtail PMC member Yu Tao on June 26, 2024, the article outlines the background of reliability engineering in today’s competitive cloud market, where cost reduction and quality become decisive factors.

Reliability Challenges of a Million‑Scale Agent

iLogtail, a lightweight, high‑performance data‑collection agent used across Alibaba’s core products (Taobao, Tmall, Alipay, etc.), has reached tens of millions of installations and processes dozens of petabytes of data daily. Its stability is critical because any failure directly impacts key business services.

Specific Stability Challenges

Rapid business iteration and diverse environments (new data sources, outputs, processing methods, and environment adaptations).

Weak client‑side control (version divergence, customer environment factors, operator errors).

Massive deployment scale (millions of instances across business groups, so a single release can affect large regions).

Reliability Goals for the Agent

Data integrity: ensure complete data capture even after transient failures.

Reliability: the agent must keep running under high load or abnormal conditions without crashing.

Performance impact: resource usage must stay within acceptable limits.

Real‑time: timely data processing for near‑real‑time analysis.

Reliability Engineering Elements & Methodology

The team emphasizes four pillars: awareness, standards, technology, and mechanisms. Awareness cultivates a quality‑first mindset; standards provide best‑practice guidelines; technology (automation, CI/CD, monitoring tools) enforces those standards; mechanisms (organizational, incentive, and governance structures) ensure consistent execution.

Testing Strategy

Quality assurance spans five phases: design, development, testing, release, and operations.

Design Phase

iLogtail adopts a plugin‑based architecture and a programmable SPL language, allowing users to compose processing pipelines from reusable plugins or write custom logic, reducing the need for frequent code changes.
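
The idea of composing a pipeline from reusable plugins can be sketched as follows. This is a minimal illustration, not iLogtail's actual plugin API: the registry, the plugin names (`split_kv`, `drop_if`), and the configuration layout are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Event = Dict[str, str]
# A processor returns the (possibly modified) event, or None to drop it.
Processor = Callable[[Event], Optional[Event]]

# Hypothetical registry: plugin name -> factory that builds a processor.
REGISTRY: Dict[str, Callable[..., Processor]] = {}


def register(name: str):
    """Decorator that records a plugin factory under a name."""
    def decorator(factory):
        REGISTRY[name] = factory
        return factory
    return decorator


@register("split_kv")
def split_kv(source: str, sep: str = "=") -> Processor:
    """Split 'k=v k=v' text in the source field into separate fields."""
    def run(event: Event) -> Optional[Event]:
        out = dict(event)
        for pair in out.pop(source, "").split():
            key, found, value = pair.partition(sep)
            if found:
                out[key] = value
        return out
    return run


@register("drop_if")
def drop_if(field: str, equals: str) -> Processor:
    """Drop events whose field matches the given value."""
    def run(event: Event) -> Optional[Event]:
        return None if event.get(field) == equals else event
    return run


def build_pipeline(config: List[dict]) -> Callable[[List[Event]], List[Event]]:
    """Instantiate the configured plugins and chain them in order."""
    stages = [REGISTRY[c["type"]](**c.get("args", {})) for c in config]

    def run(events: List[Event]) -> List[Event]:
        result = []
        for event in events:
            current: Optional[Event] = event
            for stage in stages:
                current = stage(current)
                if current is None:
                    break  # event was dropped by this stage
            if current is not None:
                result.append(current)
        return result
    return run


# The pipeline is described by configuration, not by new code.
config = [
    {"type": "split_kv", "args": {"source": "content"}},
    {"type": "drop_if", "args": {"field": "level", "equals": "debug"}},
]
pipeline = build_pipeline(config)
events = [
    {"content": "level=error msg=timeout"},
    {"content": "level=debug msg=tick"},
]
processed = pipeline(events)
```

Changing the processing logic here means editing `config`, not the agent itself, which is the property the plugin design aims for.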

Development Phase

Development runs on two parallel tracks: open‑source branches carry feature development, while commercial branches prioritize stability. Coding standards follow the Google C++ Style Guide and Effective Go, enforced by containerized development environments with clang‑format and gofmt.

Testing Phase

Given the agent’s diverse runtime environments, the team employs static analysis (Coverity, golangci‑lint), unit tests (using orthogonal array testing to cover parameter combinations), functional tests, compatibility tests, regression tests, and performance tests. An in‑house E2E framework orchestrates environment setup (Docker, Alibaba Cloud ECS/ACK), fault injection (ChaosBlade), data generation, verification, and cleanup, dramatically improving test efficiency.
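
To illustrate how orthogonal array testing reduces the number of parameter combinations, the sketch below uses the standard L4(2^3) array for three two-level factors. The factor names and values are hypothetical, chosen only for illustration; they are not taken from iLogtail's test suite.

```python
from itertools import combinations, product

# Standard L4(2^3) orthogonal array: 4 runs, 3 factors, 2 levels each.
L4 = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

# Hypothetical agent parameters, for illustration only.
FACTORS = {
    "encoding": ["utf8", "gbk"],
    "multiline": [False, True],
    "compression": ["none", "lz4"],
}


def build_cases(array, factors):
    """Map each row of level indexes to concrete parameter values."""
    names = list(factors)
    return [
        {name: factors[name][level] for name, level in zip(names, row)}
        for row in array
    ]


def covers_all_pairs(cases, factors):
    """Check that every value pair of every two factors appears in some case."""
    for a, b in combinations(factors, 2):
        seen = {(case[a], case[b]) for case in cases}
        if seen != set(product(factors[a], factors[b])):
            return False
    return True


cases = build_cases(L4, FACTORS)
full = list(product(*FACTORS.values()))

print(f"{len(cases)} orthogonal cases instead of {len(full)} exhaustive ones")
print("all pairs covered:", covers_all_pairs(cases, FACTORS))
```

Four cases cover every pairwise combination that the eight exhaustive cases would, and the saving grows quickly as factors and levels are added.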

Release Phase

Gray‑release is used to limit impact, with dimensions such as cluster, user, region, and tier. Continuous monitoring of resource usage, restart frequency, and data volume determines whether the release proceeds to the next stage.
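
A gate of this kind can be sketched as a comparison between a baseline batch and the newly upgraded batch. The metric names and thresholds below are assumptions made for the example, not values from the iLogtail release process.

```python
from dataclasses import dataclass


@dataclass
class BatchMetrics:
    cpu_percent: float        # average CPU usage of the agent
    memory_mb: float          # average resident memory
    restarts_per_hour: float  # restart frequency
    bytes_collected: float    # data volume over the observation window


@dataclass
class Thresholds:
    # Hypothetical limits; real values depend on the service.
    max_cpu_increase: float = 0.10
    max_memory_increase: float = 0.10
    max_restarts_per_hour: float = 0.0
    max_volume_drop: float = 0.05


def relative_change(new: float, old: float) -> float:
    return (new - old) / old if old else 0.0


def gate(baseline: BatchMetrics, canary: BatchMetrics, limits: Thresholds = None):
    """Return (proceed, reasons). Any violated limit blocks the next stage."""
    limits = limits or Thresholds()
    reasons = []
    if relative_change(canary.cpu_percent, baseline.cpu_percent) > limits.max_cpu_increase:
        reasons.append("cpu usage increased")
    if relative_change(canary.memory_mb, baseline.memory_mb) > limits.max_memory_increase:
        reasons.append("memory usage increased")
    if canary.restarts_per_hour > limits.max_restarts_per_hour:
        reasons.append("agent restarts detected")
    if relative_change(canary.bytes_collected, baseline.bytes_collected) < -limits.max_volume_drop:
        reasons.append("data volume dropped")
    return (not reasons, reasons)


baseline = BatchMetrics(cpu_percent=2.0, memory_mb=40.0, restarts_per_hour=0.0, bytes_collected=1000.0)
healthy = BatchMetrics(cpu_percent=2.1, memory_mb=41.0, restarts_per_hour=0.0, bytes_collected=990.0)
broken = BatchMetrics(cpu_percent=2.0, memory_mb=40.0, restarts_per_hour=3.0, bytes_collected=800.0)

print(gate(baseline, healthy))
print(gate(baseline, broken))
```

Returning the reasons alongside the decision keeps a blocked rollout explainable to whoever has to decide between a fix and a rollback.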

Operations Phase

SLA/SLO/SLI metrics are defined and visualized through Alibaba Cloud SLS, forming an intelligent observability platform that aggregates logs, traces, and metrics for proactive incident detection.
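
As a rough illustration of how an SLI relates to an SLO, the sketch below computes the remaining error budget for a success-ratio indicator. The indicator (successful heartbeats) and the 99.9% target are assumptions for the example; the source does not state the actual metrics or targets.

```python
def sli(good: int, total: int) -> float:
    """Success ratio; an empty window counts as fully healthy."""
    return good / total if total else 1.0


def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget still unused (negative when overspent)."""
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        return 0.0 if actual_bad else 1.0
    return 1.0 - actual_bad / allowed_bad


# Hypothetical window: one million heartbeats, 500 of them failed.
total_heartbeats = 1_000_000
good_heartbeats = 999_500
target = 0.999  # assumed SLO of 99.9%

current_sli = sli(good_heartbeats, total_heartbeats)
remaining = error_budget_remaining(good_heartbeats, total_heartbeats, target)
print(f"SLI={current_sli:.4%}, error budget remaining={remaining:.0%}")
```

With these numbers roughly half of the budget is left, a signal that could be used to slow down releases before the SLO is actually breached.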

Customer Support & Operation

Feedback loops with the technical service team feed high‑frequency issues back into product improvement. Self‑service tools like Cloud Lens for SLS and container metadata preview help users resolve configuration problems, reducing ticket volume by ~30% and heartbeat‑related tickets by ~80%.

Future Outlook

Large language models are expected to further automate root‑cause analysis and data‑ingestion configuration, making SRE roles increasingly full‑stack and AI‑augmented.

