How Baidu’s Tianyan Log Service Overcomes ELK’s Scaling and Performance Limits

This article examines the challenges of logging in distributed services, compares the traditional ELK stack with Baidu's Tianyan solution, and details Tianyan's architecture, including Ingest, Store, Consumer, Elastic Agent, Fleet, APM, Beats, and Disruptor‑based high‑throughput pipelines. It also covers resource isolation, dynamic cleanup, and best‑practice recommendations for building a scalable, low‑latency log platform.

Baidu Geek Talk

Jun 19, 2023

Distributed Service Log Challenges

In large distributed systems each server generates massive logs, creating three core problems: huge log volume across many nodes, heterogeneous log formats, and the need for a scalable, reliable log service that can grow horizontally and vertically.

Typical ELK Solution

The Elastic Stack (ELK) is the common open‑source approach. It consists of Logstash for ingestion, Elasticsearch for storage and search, and Kibana for visualization. While ELK provides powerful search capabilities, it suffers from complex deployment, high resource consumption, limited multi‑tenant isolation, and difficulty scaling to tens of terabytes per day.

Tianyan Log Service Architecture

Tianyan, Baidu’s proprietary log platform, addresses ELK’s shortcomings with a modular architecture that separates ingestion, transport, storage, and consumption while integrating high‑performance components.

Ingest Component

The Ingest layer gathers logs via shippers (e.g., Elastic Agent, Beats) and sources (log files, syslog, network streams). It uses queues and processors to normalize and enrich data before forwarding.

Elastic Agent & Fleet

Elastic Agent provides a unified way to collect logs, metrics, and security data. Fleet centrally manages agents, their policies, and upgrades, allowing administrators to monitor agent health and push new integrations.

Elastic APM & Beats

Elastic APM captures application‑level performance metrics, while Beats act as lightweight data shippers for logs, metrics, network traffic, and Windows events.

Elasticsearch Ingest Pipelines

Before indexing, pipelines apply a series of processors (e.g., grok, date, rename) to transform raw log entries into a searchable schema.
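
To make the processor chain concrete, here is a minimal sketch that mimics grok, date, and rename steps in plain Python. It is not Elasticsearch code: the log line layout, the regular expression, and the field names (`ts`, `level`, `msg`, `log.level`) are assumptions chosen for the example.

```python
import re
from datetime import datetime, timezone

# Hypothetical log line layout: "<date> <time> <LEVEL> <message>"
LINE_PATTERN = re.compile(r"(?P<ts>\S+ \S+) (?P<level>[A-Z]+) (?P<msg>.*)")


def grok(event):
    """Extract named fields from the raw line, in the spirit of a grok processor."""
    match = LINE_PATTERN.match(event["message"])
    if match is None:
        raise ValueError("line does not match the expected pattern")
    return {**event, **match.groupdict()}


def parse_date(event):
    """Replace the textual `ts` field with an ISO 8601 `@timestamp` (UTC assumed)."""
    parsed = datetime.strptime(event["ts"], "%Y-%m-%d %H:%M:%S")
    out = {k: v for k, v in event.items() if k != "ts"}
    out["@timestamp"] = parsed.replace(tzinfo=timezone.utc).isoformat()
    return out


def rename(old, new):
    """Build a processor that renames one field."""
    def processor(event):
        out = dict(event)
        out[new] = out.pop(old)
        return out
    return processor


def run_pipeline(event, processors):
    """Apply each processor in order, passing the event along the chain."""
    for processor in processors:
        event = processor(event)
    return event


PIPELINE = [grok, parse_date, rename("level", "log.level")]
doc = run_pipeline({"message": "2023-06-19 10:15:00 ERROR disk full"}, PIPELINE)
```

Each processor returns a new dictionary rather than mutating its input, which keeps the steps independent and easy to reorder.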

Logstash

Logstash serves as a real‑time data collection engine, supporting a rich plugin ecosystem for inputs, filters, and outputs.

Store Component

The Store layer relies on Elasticsearch as the distributed search and analytics engine, offering near‑real‑time indexing for both structured and unstructured data.

Consumer Component

Consumers read from Elasticsearch via Kibana visualizations or direct client APIs (Java, Go, Python, etc.) to provide searchable dashboards and programmatic access.

High‑Concurrency Transport

Tianyan replaces ELK’s synchronous Logstash pipeline with an asynchronous design:

Log events are placed into a high‑performance Disruptor ring buffer, eliminating lock contention.

A secondary Bigpipe queue decouples transport from storage, enabling durable, back‑pressure‑aware processing.

BigQueue (memory‑mapped file queue) provides crash‑safe fallback when in‑memory queues overflow.

Together, these queues are reported to sustain millions of QPS with sub‑millisecond latency.
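
The buffering idea can be sketched in a few lines. The real Disruptor is a Java library that coordinates producers and consumers across threads without locks; the single‑threaded Python sketch below only illustrates two of the ideas from the list above: a preallocated ring buffer indexed by sequence number, and spilling to a fallback queue when the buffer is full. The class and method names are hypothetical, and the fallback is an in‑memory list standing in for a memory‑mapped file queue.

```python
class RingBuffer:
    """Fixed-size ring buffer; capacity must be a power of two."""

    def __init__(self, capacity):
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._slots = [None] * capacity  # preallocated slots, reused forever
        self._mask = capacity - 1        # cheap modulo for power-of-two sizes
        self._head = 0                   # next sequence to consume
        self._tail = 0                   # next sequence to publish

    def try_publish(self, event):
        """Store the event and return True, or return False when full."""
        if self._tail - self._head == len(self._slots):
            return False
        self._slots[self._tail & self._mask] = event
        self._tail += 1
        return True

    def poll(self):
        """Return the oldest event, or None when empty (events must not be None)."""
        if self._head == self._tail:
            return None
        event = self._slots[self._head & self._mask]
        self._head += 1
        return event


class LogTransport:
    """Ring buffer first; overflow goes to a fallback queue (hypothetical design)."""

    def __init__(self, capacity):
        self.ring = RingBuffer(capacity)
        self.fallback = []  # stand-in for a disk-backed queue

    def submit(self, event):
        if not self.ring.try_publish(event):
            self.fallback.append(event)

    def drain(self):
        """Return all buffered events, ring buffer first, then the overflow."""
        drained = []
        while (event := self.ring.poll()) is not None:
            drained.append(event)
        drained.extend(self.fallback)
        self.fallback.clear()
        return drained


transport = LogTransport(capacity=4)
for i in range(6):
    transport.submit(f"event-{i}")
```

With a capacity of 4 and six submitted events, the last two land in the fallback queue, and `drain()` returns all six in submission order.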

Resource Isolation

Each product line receives a unique identifier that tags every log event. During onboarding, users select dedicated transport and storage resources, preventing noisy‑neighbor effects. The platform dynamically provisions or releases resources based on configuration changes.
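
One way to picture this tagging and routing is the sketch below. It is an illustration under assumptions, not Tianyan's implementation: the `product_line_id` event key, the `Resources` fields, and the shared default pool used for product lines without dedicated resources are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    transport_queue: str
    storage_cluster: str


class IsolationRegistry:
    """Maps product-line IDs to dedicated transport and storage resources."""

    def __init__(self, shared):
        self._shared = shared  # assumed default pool for unregistered product lines
        self._dedicated = {}

    def onboard(self, product_line_id, resources):
        """Provision dedicated resources when a product line is onboarded."""
        self._dedicated[product_line_id] = resources

    def release(self, product_line_id):
        """Release dedicated resources after a configuration change."""
        self._dedicated.pop(product_line_id, None)

    def route(self, event):
        """Pick the resources for an event from its product-line tag."""
        return self._dedicated.get(event["product_line_id"], self._shared)


registry = IsolationRegistry(shared=Resources("queue-shared", "es-shared"))
registry.onboard("search", Resources("queue-search", "es-search"))
target = registry.route({"product_line_id": "search", "message": "query timeout"})
```

Because every event carries its product‑line tag, routing is a single dictionary lookup, and a burst from one product line only fills its own queue and cluster.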

Dynamic Cleanup & Storage Tiering

Tianyan monitors Elasticsearch cluster usage. When disk usage exceeds the configured thresholds, an automated cleanup job removes the oldest indices first. For long‑term retention, snapshots are taken and off‑loaded to low‑cost object storage (Baidu Object Storage, BOS) with a 180‑day TTL. Queries against archived data trigger snapshot restoration on demand.
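
The oldest‑first cleanup decision can be sketched as a pure function. The 85% threshold, the index naming, and the `IndexInfo` fields are assumptions for illustration; the source does not state the actual values, and a real job would go on to call the cluster's index‑deletion API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IndexInfo:
    name: str
    created: date
    size_gb: float


def plan_cleanup(indices, capacity_gb, threshold=0.85):
    """Return the names of the oldest indices to delete, oldest first,
    until projected usage is at or below threshold * capacity."""
    used = sum(index.size_gb for index in indices)
    limit = capacity_gb * threshold
    to_delete = []
    for index in sorted(indices, key=lambda i: i.created):
        if used <= limit:
            break
        to_delete.append(index.name)
        used -= index.size_gb
    return to_delete


indices = [
    IndexInfo("logs-2023.06.19", date(2023, 6, 19), 300.0),
    IndexInfo("logs-2023.06.17", date(2023, 6, 17), 300.0),
    IndexInfo("logs-2023.06.18", date(2023, 6, 18), 300.0),
]
plan = plan_cleanup(indices, capacity_gb=1000.0)
```

Here 900 GB of a 1000 GB cluster is in use, which is above the assumed 850 GB limit, so the plan contains only the oldest index, `logs-2023.06.17`.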

Search Capabilities

The platform supports five query types on message and exception fields:

Full‑text (multi_match best_fields)

Exact value (keyword analyzer)

Phrase search (type: phrase with slop)

Prefix search (type: phrase_prefix, max_expansions)

Logical query (simple_query_string with operators +, -, |, *, etc.)

Users can filter by log source ID, time range, log level, and custom criteria via the UI.
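
The five query types map naturally onto Elasticsearch query DSL bodies. The sketch below builds them as Python dictionaries. The `message` and `exception` fields come from the text; the `slop` and `max_expansions` values, the filter field names (`source_id`, `@timestamp`, `level`), and the use of a `keyword` analyzer override for exact matching are assumptions, since the source does not spell out the platform's actual mappings.

```python
FIELDS = ["message", "exception"]

# Extra multi_match options per query type; numeric values are illustrative.
MULTI_MATCH_OPTIONS = {
    "full_text": {"type": "best_fields"},
    "exact": {"analyzer": "keyword"},
    "phrase": {"type": "phrase", "slop": 2},
    "prefix": {"type": "phrase_prefix", "max_expansions": 50},
}


def build_query(kind, text):
    """Build the query clause for one of the five supported query types."""
    if kind == "logical":
        # simple_query_string understands operators such as +, -, | and *
        return {"simple_query_string": {"query": text, "fields": list(FIELDS)}}
    if kind not in MULTI_MATCH_OPTIONS:
        raise ValueError(f"unknown query type: {kind}")
    return {
        "multi_match": {
            "query": text,
            "fields": list(FIELDS),
            **MULTI_MATCH_OPTIONS[kind],
        }
    }


def build_search_body(kind, text, source_id, start, end, level=None):
    """Combine the text query with source, time-range, and level filters."""
    filters = [
        {"term": {"source_id": source_id}},
        {"range": {"@timestamp": {"gte": start, "lte": end}}},
    ]
    if level is not None:
        filters.append({"term": {"level": level}})
    return {"query": {"bool": {"must": [build_query(kind, text)], "filter": filters}}}


body = build_search_body(
    "phrase", "connection reset", "svc-42",
    "2023-06-19T00:00:00Z", "2023-06-19T23:59:59Z", level="ERROR",
)
```

Putting the source, time, and level conditions in the `filter` clause keeps them out of relevance scoring, so only the text query influences ranking.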

Best Practices & Operational Guidance

Key recommendations include:

Prefer SDK‑based ingestion for structured logs; use Minos file‑tailing for legacy services.

Leverage Disruptor and Bigpipe for high‑throughput pipelines.

Configure per‑product‑line resource isolation to avoid contention.

Enable automatic cleanup thresholds and snapshot‑based tiering to control storage costs.

Define concise filter rules (content, name, or combined) to reduce noise before logs enter the pipeline.
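
As an illustration of the last recommendation, the sketch below models content, name, and combined filter rules. The source does not define what "name" refers to; this example assumes it is the logger name, and the event keys (`message`, `logger`) and the drop‑on‑match semantics are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FilterRule:
    """Matches on content, on name, or on both when both are set."""
    content: Optional[str] = None  # substring to look for in the message
    name: Optional[str] = None     # logger-name prefix (assumed meaning of "name")

    def matches(self, event):
        if self.content is None and self.name is None:
            return False  # an empty rule matches nothing
        if self.content is not None and self.content not in event["message"]:
            return False
        if self.name is not None and not event["logger"].startswith(self.name):
            return False
        return True


def keep(event, drop_rules):
    """Return True when no drop rule matches, so the event enters the pipeline."""
    return not any(rule.matches(event) for rule in drop_rules)


DROP_RULES = [
    FilterRule(content="health check"),                  # content rule
    FilterRule(name="com.example.debug"),                # name rule
    FilterRule(content="retry", name="com.example.io"),  # combined rule
]

event = {"logger": "com.example.io.Client", "message": "retry 1 of 3"}
kept = keep(event, DROP_RULES)
```

The sample event matches the combined rule (both the content and the name condition hold), so `kept` is `False` and the event is dropped before it reaches the pipeline.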

Conclusion

By combining a modular Elastic Stack core with Baidu‑specific high‑performance queues, dynamic resource isolation, and automated lifecycle management, Tianyan delivers a low‑latency, scalable log service suitable for enterprise‑level distributed applications, while also integrating large‑model‑driven query assistance for faster troubleshooting.
