How Baidu’s Tianyan Log Service Overcomes ELK’s Scaling and Performance Limits
This article examines the challenges of logging in distributed services and compares the traditional ELK stack with Baidu's Tianyan solution. It details Tianyan's architecture (Ingest, Store, Consumer, Elastic Agent, Fleet, APM, Beats, and Disruptor‑based high‑throughput pipelines), covers resource isolation and dynamic cleanup, and closes with best‑practice recommendations for building a scalable, low‑latency log platform.
Distributed Service Log Challenges
In large distributed systems each server generates massive logs, creating three core problems: huge log volume across many nodes, heterogeneous log formats, and the need for a scalable, reliable log service that can grow horizontally and vertically.
Typical ELK Solution
The Elastic Stack (ELK) is the common open‑source approach. It consists of Logstash for ingestion, Elasticsearch for storage and search, and Kibana for visualization. While ELK provides powerful search capabilities, it suffers from complex deployment, high resource consumption, limited multi‑tenant isolation, and difficulty scaling to tens of terabytes per day.
Tianyan Log Service Architecture
Tianyan, Baidu’s proprietary log platform, addresses ELK’s shortcomings with a modular architecture that separates ingestion, transport, storage, and consumption while integrating high‑performance components.
Ingest Component
The Ingest layer gathers logs via shippers (e.g., Elastic Agent, Beats) and sources (log files, syslog, network streams). It uses queues and processors to normalize and enrich data before forwarding.
Elastic Agent & Fleet
Elastic Agent provides a unified way to collect logs, metrics, and security data. Fleet centrally manages agents, their policies, and upgrades, allowing administrators to monitor agent health and push new integrations.
Elastic APM & Beats
Elastic APM captures application‑level performance metrics, while Beats act as lightweight data shippers for logs, metrics, network traffic, and Windows events.
Elasticsearch Ingest Pipelines
Before indexing, pipelines apply a series of processors (e.g., grok, date, rename) to transform raw log entries into a searchable schema.
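As a rough illustration of the processor chain described above, the sketch below builds an ingest pipeline body as a Python dictionary. The grok pattern and the field names (`timestamp`, `level`, `msg`, `log_message`) are assumptions chosen for the example and are not taken from Tianyan's configuration.

```python
import json

# Hypothetical pipeline body: the pattern and field names are illustrative only.
pipeline = {
    "description": "Parse application log lines into structured fields",
    "processors": [
        {
            # grok: split the raw line into timestamp, level, and message text
            "grok": {
                "field": "message",
                "patterns": [
                    "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"
                ],
            }
        },
        {
            # date: parse the extracted timestamp (stored in @timestamp by default)
            "date": {
                "field": "timestamp",
                "formats": ["ISO8601"],
            }
        },
        {
            # rename: move the parsed text to the field the schema expects
            "rename": {
                "field": "msg",
                "target_field": "log_message",
            }
        },
    ],
}


def processor_names(body):
    """Return the processor types in the order they will run."""
    return [next(iter(p)) for p in body["processors"]]


print(json.dumps(pipeline, indent=2))
```

Processors run in the order listed, so the date and rename steps can rely on the fields that grok extracted.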
Logstash
Logstash serves as a real‑time data collection engine, supporting a rich plugin ecosystem for inputs, filters, and outputs.
Store Component
The Store layer relies on Elasticsearch as the distributed search and analytics engine, offering near‑real‑time indexing for both structured and unstructured data.
Consumer Component
Consumers read from Elasticsearch via Kibana visualizations or direct client APIs (Java, Go, Python, etc.) to provide searchable dashboards and programmatic access.
High‑Concurrency Transport
Tianyan replaces ELK’s synchronous Logstash pipeline with an asynchronous design:
Log events are placed into a high‑performance Disruptor ring buffer, eliminating lock contention.
A secondary Bigpipe queue decouples transport from storage, enabling durable, back‑pressure‑aware processing.
BigQueue (memory‑mapped file queue) provides crash‑safe fallback when in‑memory queues overflow.
Together, these queues are reported to sustain millions of QPS with sub‑millisecond latency.
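The idea of a fixed-size ring buffer with a fallback queue for overflow can be sketched as follows. This is a simplified, single-threaded illustration and not the LMAX Disruptor or BigQueue themselves: the `RingBuffer` and `Pipeline` classes are hypothetical, and a plain in-memory `deque` stands in for the memory-mapped fallback queue.

```python
from collections import deque


class RingBuffer:
    """Fixed-size ring buffer indexed by ever-increasing sequence numbers."""

    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self._slots = [None] * size
        self._mask = size - 1  # lets us wrap with a bitwise AND
        self._head = 0         # next sequence to read
        self._tail = 0         # next sequence to write

    def try_publish(self, event):
        """Store an event; return False instead of blocking when full."""
        if self._tail - self._head == len(self._slots):
            return False
        self._slots[self._tail & self._mask] = event
        self._tail += 1
        return True

    def try_consume(self):
        """Return the oldest event, or None when the buffer is empty."""
        if self._head == self._tail:
            return None
        index = self._head & self._mask
        event = self._slots[index]
        self._slots[index] = None
        self._head += 1
        return event


class Pipeline:
    """Ring buffer first; spill to a fallback queue when it overflows."""

    def __init__(self, ring_size):
        self.ring = RingBuffer(ring_size)
        self.overflow = deque()  # stand-in for a disk-backed queue

    def submit(self, event):
        # Once anything has spilled, keep spilling so ordering is preserved.
        if self.overflow or not self.ring.try_publish(event):
            self.overflow.append(event)

    def drain(self):
        """Return all pending events in submission order."""
        out = []
        while (event := self.ring.try_consume()) is not None:
            out.append(event)
        while self.overflow:
            out.append(self.overflow.popleft())
        return out
```

Because `None` marks an empty buffer in this sketch, events themselves must not be `None`. A production ring buffer would also need memory barriers and a wait strategy for concurrent producers and consumers, which are omitted here.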
Resource Isolation
Each product line receives a unique identifier that tags every log event. During onboarding, users select dedicated transport and storage resources, preventing noisy‑neighbor effects. The platform dynamically provisions or releases resources based on configuration changes.
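A minimal sketch of this tagging and routing scheme is shown below. The `Resources` and `Router` names, the `product_line_id` field, and the topic and cluster strings are all hypothetical; the source does not describe Tianyan's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Dedicated resources chosen for one product line at onboarding."""
    transport_topic: str
    storage_cluster: str


class Router:
    """Maps each product-line identifier to its own transport and storage."""

    def __init__(self):
        self._routes = {}

    def onboard(self, product_line_id, resources):
        # Provision (or re-provision) resources when configuration changes.
        self._routes[product_line_id] = resources

    def offboard(self, product_line_id):
        # Release resources when a product line is removed.
        self._routes.pop(product_line_id, None)

    @staticmethod
    def tag(product_line_id, message):
        # Every log event carries the identifier of its product line.
        return {"product_line_id": product_line_id, "message": message}

    def route(self, event):
        try:
            return self._routes[event["product_line_id"]]
        except KeyError:
            raise LookupError(
                f"no resources for product line {event.get('product_line_id')!r}"
            ) from None
```

Since routing depends only on the tag, a burst of traffic from one product line lands on that line's own topic and cluster and does not compete with the others.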
Dynamic Cleanup & Storage Tiering
Tianyan monitors Elasticsearch cluster usage. When disk usage exceeds thresholds, an automated cleanup removes the oldest indices. For long‑term retention, snapshots are taken and off‑loaded to low‑cost object storage (BOS) with a 180‑day TTL. Queries against archived data trigger snapshot restoration on demand.
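The threshold-based cleanup can be sketched as a pure function that picks which indices to drop. The 85% default threshold, the tuple layout, and the rule of deleting until usage falls back to the threshold are assumptions for illustration; the source only states that the oldest indices are removed once usage exceeds a threshold.

```python
from datetime import date


def indices_to_delete(indices, capacity_bytes, threshold=0.85):
    """Pick the oldest indices to delete until usage is within the threshold.

    indices: iterable of (name, created_on, size_bytes) tuples.
    Returns index names, oldest first.
    """
    indices = list(indices)
    used = sum(size for _, _, size in indices)
    limit = capacity_bytes * threshold
    doomed = []
    for name, _, size in sorted(indices, key=lambda item: item[1]):
        if used <= limit:
            break  # usage is back under the threshold
        doomed.append(name)
        used -= size
    return doomed


example = [
    ("logs-2024.01.03", date(2024, 1, 3), 40),
    ("logs-2024.01.01", date(2024, 1, 1), 30),
    ("logs-2024.01.02", date(2024, 1, 2), 30),
]
print(indices_to_delete(example, capacity_bytes=100, threshold=0.5))
```

In a real deployment the snapshot to object storage would need to complete before an index is deleted, so that archived data stays restorable.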
Search Capabilities
The platform supports five query types on message and exception fields:
Full‑text (multi_match best_fields)
Exact value (keyword analyzer)
Phrase search (type: phrase with slop)
Prefix search (type: phrase_prefix, max_expansions)
Logical query (simple_query_string with operators +, -, |, *, etc.)
Users can filter by log source ID, time range, log level, and custom criteria via the UI.
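The five query types and the UI filters could map to Elasticsearch query DSL roughly as sketched below. The helper names and the filter field names (`source_id`, `level`, `@timestamp`) are hypothetical, and expressing the exact-value search as a `multi_match` with the `keyword` analyzer is one possible reading of the source.

```python
FIELDS = ["message", "exception"]


def full_text(text):
    return {"multi_match": {"query": text, "fields": FIELDS, "type": "best_fields"}}


def exact_value(text):
    # The keyword analyzer keeps the query string as a single token.
    return {"multi_match": {"query": text, "fields": FIELDS, "analyzer": "keyword"}}


def phrase(text, slop=0):
    # slop: how many positions terms may be apart and still match the phrase.
    return {"multi_match": {"query": text, "fields": FIELDS,
                            "type": "phrase", "slop": slop}}


def prefix(text, max_expansions=50):
    # max_expansions caps how many terms the final prefix may expand to.
    return {"multi_match": {"query": text, "fields": FIELDS,
                            "type": "phrase_prefix",
                            "max_expansions": max_expansions}}


def logical(expression):
    # Supports operators such as +, -, | and * inside the expression.
    return {"simple_query_string": {"query": expression, "fields": FIELDS}}


def search_body(query, source_id, start, end, level=None):
    """Wrap a query with the source, time-range, and optional level filters."""
    filters = [
        {"term": {"source_id": source_id}},
        {"range": {"@timestamp": {"gte": start, "lte": end}}},
    ]
    if level is not None:
        filters.append({"term": {"level": level}})
    return {"query": {"bool": {"must": [query], "filter": filters}}}


body = search_body(phrase("connection refused", slop=1),
                   source_id="svc-42", start="now-1h", end="now", level="ERROR")
```

Placing the source, time, and level conditions in the `filter` clause keeps them out of relevance scoring, so only the text query affects ranking.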
Best Practices & Operational Guidance
Key recommendations include:
Prefer SDK‑based ingestion for structured logs; use Minos file‑tailing for legacy services.
Leverage Disruptor and Bigpipe for high‑throughput pipelines.
Configure per‑product‑line resource isolation to avoid contention.
Enable automatic cleanup thresholds and snapshot‑based tiering to control storage costs.
Define concise filter rules (content, name, or combined) to reduce noise before logs enter the pipeline.
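The last recommendation can be illustrated with a small rule matcher. This sketch assumes that "name" refers to the logger name and that a matching rule means the event is dropped before it enters the pipeline; both are assumptions, and `FilterRule` and `keep` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FilterRule:
    """Matches on message content, logger name, or both when both are set."""
    content_pattern: Optional[str] = None
    name_pattern: Optional[str] = None

    def matches(self, logger_name, message):
        if self.content_pattern is None and self.name_pattern is None:
            return False  # an empty rule matches nothing
        if (self.content_pattern is not None
                and not re.search(self.content_pattern, message)):
            return False
        if (self.name_pattern is not None
                and not re.fullmatch(self.name_pattern, logger_name)):
            return False
        return True


def keep(rules, logger_name, message):
    """Return True when no rule matches, i.e. the event should be shipped."""
    return not any(rule.matches(logger_name, message) for rule in rules)


rules = [
    FilterRule(content_pattern=r"health ?check"),                    # content only
    FilterRule(name_pattern=r"com\.example\.noisy\..*",
               content_pattern=r"DEBUG"),                            # combined
]
```

Keeping the rule set short matters because every rule is evaluated against every event on the ingestion path.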
Conclusion
By combining a modular Elastic Stack core with Baidu‑specific high‑performance queues, dynamic resource isolation, and automated lifecycle management, Tianyan delivers a low‑latency, scalable log service suitable for enterprise‑level distributed applications. It also integrates large‑model‑driven query assistance for faster troubleshooting.