
How Tianyan Beats ELK: Inside a High‑Performance Distributed Log Service

This article analyzes the challenges of logging in distributed services, compares the traditional ELK stack with Baidu's Tianyan platform, and details Tianyan's architecture, data collection, high‑throughput transmission, storage, retrieval, resource isolation, dynamic cleanup, and best‑practice recommendations, complete with code examples and performance insights.

Architect

Distributed Log Service Challenges

In a micro‑service environment, each service runs on many hosts and generates massive log streams. The main technical problems are:

Huge volume – logs are scattered across nodes and must be reliably collected and aggregated.

Diverse formats – different frameworks emit different fields, orders and levels, increasing parsing and indexing complexity.

Scalability & reliability – the logging pipeline must scale horizontally and vertically while keeping latency low.

ELK Stack Overview

The Elastic Stack (Elasticsearch, Logstash, Kibana) is the de‑facto solution for log ingestion, processing and visualization. Its typical pipeline is:

Ingest – gathers raw log data (files, syslog, network).

Beats / Elastic Agent – lightweight shippers that forward logs, metrics and security events.

Logstash – real‑time data collection engine with rich input, filter and output plugins.

Elasticsearch – distributed search and analytics engine with a REST API.

Kibana – visual UI for query and dashboard.

Tianyan vs. ELK

A side‑by‑side comparison, based on production experience, highlights five dimensions:

Integration effort – ELK requires a full deployment workflow and lengthy configuration; Tianyan needs only three steps: product‑line registration, app‑key retrieval, and SDK dependency addition.

Resource customization – in ELK, resource changes require a service restart, and a stack cannot choose among multiple resources; Tianyan lets each product line select its own transmission and storage resources, and changes take effect immediately.

Scaling cost & efficiency – ELK supports a single business line per stack instance; adding another line means deploying a new stack. Tianyan centralizes resources, supports dynamic product‑line onboarding, and lets resources be shared or exclusive via UI.

Dynamic log cleanup – ELK relies on manual discovery and deletion. Tianyan automatically monitors ES cluster usage and deletes the oldest indices when thresholds are hit.

Adaptive storage – ELK stores logs directly in ES, leading to high storage cost and limited retention. Tianyan offloads old logs to low‑cost object storage (BOS) and restores them on demand, achieving low cost and long retention.

Current production metrics: daily log volume ≈ 10 TB, QPS > 100 k, supporting > 1 000 product lines.

Tianyan System Architecture

The platform consists of four loosely coupled stages, designed for low latency and high throughput.

Log Collection – two mechanisms:

SDK that implements log‑framework appenders (log4j, logback, log4j2). Each LogEvent is enriched with a product‑line identifier for isolation.

File listener that parses log files with regular‑expression rules (zero‑code integration).

High‑Performance Transmission – logs are first placed into a lock‑free Disruptor ring buffer, then into a secondary Bigpipe queue for asynchronous decoupling. This eliminates lock contention and enables millions of messages per second. Key design of the Disruptor queue:

Fixed‑size array for cache‑friendly memory access.

Array pre‑fill to avoid GC overhead.

Cache‑line padding to prevent false sharing.

Bit‑wise indexing for O(1) enqueue/dequeue.

If both queues fail, a memory‑mapped file queue, BigQueue, provides a durable fallback.

Storage – a consumer polls Bigpipe, batches logs with BulkProcessor, and writes to Elasticsearch. Transmission and storage resources are isolated per product line, preventing cross‑service contention.

Dynamic Cleanup & Adaptive Storage – when ES usage exceeds a configurable threshold, the system automatically deletes the oldest indices. For long‑term retention, index snapshots are moved to BOS (low‑cost object storage), where they are retained for up to 180 days and can be restored on demand.
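The batching step in the Storage stage can be sketched in plain Java. This is a minimal, hypothetical stand-in that flushes on batch size only; it is not the Elasticsearch BulkProcessor API, and the names LogBatcher, maxBatchSize and sink are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical size-based batcher; a simplified stand-in, not the BulkProcessor API.
class LogBatcher {
    private final int maxBatchSize;
    private final Consumer<List<String>> sink; // e.g. one bulk write to the storage backend
    private List<String> buffer = new ArrayList<>();

    LogBatcher(int maxBatchSize, Consumer<List<String>> sink) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        this.maxBatchSize = maxBatchSize;
        this.sink = sink;
    }

    // Buffer one log line and flush as soon as the batch is full.
    synchronized void add(String logLine) {
        buffer.add(logLine);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    // Hand the current batch to the sink and start a fresh buffer.
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> batch = buffer;
        buffer = new ArrayList<>();
        sink.accept(batch);
    }
}
```

A production batcher would usually also flush on a timer and on total payload size, so that a slow trickle of logs is not held back indefinitely.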

Tianyan architecture overview

Log Collection Details

The SDK supports major Java logging frameworks. Below is a Logback appender that forwards filtered events to a thread pool:

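import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.AppenderBase;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// MessageLogSender, LogbackTask and LogNodeFactory are Tianyan SDK classes; their imports are omitted here.
public class LogClientAppender<E> extends AppenderBase<E> {
    private static final Logger LOGGER = LoggerFactory.getLogger(LogClientAppender.class);

    @Override
    protected void append(E eventObject) {
        // filter(...) is defined elsewhere in the SDK; a null result is treated as "drop this event".
        ILoggingEvent event = filter(eventObject);
        if (event != null) {
            // Hand the event to the sender's executor so the logging thread is not blocked.
            MessageLogSender.getExecutor().submit(
                new LogbackTask(event, LogNodeFactory.getLogNodeSyncDto()));
        }
    }
}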

Trace logs are captured via interceptors. The MyBatis interceptor builds a SqlLogNode and pushes it to the queue:

TraceFactory.getSqltracer().end(returnObj, className, methodName,
    realParams, dbType, sqlType, sql, sqlUrl);

Registration of the interceptor is manual:

sqlSessionFactory.getConfiguration().addInterceptor(new IlogMybatisPlugin());
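Alongside the SDK path, the file listener mentioned earlier parses log files with regular‑expression rules. The article does not show those rules, so the following is only a sketch under stated assumptions: the line layout, the field names and the RegexLogLineParser class are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for one assumed layout:
// "2024-01-01 12:00:00,123 ERROR [order-service] payment failed"
class RegexLogLineParser {
    private static final Pattern LINE = Pattern.compile(
        "^(?<time>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}[,.]\\d{3})\\s+"
        + "(?<level>TRACE|DEBUG|INFO|WARN|ERROR)\\s+"
        + "\\[(?<logger>[^\\]]+)\\]\\s+(?<message>.*)$");

    private final String productLineId;

    RegexLogLineParser(String productLineId) {
        this.productLineId = productLineId;
    }

    // Returns the extracted fields, or empty if the line does not match the rule.
    Optional<Map<String, String>> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        Map<String, String> fields = new LinkedHashMap<>();
        // Tag every event with the product-line identifier, mirroring what the SDK does.
        fields.put("productLine", productLineId);
        fields.put("time", m.group("time"));
        fields.put("level", m.group("level"));
        fields.put("logger", m.group("logger"));
        fields.put("message", m.group("message"));
        return Optional.of(fields);
    }
}
```

Lines that do not match, such as stack-trace continuation lines, come back empty here; a real listener would need a rule for attaching them to the preceding event.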

High‑Concurrency Transmission

In a typical ELK pipeline, logs flow directly from Logstash to ES, so throughput depends on both services. Tianyan decouples the path using Disruptor and Bigpipe, achieving tens of millions of events per second.

Disruptor – lock‑free inter‑thread messaging based on a RingBuffer.

Bigpipe – distributed middleware supporting Topic and Queue models; guarantees no loss and no duplication.

BigQueue – memory‑mapped file queue used as a fallback when primary queues fail.
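Three of the Disruptor design points listed earlier (fixed‑size array, pre‑filled slots, bit‑wise indexing) can be illustrated with a small single‑producer, single‑consumer ring buffer. This is a simplified sketch, not the Disruptor library or Tianyan's code: the class name SpscRingBuffer is hypothetical, and cache‑line padding and multi‑producer support are deliberately left out.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified ring buffer, safe for exactly one producer thread and one consumer thread.
class SpscRingBuffer {
    // Mutable slot that is allocated once and reused, so steady-state publishing creates no garbage.
    static final class Slot {
        String payload;
    }

    private final Slot[] slots;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next sequence to read
    private final AtomicLong tail = new AtomicLong(); // next sequence to write

    SpscRingBuffer(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new Slot[capacity];
        for (int i = 0; i < capacity; i++) {
            slots[i] = new Slot(); // pre-fill the array up front
        }
        mask = capacity - 1;
    }

    // Producer side: returns false when the buffer is full instead of blocking.
    boolean offer(String payload) {
        long t = tail.get();
        if (t - head.get() == slots.length) {
            return false;
        }
        slots[(int) (t & mask)].payload = payload; // bit mask replaces the modulo
        tail.set(t + 1); // volatile write publishes the slot to the consumer
        return true;
    }

    // Consumer side: returns null when the buffer is empty.
    String poll() {
        long h = head.get();
        if (h == tail.get()) {
            return null;
        }
        Slot slot = slots[(int) (h & mask)];
        String payload = slot.payload;
        slot.payload = null; // drop the reference so the payload can be collected
        head.set(h + 1);
        return payload;
    }
}
```

When offer returns false, a caller could route the event to a durable fallback, which is the role the article assigns to BigQueue.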

Transmission pipeline

Log Retrieval

Stored logs are searchable via a visual UI that filters by product‑line ID, time range, log level and five query types. Example DSL for a text query:

{
  "query": {
    "bool": {
      "must": [{
        "multi_match": {
          "query": "searchValue",
          "fields": ["message", "exception"],
          "type": "best_fields"
        }
      }]
    }
  }
}

Term query (keyword analyzer):

{
  "query": {
    "bool": {
      "must": [{
        "multi_match": {
          "query": "searchValue",
          "fields": ["message", "exception"],
          "type": "best_fields",
          "analyzer": "keyword"
        }
      }]
    }
  }
}

Phrase, prefix and logical queries follow similar JSON structures with parameters such as slop for phrase tolerance and max_expansions for prefix expansion.
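The phrase and prefix variants are not shown in the article. As a sketch, assuming they reuse the same multi_match structure with the Elasticsearch types phrase and phrase_prefix, a small helper could assemble them as follows. The class LogQueryDsl and its methods are hypothetical, and the escaping covers only quotes and backslashes.

```java
// Hypothetical helper that builds query JSON in the same shape as the text query above.
class LogQueryDsl {
    // Minimal escaping for quotes and backslashes; control characters are not handled.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Phrase query: slop is the number of positions terms may be moved apart.
    static String phrase(String value, int slop) {
        return wrap(value, "\"type\":\"phrase\",\"slop\":" + slop);
    }

    // Prefix query: max_expansions caps how many terms the last word may expand to.
    static String prefix(String value, int maxExpansions) {
        return wrap(value, "\"type\":\"phrase_prefix\",\"max_expansions\":" + maxExpansions);
    }

    private static String wrap(String value, String typeClause) {
        return "{\"query\":{\"bool\":{\"must\":[{\"multi_match\":{"
            + "\"query\":\"" + escape(value) + "\","
            + "\"fields\":[\"message\",\"exception\"],"
            + typeClause
            + "}}]}}}";
    }
}
```

For anything beyond a sketch, a JSON library or the official Elasticsearch client's query builders would be a safer choice than string concatenation.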

Resource Isolation

To prevent contention among thousands of product lines, Tianyan isolates transmission and storage resources per line. The workflow:

Business services generate logs, which are collected through the SDK or through Minos (file‑based collection).

After a product line configures its ES and Bigpipe resources, a listener creates or destroys the corresponding ES client and Bigpipe client.

Log subscribers receive logs, apply product‑line filter rules, and push events into an in‑memory channel.

Dispatchers pull from the channel and write to ES; failures trigger back‑off strategies.

Users query logs via the UI.
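The second step of this workflow, a listener that creates or destroys clients as product lines configure their resources, might look roughly like the registry below. It is a sketch under assumptions: ProductLineClientRegistry, its method names and the endpoint string are all hypothetical, and the client type is left generic rather than tied to a real ES or Bigpipe client.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical registry that keeps one client per product line.
class ProductLineClientRegistry<C extends AutoCloseable> {
    private final Map<String, C> clients = new ConcurrentHashMap<>();
    private final Function<String, C> factory; // builds a client for an endpoint

    ProductLineClientRegistry(Function<String, C> factory) {
        this.factory = factory;
    }

    // Called when a product line configures or changes its resource.
    void onConfigured(String productLineId, String endpoint) {
        C previous = clients.put(productLineId, factory.apply(endpoint));
        closeQuietly(previous); // release the client that was replaced, if any
    }

    // Called when a product line's resource is removed.
    void onRemoved(String productLineId) {
        closeQuietly(clients.remove(productLineId));
    }

    // Returns the client for a product line, or null if none is configured.
    C clientFor(String productLineId) {
        return clients.get(productLineId);
    }

    private void closeQuietly(C client) {
        if (client == null) {
            return;
        }
        try {
            client.close();
        } catch (Exception e) {
            // A real system would log this; the sketch ignores close failures.
        }
    }
}
```

Keying clients by product line is what keeps one line's traffic from using another line's connections, which is the isolation property the workflow describes.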

Resource isolation diagram

Dynamic Cleanup & Storage Downgrade

When ES cluster usage exceeds a configured threshold, Tianyan automatically deletes the oldest indices until usage falls below the limit. For long‑term retention, snapshots are taken and stored in BOS for up to 180 days. On demand, a snapshot is restored to ES for querying.

Periodically query cluster health and usage.

Calculate index size and delete oldest indices until the threshold is met.

Take snapshots of remaining indices and copy them to BOS.

When a historic query is issued, retrieve the snapshot from BOS and restore it to ES.
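The deletion step above can be sketched as a small planning function: sort indices by age and mark the oldest for deletion until projected usage drops to the threshold. The class IndexCleanupPlanner, the IndexInfo fields and the percentage-based threshold are assumptions for illustration; the sketch only decides what to delete and makes no Elasticsearch calls.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical planner: picks the oldest indices to delete until usage is within the threshold.
class IndexCleanupPlanner {
    static final class IndexInfo {
        final String name;
        final long createdAtMillis;
        final long sizeBytes;

        IndexInfo(String name, long createdAtMillis, long sizeBytes) {
            this.name = name;
            this.createdAtMillis = createdAtMillis;
            this.sizeBytes = sizeBytes;
        }
    }

    // Returns index names to delete, oldest first; empty if usage is already within the limit.
    static List<String> plan(List<IndexInfo> indices, long usedBytes,
                             long capacityBytes, int thresholdPercent) {
        List<IndexInfo> oldestFirst = new ArrayList<>(indices);
        oldestFirst.sort(Comparator.comparingLong((IndexInfo i) -> i.createdAtMillis));

        // Divide first to avoid overflow on very large capacities.
        long limitBytes = capacityBytes / 100 * thresholdPercent;
        long projected = usedBytes;
        List<String> toDelete = new ArrayList<>();
        for (IndexInfo index : oldestFirst) {
            if (projected <= limitBytes) {
                break;
            }
            toDelete.add(index.name);
            projected -= index.sizeBytes;
        }
        return toDelete;
    }
}
```

Separating the decision from the deletion makes it straightforward to verify that an index has a snapshot in BOS before it is actually removed.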

Cleanup & downgrade flow

Best Practices & Operational Insights

Generate a unique product‑line identifier and tag every log event.

Choose the SDK for appender‑based integration or Minos for zero‑code, file‑based ingestion; both enrich fields automatically.

Configure per‑product‑line transmission and storage resources to avoid cross‑service contention.

Define log‑filter rules (by content, name, or both) in the UI; the platform applies them during transmission.

Leverage Disruptor and Bigpipe for lock‑free, high‑throughput pipelines.

Enable automatic cleanup and storage downgrade to keep ES clusters healthy and cost‑effective.

Conclusion

A well‑designed distributed log platform such as Tianyan solves the scalability, performance and manageability challenges of enterprise‑level logging. By combining SDK‑based collection, lock‑free high‑throughput pipelines, per‑product‑line resource isolation, and automated lifecycle management, the system delivers fast, reliable and cost‑efficient log services for modern micro‑service ecosystems.
