
Design and Experience of a Near Real-Time Log System Based on Kafka and Elasticsearch

This article describes the architecture, deployment, configuration, maintenance, and performance results of a large‑scale near real‑time logging platform built with Kafka, Flume, and Elasticsearch, highlighting practical lessons and future plans for resource‑efficient operation.


Background

Log data is critical for development and operations teams when troubleshooting issues, and fast search is a core requirement. Among open‑source solutions, the ELK (now Elastic) Stack is the most popular. Elasticsearch provides scalable, real‑time search and analytics on top of Apache Lucene.

Based on company needs, we built a near real‑time log system centered on Kafka and Elasticsearch, capable of handling up to hundreds of billions of logs per day.

1. Overall Architecture

To maximize resource utilization, we mixed CPU‑, memory‑, and I/O‑intensive services on the same machines. As the deployment diagram shows, each data center runs its own log collection and transport layers, with Kafka, Flume, and an Elasticsearch client node co‑located on a single server; this reduces hardware cost and cross‑data‑center traffic while providing disaster recovery.

2. Kafka Usage Summary

Kafka delivers excellent performance and stability. Topics should be partitioned carefully: high‑volume applications are isolated in dedicated topics so that their bursts cannot impact normal traffic, and the number of partitions, which influences overall throughput, must be tuned to the data scale.
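As an illustration only, here is a minimal sketch of how such dedicated topics might be provisioned with Kafka's AdminClient; the broker address, topic names, partition counts, and replication factor are hypothetical values, not our production configuration.

```java
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class LogTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical broker address.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // A high-volume application gets its own topic with more partitions,
            // so its spikes cannot starve normal traffic in the shared topic.
            NewTopic heavy = new NewTopic("logs-app-heavy", 48, (short) 2);
            NewTopic shared = new NewTopic("logs-shared", 12, (short) 2);
            admin.createTopics(Arrays.asList(heavy, shared)).all().get();
        }
    }
}
```

Since partition count also caps downstream consumer parallelism, it should be sized against each topic's peak write rate rather than a cluster‑wide default.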

3. Flume Usage Summary

We customized three Flume components, as sketched below:

- an Elasticsearch sink that performs first‑stage filtering and controls write concurrency;
- a Kafka source that supports dynamic configuration and online/offline switching;
- a selector that dynamically chooses target channels and consumer threads, switching between memory and file channels according to policy.
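Our implementations are internal, but the following is a minimal sketch, built on Flume's public Sink API, of what a filtering, concurrency‑limited Elasticsearch sink could look like; the property names (maxConcurrentWrites, dropPattern) are hypothetical and the bulk‑indexing call is elided.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Semaphore;

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

public class FilteringEsSink extends AbstractSink implements Configurable {
    private Semaphore writePermits; // caps concurrent writes to Elasticsearch
    private String dropPattern;     // first-stage filter

    @Override
    public void configure(Context context) {
        writePermits = new Semaphore(context.getInteger("maxConcurrentWrites", 8));
        dropPattern = context.getString("dropPattern", "");
    }

    @Override
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            Event event = channel.take();
            if (event == null) {
                tx.commit();
                return Status.BACKOFF; // channel empty, back off
            }
            String body = new String(event.getBody(), StandardCharsets.UTF_8);
            // First-stage filtering: drop events matching the configured pattern.
            if (dropPattern.isEmpty() || !body.contains(dropPattern)) {
                writePermits.acquire(); // back-pressure before writing to ES
                try {
                    // ... bulk-index the event into Elasticsearch here ...
                } finally {
                    writePermits.release();
                }
            }
            tx.commit();
            return Status.READY;
        } catch (Exception e) {
            tx.rollback();
            return Status.BACKOFF;
        } finally {
            tx.close();
        }
    }
}
```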

4. Elasticsearch Usage Summary

Key configurations that improve throughput include enabling transport compression (transport.tcp.compress: true) and setting index.translog.durability: async. Experience notes: all nodes must run the same Java minor version; write‑concurrency limits are essential to avoid Full GC pauses under heavy load; and Elasticsearch 5.0 is still maturing for large‑scale deployments.
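For concreteness, here is one way the translog setting could be applied at runtime through Elasticsearch's low‑level REST client; the host name and index pattern are placeholders. Note that transport.tcp.compress is a static node setting and belongs in elasticsearch.yml rather than in an API call.

```java
import java.util.Collections;

import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.RestClient;

public class TranslogTuning {
    public static void main(String[] args) throws Exception {
        // Placeholder client-node address.
        try (RestClient client = RestClient.builder(
                new HttpHost("es-client-1", 9200, "http")).build()) {
            // Async translog trades a small durability window for much
            // cheaper writes: fsync happens in the background, not per request.
            String body = "{\"index\":{\"translog\":{\"durability\":\"async\"}}}";
            client.performRequest("PUT", "/logs-*/_settings",
                    Collections.<String, String>emptyMap(),
                    new NStringEntity(body, ContentType.APPLICATION_JSON));
        }
    }
}
```

The same setting can also be baked into an index template so that each day's new log index inherits it automatically.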

5. Elasticsearch Cluster Maintenance

Monitor shard size and balance, use routing for large‑index projects, schedule hot‑to‑cold node migrations during off‑peak hours, and regularly pull cluster health via the REST APIs. Recommended practices include keeping an eye on the bulk and search thread pools, watching segment sizes and forcing merges when needed, and keeping heap usage in check.
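The kind of periodic health pull we mean can be as simple as hitting the standard _cluster/health and _cat/shards endpoints; the sketch below uses the low‑level REST client with a placeholder host.

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ClusterHealthPoller {
    public static void main(String[] args) throws Exception {
        // Placeholder client-node address.
        try (RestClient client = RestClient.builder(
                new HttpHost("es-client-1", 9200, "http")).build()) {
            // Overall status plus relocating/unassigned shard counts.
            Response health = client.performRequest("GET", "/_cluster/health");
            System.out.println(EntityUtils.toString(health.getEntity()));
            // Per-shard sizes and placement, to spot oversized or unbalanced shards.
            Response shards = client.performRequest("GET", "/_cat/shards?v");
            System.out.println(EntityUtils.toString(shards.getEntity()));
        }
    }
}
```

Feeding these responses into an alerting system turns ad hoc checks into continuous monitoring.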

6. Performance Metrics

The cluster consists of three mid‑range machines per IDC running Kafka, Flume, and the Elasticsearch client node, plus ten high‑end machines for Elasticsearch. It processes roughly 9 × 10⁹ logs per day on average, which works out to just over 100,000 logs per second, with peak throughput around 400,000 TPS.

7. Future Plans

We plan to containerize mixed‑deployment nodes to isolate resources, leverage new Elasticsearch 5.0 features for rapid scaling, add nodes during write peaks, and release idle machines for other services to further improve resource utilization.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: performance, operations, Elasticsearch, Kafka, Log Management
Written by

Tongcheng Travel Technology Center

Pursue excellence, start again with Tongcheng! More technical insights to help you along your journey and make development enjoyable.
