
How We Scaled Real‑Time Log Analysis to 2 TB Daily with ELK

This article shares the author's practical experience building a real‑time log analysis platform at Sina, covering service scope, ELK architecture, performance optimizations, usability improvements, new features, common pitfalls, and a concise Q&A for engineers handling massive log streams.

21CTO

Service Overview

Since joining Sina in early 2014, the author has worked on real‑time log analysis using the ELK stack (Elasticsearch, Logstash, Kibana). The service now supports more than ten internal products, including Weibo, Weipan, cloud storage, and elastic compute, and processes roughly 3.2 billion log entries (about 2 TB) per day.

Technical Architecture

The architecture follows a classic ELK pipeline:

Kafka serves as the message queue that receives user logs.

Logstash parses logs and normalizes them into JSON for Elasticsearch.

Elasticsearch serves as the schemaless, real‑time data store with powerful search and aggregation capabilities.

Kibana provides data visualization on top of Elasticsearch.
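
To make the Logstash step concrete, here is a minimal Python sketch of the kind of normalization it performs: turning one raw log line into a JSON document ready for indexing. The log layout, the regular expression, and the field names are hypothetical and are not Sina's actual formats; in the pipeline described above this work is done by Logstash filters, not by Python.

```python
import json
import re

# Hypothetical layout: "<ip> <ISO timestamp> <method> <path> <status> <bytes>"
LINE_RE = re.compile(
    r"(?P<client_ip>\S+) (?P<timestamp>\S+) (?P<method>[A-Z]+) "
    r"(?P<path>\S+) (?P<status>\d{3}) (?P<bytes>\d+)$"
)


def normalize(line):
    """Return a JSON string for one log line, tagging lines that do not match."""
    text = line.strip()
    match = LINE_RE.match(text)
    if match is None:
        # Keep the raw line and tag it, instead of silently dropping it.
        return json.dumps({"message": text, "tags": ["_grokparsefailure"]})
    doc = match.groupdict()
    # Store numeric fields as numbers so they can be aggregated later.
    doc["status"] = int(doc["status"])
    doc["bytes"] = int(doc["bytes"])
    return json.dumps(doc, sort_keys=True)
```

Tagging unparseable lines rather than discarding them is what makes parse failures searchable later on.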

Improving Service Quality

Optimizations were applied at three levels:

Hardware/System: Enabled hyper‑threading, disabled swap, increased max open files.

Application: Tuned the Java version, ES_HEAP_SIZE, and the bulk queue size, and set default index templates (shard count, replica count, not_analyzed strings, doc_values) to avoid out‑of‑memory (OOM) errors.

Index Management: Built an independent Elasticsearch index management system using Celery for distributed scheduling of create, optimize, close, delete, and snapshot tasks.
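
As an illustration of the index management level, the sketch below decides which daily indices should be closed or deleted based on their age. The `logstash-YYYY.MM.DD` naming, the 7‑day and 30‑day thresholds, and the function name are assumptions made for this example, not details from the original system. In a setup like the one described, each resulting action would be handed to a Celery task; that wiring is omitted here.

```python
from datetime import date


def plan_actions(index_names, today, close_after_days=7,
                 delete_after_days=30, prefix="logstash-"):
    """Map each daily index to 'keep', 'close', or 'delete' based on its age."""
    plan = {}
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not a daily index covered by this policy
        try:
            year, month, day = (int(p) for p in name[len(prefix):].split("."))
            index_day = date(year, month, day)
        except ValueError:
            continue  # suffix is not a valid YYYY.MM.DD date
        age = (today - index_day).days
        if age >= delete_after_days:
            plan[name] = "delete"
        elif age >= close_after_days:
            plan[name] = "close"
        else:
            plan[name] = "keep"
    return plan
```

Keeping the decision logic in a pure function makes the policy easy to test separately from the scheduler that executes it.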

An HDFS snapshot plugin was added to back up indices, primarily the Kibana configuration indices. Monitoring combines the internal sinawatch system for system‑level alerts with custom scripts for application‑level metrics such as JVM heap usage, Kibana availability, Kafka consumer lag, and log parsing failures. The commercial Marvel plugin was not adopted.
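
A custom heap check of the kind mentioned above might look like the following sketch. It assumes the node statistics have already been fetched from Elasticsearch's `_nodes/stats/jvm` endpoint and decoded into a dictionary; the function name and the 90 % default are assumptions for illustration, and fetching and alerting are left out.

```python
def nodes_over_heap_threshold(node_stats, threshold_percent=90):
    """Return sorted (node name, heap percent) pairs at or above the threshold."""
    hot = []
    for node_id, node in node_stats.get("nodes", {}).items():
        heap = node.get("jvm", {}).get("mem", {}).get("heap_used_percent")
        if heap is not None and heap >= threshold_percent:
            # Fall back to the node id when the node reports no name.
            hot.append((node.get("name", node_id), heap))
    return sorted(hot)
```

A script like this can run on a timer and pass its result to whatever alerting channel is already in place.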

Enhancing Usability

Key usability improvements include:

Accurate IP‑to‑region/ISP mapping by replacing the default MaxMind free DB with a custom binary DB (maxmindDB) and a new Logstash filter (logstash‑filter‑geoip2).

Automation of log ingestion through three steps: a simple UI for users to define log formats, a Python API that auto‑generates Logstash configuration, and an upcoming Docker‑based deployment pipeline.

Better visualization support: the team started with Kibana 3 and then customized it to support multi‑group‑by and percentage calculations; the migration to Kibana 4 leverages Elasticsearch aggregations for richer dashboards.
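
The configuration‑generation step of the ingestion automation can be sketched as a small template renderer. This is a hypothetical illustration, not the team's actual API: the function name, host names, and topic are placeholders, and the option names (`bootstrap_servers`, `topics`, `hosts`) follow recent releases of the Logstash Kafka input and Elasticsearch output plugins; older plugin versions use different option names.

```python
TEMPLATE = """input {{
  kafka {{
    bootstrap_servers => "{kafka_servers}"
    topics => ["{topic}"]
  }}
}}
filter {{
  grok {{
    match => {{ "message" => "{pattern}" }}
  }}
}}
output {{
  elasticsearch {{
    hosts => [{hosts}]
    index => "{index_prefix}-%{{+YYYY.MM.dd}}"
  }}
}}
"""


def render_logstash_config(topic, grok_pattern, index_prefix,
                           kafka_servers="kafka:9092", es_hosts=("es:9200",)):
    """Render a Logstash pipeline: Kafka input, grok filter, Elasticsearch output."""
    if '"' in grok_pattern:
        # Quoting rules are out of scope for this sketch, so reject such patterns.
        raise ValueError("grok pattern must not contain double quotes")
    hosts = ", ".join('"%s"' % host for host in es_hosts)
    return TEMPLATE.format(kafka_servers=kafka_servers, topic=topic,
                           pattern=grok_pattern, hosts=hosts,
                           index_prefix=index_prefix)
```

For example, `render_logstash_config("weibo-access", "%{IP:client_ip} %{NUMBER:status}", "weibo")` yields a pipeline that reads the `weibo-access` topic and writes to daily `weibo-` indices.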

Providing New Features

The Chinese IK analyzer plugin (elasticsearch‑analysis‑ik) was installed, enabling proper tokenization of Chinese terms (e.g., treating “中国” as a single token), which improves search relevance for Chinese content.

Common Pitfalls

Elasticsearch JVM heap usage above 90 %, leading to frequent GC pauses and node restarts; mitigated by enabling doc_values, limiting the heap that queries can consume, using analyzed strings only for search (not for aggregations), and closing unused indices.

A steep learning curve for the Elasticsearch Query DSL, facets, and aggregations; it helps to inspect the request bodies that Kibana sends or to use Marvel’s autocomplete features.

Logstash failures caused by unofficial plugins or extensive Ruby filters; advice is to prefer official plugins and monitor Kafka consumer lag or Elasticsearch indexing rate as indirect health checks.

Kibana lacks native multi‑tenant isolation, causing dashboard conflicts among users; some custom solutions and snapshot backups to HDFS have been employed.

High communication overhead when negotiating log formats and visualization requirements; ongoing work on automated ingestion and user training aims to reduce this cost.
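
The doc_values and not_analyzed mitigation from the first pitfall above is usually applied through a default index template. The sketch below builds such a template body as a Python dictionary, using the mapping syntax of the Elasticsearch 1.x/2.x era (`string` fields with `"index": "not_analyzed"`); later major versions replace this with the `keyword` type. The index pattern, shard and replica counts, and the function name are assumptions for illustration.

```python
import json


def build_log_template(pattern="logstash-*", shards=5, replicas=1):
    """Build an index template that stores new string fields un-analyzed with doc_values."""
    return {
        "template": pattern,
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            "_default_": {
                "dynamic_templates": [
                    {
                        "strings_not_analyzed": {
                            "match": "*",
                            "match_mapping_type": "string",
                            "mapping": {
                                "type": "string",
                                "index": "not_analyzed",
                                # Keep field data on disk instead of the JVM heap.
                                "doc_values": True,
                            },
                        }
                    }
                ]
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_log_template(), indent=2))
```

Fields that genuinely need full‑text search can still be mapped as analyzed explicitly; the template only changes the default for newly seen string fields.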

Q & A

Q: Why does Logstash‑to‑Elasticsearch sometimes time out? A: Timeouts often occur when Elasticsearch JVM heap usage is high; increasing memory and avoiding aggregations on analyzed strings helps.

Q: How do you monitor log parsing errors? A: Search Elasticsearch for events tagged _grokparsefailure or _jsonparsefailure (Logstash adds these to the tags field when parsing fails); use watch or a tool such as elastalert for alerts.

Q: Difference between a Hadoop‑based big‑data platform and Kibana visualizations? A: Hadoop processes offline batch jobs, while Elasticsearch provides real‑time search and aggregation; Kibana offers immediate visual feedback, unlike Hadoop’s offline reports.

Q: Do you separate data and query nodes in the ES cluster? A: Not yet; the current setup uses the HTTP protocol to write directly to query nodes.
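
For the parsing‑error question above, a search body along the following lines could drive such an alert. It is a sketch under assumptions: events carry Logstash's default `tags` and `@timestamp` fields, and the function name and the 15‑minute window are chosen only for illustration.

```python
def parse_failure_query(since="now-15m"):
    """Build a search body that counts recent events carrying a parse-failure tag."""
    return {
        "size": 0,  # only the counts are needed, not the matching documents
        "query": {
            "bool": {
                "must": [
                    {"terms": {"tags": ["_grokparsefailure", "_jsonparsefailure"]}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        # Break the total down by tag to see which parser is failing.
        "aggs": {"by_tag": {"terms": {"field": "tags"}}},
    }
```

An alerting job can post this body to the search endpoint of the log indices and raise an alert when the hit count crosses a chosen limit.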

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
