
How Dada Scaled Log Processing to 130 Billion Entries Daily with Kubernetes and Storm

This article details how Dada’s SRE team rebuilt its logging platform—replacing ELK with Filebeat and Storm, deploying Elasticsearch hot and cold nodes on Kubernetes, and optimizing log collection, parsing, and storage to handle over 130 billion daily log entries and 300 TB of data.


1. Background

After the 2016 merger of Dada and JD Daojia, online services grew rapidly, and the existing ELK‑based log system could no longer meet the massive log volume and query demands. Dada needed a more efficient, automated logging solution capable of processing over 130 billion logs per day, storing more than 14 TB daily, and managing a total of 300 TB.

2. Historical Baggage

The first ELK system suffered from three main problems:

Log collection not automated

Logs lacked standardized parsing

Elasticsearch storage was chaotic

Log collection not automated

Each new log ingestion request required developers to coordinate with operations to modify Flume configurations, a manual and error‑prone process that also caused missed logs during service scaling.

Logs lacked standardized parsing

The classic ELK pipeline (Flume → Kafka → Logstash → Elasticsearch → Kibana) used Logstash’s grok plugin, which became a bottleneck for large‑scale log parsing.

Elasticsearch storage chaotic

Loose index management produced more than 600 indices with unclear purposes, and auto‑generated mapping templates caused field type conflicts, making data storage and Kibana queries difficult.

3. System Evolution

To solve the three issues, Dada standardized log formats, used a single Elasticsearch index per day, and differentiated logs by tags. Large‑volume logs or special storage requests were split into separate Kafka topics and Elasticsearch indices.

Automated Log Collection

Log file names were standardized, e.g.:

/path/to/project/$(appName)_info.log
/path/to/project/$(appName)_error.log
/path/to/project/$(appName)_access.log
/path/to/project/topic_$(nameOfTopic).log

Flume was replaced with Filebeat, which adds the appropriate tag based on the file name and routes each log to a specific Kafka topic, achieving fully unattended collection.
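A minimal filebeat.yml sketch of this kind of tag-and-route setup might look like the following; the paths, Kafka hosts, and topic names are placeholders for illustration, not Dada's actual configuration:

```yaml
filebeat.inputs:
  # Application logs: info and error files share one source tag.
  - type: log
    paths:
      - /path/to/project/*_info.log
      - /path/to/project/*_error.log
    fields:
      source: appLog
    fields_under_root: true
  # Access logs get their own tag and, downstream, their own parser.
  - type: log
    paths:
      - /path/to/project/*_access.log
    fields:
      source: access
    fields_under_root: true

output.kafka:
  hosts: ["kafka-broker-1:9092"]   # placeholder broker address
  # Route each event to a Kafka topic derived from its source tag.
  topic: 'logs-%{[source]}'
```

With fields_under_root, the source tag lands at the top level of each event, so downstream consumers can dispatch on it without digging into nested fields.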

Log Formatting and Parsing

Logstash was replaced by a Storm‑based parsing module written in Java. In Storm, a spout reads from Kafka and bolts process the data. The core logParserBolt applies different parsing logic (JSON, regex, etc.) based on the log's source tag, then forwards formatted logs to sendESBolt for Elasticsearch storage.

Example logs from a service named “bill”:

bill_info.log
bill_error.log
bill_access.log

Filebeat tags info and error logs with source: appLog, while access logs receive source: access. This directs them to different parsers and Elasticsearch indices.

Initially each Kafka topic had its own Storm topology, but this wasted resources. The new design groups multiple topics into a single topology, using the source tag to select the appropriate parser, improving CPU utilization.
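The article does not show the parser code itself; a minimal Go sketch of this tag-based dispatch (in the spirit of the Go rewrite mentioned later) might look like the following, with the log formats, field names, and regex invented for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// Parsed is the normalized document forwarded to Elasticsearch.
type Parsed map[string]string

// parseAppLog handles JSON-formatted application logs (source: appLog).
func parseAppLog(line string) (Parsed, error) {
	var doc Parsed
	if err := json.Unmarshal([]byte(line), &doc); err != nil {
		return nil, err
	}
	return doc, nil
}

// accessRe is a simplified pattern for access logs: "METHOD path status".
var accessRe = regexp.MustCompile(`^(\S+) (\S+) (\d{3})$`)

// parseAccess handles access logs (source: access) with a regex.
func parseAccess(line string) (Parsed, error) {
	m := accessRe.FindStringSubmatch(line)
	if m == nil {
		return nil, fmt.Errorf("unparseable access log: %q", line)
	}
	return Parsed{"method": m[1], "path": m[2], "status": m[3]}, nil
}

// parsers maps the Filebeat source tag to its parsing logic, so a
// single consumer can serve events from many Kafka topics.
var parsers = map[string]func(string) (Parsed, error){
	"appLog": parseAppLog,
	"access": parseAccess,
}

// parseEvent selects the parser from the event's source tag.
func parseEvent(source, line string) (Parsed, error) {
	p, ok := parsers[source]
	if !ok {
		return nil, fmt.Errorf("unknown source tag: %q", source)
	}
	return p(line)
}

func main() {
	doc, _ := parseEvent("appLog", `{"level":"info","msg":"order created"}`)
	fmt.Println(doc["msg"]) // order created

	doc, _ = parseEvent("access", "GET /v1/orders 200")
	fmt.Println(doc["status"]) // 200
}
```

The dispatch table is what makes the single-topology design work: adding a new log type means registering one more parser function, not deploying another topology.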

The Storm‑based parser increased parsing efficiency and reduced maintenance costs, handling 130 billion logs per day with a cluster that grew from 3×8C16G servers to 20×8C16G during the 2019 Double‑11 peak.

Elasticsearch Index Management

Logs are stored in daily indices, with about 120 new indices per day and roughly 300 active indices overall. Fixed mapping templates ensure consistent structure, and tags enable precise searches.
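A fixed mapping template of this kind can be expressed roughly as follows, registered once via Elasticsearch's index template API so every daily index inherits the same structure; the index pattern and fields here are illustrative, and the exact template syntax varies by Elasticsearch version:

```json
{
  "index_patterns": ["applog-*"],
  "settings": {
    "number_of_shards": 5
  },
  "mappings": {
    "properties": {
      "appName":   { "type": "keyword" },
      "source":    { "type": "keyword" },
      "level":     { "type": "keyword" },
      "timestamp": { "type": "date" },
      "message":   { "type": "text" }
    }
  }
}
```

Declaring tag fields such as appName and source as keyword is what makes the precise, filter-style searches mentioned above cheap and conflict-free.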

4. Elasticsearch on Kubernetes

Dada’s Elasticsearch cluster expanded from 5 nodes to 15 hot nodes (32C 64G 3T SSD) and 5 cold nodes (32C 96G 48T SATA), providing 300 TB total capacity with 7 TB added daily.

Hot nodes handle real‑time writes and queries over the most recent three days; older data is migrated to cold nodes.
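The article does not name the migration mechanism; one common way to implement this hot-to-cold movement is Elasticsearch shard-allocation filtering: tag hot and cold nodes with a node attribute (e.g. box_type), then after three days flip the index setting so its shards relocate to cold nodes. The index name and attribute values below are illustrative:

```json
PUT applog-2019.11.08/_settings
{
  "index.routing.allocation.require.box_type": "cold"
}
```

A nightly cron or curator job applying this setting to indices older than three days is enough to keep the hot tier small.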

Cold nodes showed CPU usage below 3 % except during migration. By deploying cold-node Elasticsearch as Kubernetes StatefulSets with host networking and reserving 4 CPUs and 56 GB of memory for Elasticsearch on each node, the remaining resources on each 32C 96G machine (28 CPUs and 40 GB) are free to run other workloads.
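A sketch of such a cold-node StatefulSet, assuming host networking and the 4 CPU / 56 GB reservation described above; the image tag, volume path, and names are assumptions, not Dada's actual manifest:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: es-cold
spec:
  serviceName: es-cold
  replicas: 5
  selector:
    matchLabels:
      app: es-cold
  template:
    metadata:
      labels:
        app: es-cold
    spec:
      hostNetwork: true          # ES binds directly to the node's IP
      containers:
        - name: elasticsearch
          image: elasticsearch:6.8.0   # placeholder version
          resources:
            # Capping ES at 4 CPU / 56Gi leaves ~28 CPU / 40Gi per
            # 32C 96G node for co-scheduled workloads.
            requests:
              cpu: "4"
              memory: 56Gi
            limits:
              cpu: "4"
              memory: 56Gi
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
      volumes:
        - name: data
          hostPath:
            path: /data/es       # placeholder path on the 48T SATA array
```

Because the limits are explicit, the Kubernetes scheduler can safely bin-pack the parsing workload onto the same machines without starving Elasticsearch.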

Go was used to rewrite the Java parsing logic, allowing the five cold nodes to replace the previous Storm parsing cluster. During traffic spikes, additional Kubernetes nodes can be provisioned in the public cloud.

5. Summary and Future Plans

The evolved logging system solves three core problems: automated collection, efficient parsing, and searchable storage.

Filebeat and Storm replaced Flume and Logstash, while the hot‑cold Elasticsearch architecture on Kubernetes improved storage capacity and CPU utilization.

However, growing Nginx and application access logs strain Elasticsearch. Dada is researching ClickHouse as a high‑performance alternative for log analytics to further extract value from the data.

Tags: big data, Elasticsearch, Kubernetes, Log Management, Storm, Filebeat
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
