How Dada Scaled Its Log System to 130 Billion Daily Entries with Kubernetes and Storm
This article details how Dada built a Kubernetes‑mixed log platform that handles over 130 billion logs per day, stores more than 14 TB daily, and maintains a 300 TB total volume by automating collection with Filebeat, parsing with Storm, and optimizing Elasticsearch with hot‑cold nodes.
Background
After the 2016 merger of Dada and JD Daojia, online traffic surged, exposing the limitations of the existing ELK‑based log system. The team needed a highly efficient, automated solution capable of handling massive log volumes and fast query requirements.
Historical Issues
Log collection was manual; every new log source required coordination between developers and operations to modify Flume configurations.
Logs lacked standardized parsing, relying on Logstash’s grok plugin, which became a performance bottleneck at scale.
Elasticsearch indices were poorly managed, with hundreds of ambiguous indices and frequent type‑conflict errors.
System Evolution
Automated Log Collection
The team first standardized log naming and formats across services, e.g.:
/path/to/project/$(appName)_info.log /path/to/project/$(appName)_error.log /path/to/project/$(appName)_access.log /path/to/project/topic_$(nameOfTopic).logFilebeat replaced Flume, adding tags based on file names and routing logs to appropriate Kafka topics automatically, eliminating manual configuration and reducing resource consumption.
Efficient Log Parsing with Storm
Logstash could no longer keep up, so the team adopted a Java‑based Storm topology. In Storm, spout components ingest data from Kafka, while bolt components process it. The core logParserBolt applies different parsing rules (JSON, regex, etc.) based on the log’s source tag, producing a structured logJson and determining the target Elasticsearch index ( esIndex).
Initially each Kafka topic had its own Storm topology, which wasted resources. The revised design consolidates multiple topics into a single topology, using a shared logParserBolt that selects the appropriate parser at runtime, dramatically improving server utilization.
Elasticsearch Index Management
Each log type is stored in a dedicated Elasticsearch index, searchable via tags. The system now creates roughly 120 new indices daily, maintaining about 300 active indices with fixed mapping templates, ensuring stable query performance.
Elasticsearch on Kubernetes (Hot‑Cold Nodes)
The Elasticsearch cluster grew from 5 nodes to a mixed deployment of 15 hot nodes (32 CPU, 64 GB RAM, 3 TB SSD) and 5 cold nodes (32 CPU, 96 GB RAM, 48 TB SATA). Hot nodes handle real‑time writes and recent‑three‑day queries; older data resides on cold nodes.
Cold nodes initially suffered low CPU utilization. By deploying cold nodes via Kubernetes StatefulSets and allocating a fixed 4‑core, 56 GB slice, the remaining resources (28 CPU, 40 GB) could be repurposed for other workloads.
After migrating log parsing logic from Storm (Java) to Go and moving it onto the cold nodes, CPU usage on cold nodes rose from ~3 % to ~40 % during peak hours, a ten‑fold increase.
Summary and Future Plans
The revamped system solved three core challenges: automated collection, high‑performance parsing, and efficient storage/search. By introducing Filebeat, Storm, and a hot‑cold Elasticsearch architecture on Kubernetes, Dada achieved daily processing of 130 billion logs and a 300 TB storage capacity.
Looking ahead, the team is evaluating ClickHouse as a replacement for Elasticsearch to further improve large‑scale log analytics and uncover deeper insights from ever‑growing Nginx and application access logs.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
IT Architects Alliance
Discussion and exchange on system, internet, large‑scale distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, AI, and architecture adjustments with internet technologies. Includes real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
