Designing a Real-Time Stream Computing Platform for E‑commerce Peak Traffic at Yihaodian
The article describes how Yihaodian built a low‑latency, highly available, and easily scalable streaming computation platform using Storm, Kafka, Linux containers and a custom CGroup management framework to handle massive e‑commerce traffic spikes and real‑time analytics.
During major e‑commerce events such as JD 618, Yihaodian 711 and Double 11, traffic surges can cause minute‑level failures that lead to significant revenue loss, making low latency, high availability, and easy scalability essential for peak‑traffic systems.
While Hadoop excels at batch processing, many e‑commerce services (search, recommendation, monitoring, anti‑fraud) require real‑time processing at sub‑second latency, prompting Yihaodian to adopt the Storm stream‑processing framework.
Yihaodian Stream Computing Solution
Yihaodian selected Storm as the core of its distributed stream‑computing platform. The overall data‑flow is illustrated in Figure 1.
Figure 1 Yihaodian Distributed Stream Computing System
Yihaodian developed a custom data‑recording component called Tracker, which works with Flume to collect website logs efficiently and support horizontal scaling. Kafka is used as a front‑end message buffer to minimize data loss and satisfy various parallelism and ordering requirements. Storm‑processed results are either persisted or discarded according to business needs.
To further improve stability, Yihaodian leverages Linux container technology (cgroups) to isolate CPU, memory, block‑device I/O and network resources at the process level, preventing any single business process from monopolizing system resources and causing node slowdown or failure.
Because Storm does not natively support cgroup isolation, Yihaodian designed its own CGroup resource‑management framework, shown in Figure 2, to enforce per‑Topology resource limits.
Figure 2 Resource Management Framework on Yihaodian Storm Cluster
Simplified User Experience
Users only need to perform three operations: create/delete a CGroup (supported types: cpu, memory, cpuset), assign a process to a specific CGroup, and set Topology priority via the Storm UI (priority reflects the amount of resources the process group can obtain). These actions are executed through a custom client command ycgClient, while a daemon ycgManager automatically manages per‑node process‑level priorities.
Automation Reduces Manual Management Cost
Topology priority information is stored in a ZooKeeper cluster.
The resource‑management framework adapts to dynamic addition of heterogeneous machines.
Cluster configuration (logical CPU count, memory, swap size) is automatically stored in ZooKeeper, simplifying overall cluster management.
Worker/Executor‑Level Resource Limiting
This capability is not available in the standard Storm‑on‑YARN implementation.
Modular Design for Easy Deployment and Rollback
The framework is independent of the underlying compute engine, allowing straightforward installation, deployment, and rollback, and can be adapted to other platforms with minor modifications.
To address shortcomings of the original Storm UI (command‑line‑only submission, lack of operation logs, missing permission control), Yihaodian re‑implemented the UI in a non‑Clojure language, adding user management, permission settings, operation logging, and plans for finer‑grained real‑time monitoring.
Since early 2014, Yihaodian’s Storm platform processes over 200 million log entries per day, supporting real‑time traffic monitoring, personalized recommendation, BI, security‑log analysis, fraud detection, and order monitoring with sub‑second response times.
Conclusion
As a rapidly growing mid‑size e‑commerce company, Yihaodian recognizes increasing pressure on its real‑time computing platform. Ongoing efforts focus on performance and reliability improvements, continuous feedback collection, and further research and development to meet future peak‑traffic challenges.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
