
Building a Cost‑Effective Real‑Time Stream Processing Platform with Storm

This article details how the e‑commerce company 1号店 selected the Storm framework to create a low‑cost, highly available, and easily scalable distributed stream‑processing system, covering architecture design, resource isolation with CGroup, custom UI improvements, and operational lessons for handling massive traffic spikes.

21CTO

Stream Computing Solution

1号店 combined its business needs with a focus on cost reduction and ultimately adopted the Storm computing framework to implement its distributed stream processing platform. The overall real‑time data processing flow is illustrated in Figure 1.

Figure 1: Distributed Stream Computing System

Tracker, a proprietary data recording solution, works with Flume to form the website data collection module, ensuring efficient and stable log recording while supporting horizontal scaling. Kafka is used as the front‑end message buffer to minimize data loss and satisfy various business requirements for parallelism and ordering. Storm‑processed results are persisted or discarded according to business needs.
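
The trade‑off between parallelism and ordering mentioned above can be illustrated with a small, self‑contained sketch. It does not use a Kafka client; it only simulates the common idea of routing each message to a partition by hashing its key, so that events with the same key stay in arrival order while different keys can be consumed in parallel. The partition count, the key names, and the use of CRC32 as the hash are assumptions made for illustration; real Kafka clients use their own partitioning logic.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count, chosen for the example


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32 is an assumption here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events):
    """Append each (key, payload) event to its partition's log in arrival order."""
    partitions = defaultdict(list)
    for key, payload in events:
        partitions[partition_for(key)].append((key, payload))
    return partitions


# Hypothetical click-stream events keyed by user.
events = [
    ("user-1", "view"),
    ("user-2", "view"),
    ("user-1", "add_to_cart"),
    ("user-2", "search"),
    ("user-1", "checkout"),
]
partitions = route(events)

# All events for one key land in one partition, in their original order.
user1_log = [p for k, p in partitions[partition_for("user-1")] if k == "user-1"]
```

Workloads that need strict per‑key ordering would read each partition with a single consumer, while workloads that only need throughput can spread keys over more partitions.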

To further guarantee stability, fault tolerance alone is not enough; the platform must also mitigate the risk of overload. Linux control groups (CGroup), the kernel mechanism that underlies Linux containers, provide fine‑grained resource isolation (CPU, memory, block I/O, network) down to the process level, preventing any single business process from monopolizing system resources.

Storm itself does not natively support CGroup isolation. While Storm on YARN offers cluster‑level resource isolation, the requirement here is to limit resources at the topology level, i.e., per process. Consequently, a custom CGroup resource management framework was designed, as shown in Figure 2.

Figure 2: Resource Management Framework on Storm Cluster

Simpler User Experience

Users do not need to master the complexities of CGroup (hierarchies, subsystems, groups, OS and filesystem knowledge). They only need to perform three operations:

Create/delete a CGroup of a supported type (cpu, memory, cpuset).

Assign processes to a specific CGroup.

Use the redesigned client command (ycgClient) to execute the above.
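
For readers unfamiliar with what these operations hide, the following is a minimal sketch of the equivalent steps on a cgroup v1 filesystem, where a group is a directory under a subsystem mount and a process is attached by writing its PID to the group's `cgroup.procs` file. This is not the ycgClient implementation, whose internals the article does not describe; the function names and the group name are hypothetical, and the demo runs against a temporary directory instead of the real `/sys/fs/cgroup` mount so that it needs no root privileges.

```python
import os
import tempfile

SUPPORTED = ("cpu", "memory", "cpuset")  # subsystem types named in the article


def group_path(root, subsystem, name):
    """Build the directory path of a group, rejecting unsupported subsystems."""
    if subsystem not in SUPPORTED:
        raise ValueError(f"unsupported subsystem: {subsystem}")
    return os.path.join(root, subsystem, name)


def create_group(root, subsystem, name):
    """Create a group; on a real cgroup v1 mount, mkdir is what creates it."""
    path = group_path(root, subsystem, name)
    os.makedirs(path, exist_ok=True)
    return path


def delete_group(root, subsystem, name):
    """Delete a group; rmdir succeeds only when the directory can be removed."""
    os.rmdir(group_path(root, subsystem, name))


def assign_process(root, subsystem, name, pid):
    """Attach a process by writing its PID to the group's cgroup.procs file."""
    procs_file = os.path.join(group_path(root, subsystem, name), "cgroup.procs")
    with open(procs_file, "w") as f:
        f.write(str(pid))


# Demo against a temporary directory standing in for /sys/fs/cgroup.
demo_root = tempfile.mkdtemp()
create_group(demo_root, "cpu", "topology-a")  # hypothetical group name
assign_process(demo_root, "cpu", "topology-a", os.getpid())
```

On a real system the root would be the cgroup mount point and the calls would need sufficient privileges. A cpuset group additionally needs its `cpuset.cpus` and `cpuset.mems` files populated before any process can be attached.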

Additionally, users set the priority of a Storm topology via the Storm UI, where priority reflects the amount of resources the process group can obtain. A daemon (ycgManager) automatically manages per‑node process‑level priorities.
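
One plausible way to turn a topology priority into a resource share is the cgroup v1 `cpu.shares` weight, which defaults to 1024 and is relative: when CPUs are contended, each group receives CPU time in proportion to its weight. The sketch below shows that arithmetic only. The priority names, the weights assigned to them, and the topology names are assumptions for illustration; the article does not say how ycgManager maps priorities to limits.

```python
DEFAULT_SHARES = 1024  # cgroup v1 default value of cpu.shares

# Hypothetical mapping from a UI priority level to a cpu.shares weight.
PRIORITY_TO_SHARES = {
    "low": DEFAULT_SHARES // 2,
    "normal": DEFAULT_SHARES,
    "high": DEFAULT_SHARES * 2,
}


def cpu_fractions(topology_priorities):
    """Return each topology's share of CPU time when all groups are busy."""
    shares = {
        name: PRIORITY_TO_SHARES[priority]
        for name, priority in topology_priorities.items()
    }
    total = sum(shares.values())
    return {name: weight / total for name, weight in shares.items()}


# Hypothetical topologies competing for CPU on one node.
fractions = cpu_fractions(
    {"recommendation": "high", "monitoring": "normal", "reporting": "normal"}
)
```

Because the weights are relative, a high‑priority topology is not capped when the node is idle; the proportions only take effect under contention.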

Automation Reduces Manual Management Cost

Topology priority information is stored in a ZooKeeper cluster. The resource management framework adapts to dynamic addition of heterogeneous nodes. Cluster configuration (logical CPU count, memory, swap size) is automatically stored in ZooKeeper, simplifying management.

The resource management framework can therefore enforce resource limits at the Worker/Executor level, something Storm on YARN cannot achieve.
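
As a rough illustration of the node configuration that the framework keeps in ZooKeeper, the sketch below builds a record with the three fields the article lists (logical CPU count, memory, swap size) from text in the format of Linux's `/proc/meminfo`, and serializes it as JSON. The znode path layout, the field names, the host name, and the sample figures are all hypothetical; writing the payload with a ZooKeeper client is left out so the example stays self‑contained.

```python
import json

# Sample text in /proc/meminfo format; the numbers are made up for the example.
SAMPLE_MEMINFO = """\
MemTotal:       16384000 kB
MemFree:         1234567 kB
SwapTotal:       4194304 kB
SwapFree:        4194304 kB
"""


def parse_meminfo(text):
    """Parse 'Key:   value kB' lines into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def node_config(logical_cpus, meminfo_text):
    """Collect the per-node settings named in the article."""
    mem = parse_meminfo(meminfo_text)
    return {
        "logical_cpus": logical_cpus,
        "memory_kb": mem["MemTotal"],
        "swap_kb": mem["SwapTotal"],
    }


def znode_payload(hostname, config):
    """Return a (path, bytes) pair; the path layout is an assumption."""
    path = f"/ycg/nodes/{hostname}"
    data = json.dumps(config, sort_keys=True).encode("utf-8")
    return path, data


path, data = znode_payload("storm-node-01", node_config(8, SAMPLE_MEMINFO))
```

On a live node, the CPU count could come from `os.cpu_count()` and the memory figures from reading `/proc/meminfo`, which is what lets a newly added heterogeneous node describe itself without manual configuration.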

Module Independent of Computing Framework for Easy Deployment and Rollback

Because the resource management module is independent of the computing framework, reasonable improvements can be readily applied to other platforms. To address the shortcomings of the original Storm UI (topologies could be submitted only from the command line, there were no operation logs, and there was no user permission control), a new UI was implemented in a language other than Clojure, reducing learning and maintenance costs.

The redesigned UI lets administrators manage users and permissions and review topology operation logs, so that changes to a topology can be traced. Future plans include finer‑grained real‑time monitoring of topologies.

Since deploying on Storm, 1号店 has seen clear benefits: in the first half of 2014, daily logs exceeded 200 million entries; real‑time monitoring, personalized recommendation, and BI applications achieve sub‑second response times. Additional workloads such as security log analysis, fraud detection, and order monitoring also run successfully.

Conclusion

As a rapidly growing mid‑size e‑commerce platform, 1号店 recognizes that its real‑time computing platform will face increasing pressure. While maintaining current system stability, the team continuously gathers feedback, conducts research, and performs iterative development to improve peak performance and reliability.
