
Log Platform Architecture and Scaling Lessons from Vipshop's 419 Promotion

This article presents a detailed case study of Vipshop's log platform during the 419 sales event, analyzing the 2013 architecture, bottlenecks in RabbitMQ and Storm, and the subsequent redesign using Kafka, Impala, and HBase to achieve scalable, reliable big‑data processing.

Architecture Digest

Vipshop's biggest annual promotion, the 419 flash sale on April 19, generates massive user traffic that creates peak loads for its log platform. The author, a data platform engineer at Vipshop, shares the challenges faced and the solutions applied.

2013 Architecture

The 2013 log pipeline used Flume for collection, RabbitMQ as the message broker, Storm and Redis for real‑time computation, and MySQL to back data presentation. During the sale, the system suffered severe latency spikes, and the delays snowballed until the whole cluster collapsed.

The post‑mortem identified RabbitMQ and Storm as the primary bottlenecks. RabbitMQ could handle only about 12,000 msgs/s per node, far below the required 150,000 msgs/s, while Storm’s processing latency grew as load increased.
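To make the gap concrete, here is a minimal sketch of the node-count arithmetic using the two figures above. It assumes throughput scales linearly with node count, which the article does not claim; the `headroom` parameter is a hypothetical safety margin, not something from the source.

```python
import math


def brokers_needed(target_msgs_per_s: int, per_node_msgs_per_s: int,
                   headroom: float = 0.0) -> int:
    """Smallest node count whose combined throughput covers the target.

    Assumes perfectly linear scaling (an idealization). `headroom` is a
    hypothetical safety margin: 0.3 means "plan for 30% above the target".
    """
    if per_node_msgs_per_s <= 0:
        raise ValueError("per-node throughput must be positive")
    return math.ceil(target_msgs_per_s * (1 + headroom) / per_node_msgs_per_s)


# Figures from the post-mortem: 150,000 msgs/s required, ~12,000 per node.
print(brokers_needed(150_000, 12_000))
# Same target with an assumed 30% margin.
print(brokers_needed(150_000, 12_000, headroom=0.3))
```

With no margin this gives 13 nodes running flat out; any realistic margin, or any overhead from clustering, pushes the count higher.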

Storm Computation Details

Storm calculated PV/UV (page views and unique visitors) from user logs and aggregated Nginx metrics (per‑domain traffic, response time, error codes) using Redis counters (incr, incrby), with crontab jobs periodically truncating the accumulated data. The logic was simple: split each log line into fields, update the matching Redis keys, and derive the metrics from those counters.
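The split-and-count logic can be sketched as follows. This is an illustration under stated assumptions, not the original topology: `FakeRedis` is an in-memory stand-in so the example runs without a server, the tab-separated field layout and key names are hypothetical, and counting unique visitors with a set (SADD/SCARD) is an assumption, since the article only mentions incr and incrby.

```python
from collections import defaultdict


class FakeRedis:
    """In-memory stand-in for the few Redis commands used below."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.sets = defaultdict(set)

    def incr(self, key):
        return self.incrby(key, 1)

    def incrby(self, key, amount):
        self.counters[key] += amount
        return self.counters[key]

    def sadd(self, key, member):
        before = len(self.sets[key])
        self.sets[key].add(member)
        return len(self.sets[key]) - before  # 1 if newly added, else 0

    def scard(self, key):
        return len(self.sets[key])


def process_line(store, line):
    """Split one log line and update the per-minute counters."""
    # Hypothetical layout: minute, user_id, domain, status, bytes_sent
    minute, user_id, domain, status, bytes_sent = line.rstrip("\n").split("\t")
    store.incr(f"pv:{minute}")                      # page views
    store.sadd(f"uv:{minute}", user_id)             # unique visitors (assumed)
    store.incrby(f"traffic:{domain}:{minute}", int(bytes_sent))
    if status.startswith(("4", "5")):               # client/server errors
        store.incr(f"errors:{domain}:{status}:{minute}")


store = FakeRedis()
process_line(store, "12:00\tu1\texample.com\t200\t512")
process_line(store, "12:00\tu1\texample.com\t502\t100")
print(store.counters["pv:12:00"], store.scard("uv:12:00"))
```

Because every key carries the minute, old keys can be dropped on a schedule, which is roughly the role the crontab truncation played.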

To isolate the broker’s performance, a Python script produced and consumed messages directly, bypassing the rest of the pipeline. It showed that each RabbitMQ node could sustain only about 10,000 msgs/s, with CPU load already high at that rate. Enabling Erlang HiPE raised throughput by about 20% (to roughly 12,000 msgs/s), which was still insufficient: meeting the target would have required an impractical number of servers.
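The original script is not shown, but a produce-and-consume throughput test of this kind might look like the sketch below. The `publish` and `consume` callables are hypothetical hooks; here they are wired to an in-process `queue.Queue` so the example runs anywhere, and the number it prints says nothing about RabbitMQ itself.

```python
import queue
import threading
import time


def measure_throughput(publish, consume, n_messages, payload=b"x" * 200):
    """Push n_messages through a broker and return messages per second.

    `publish(payload)` and `consume()` are supplied by the caller; both are
    expected to block until the message has been handed over.
    """
    def producer():
        for _ in range(n_messages):
            publish(payload)

    start = time.perf_counter()
    worker = threading.Thread(target=producer)
    worker.start()
    received = 0
    while received < n_messages:
        consume()
        received += 1
    worker.join()
    elapsed = time.perf_counter() - start
    return received / elapsed


# Stand-in broker: a bounded in-process queue (assumption for illustration).
broker = queue.Queue(maxsize=10_000)
rate = measure_throughput(broker.put, broker.get, 50_000)
print(f"{rate:,.0f} msgs/s through the in-process stand-in")
```

To test a real broker, the two callables would be replaced with that broker's client calls, and the payload size set to match typical log lines.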

ElasticSearch, used for full‑text search, also became a bottleneck for Nginx logs and was eventually abandoned in favor of a Hive‑based query interface.

Architectural Changes

• Replace RabbitMQ with Kafka: Kafka is designed for high‑throughput log transport and handled tens of thousands of msgs/s per node with minimal CPU impact. It also distributed load evenly across nodes, removing the broker bottleneck.

• Replace Storm with Impala for Nginx logs: Impala, which queries data stored in HDFS, offers minute‑level latency. That is sufficient for the required 2‑minute aggregation and puts far less pressure on hardware than millisecond‑level Storm processing.

• Discard ElasticSearch: Because of performance and maintenance constraints, ES was removed. A Hive front‑end now serves log queries, at the cost of a few minutes of latency.
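The 2‑minute Nginx aggregation mentioned above amounts to bucketing records by time window and domain. In production this would be an Impala query over HDFS data; the plain-Python sketch below only illustrates the bucketing, and the record layout and the "status >= 500 counts as an error" rule are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 120  # the 2-minute window from the text


def aggregate(records):
    """Group (epoch_seconds, domain, response_ms, status) records into
    2-minute windows per domain and compute simple metrics."""
    buckets = defaultdict(lambda: {"requests": 0, "total_ms": 0.0, "errors": 0})
    for ts, domain, response_ms, status in records:
        window_start = ts - ts % WINDOW_SECONDS  # floor to window boundary
        bucket = buckets[(window_start, domain)]
        bucket["requests"] += 1
        bucket["total_ms"] += response_ms
        if status >= 500:  # assumed definition of an error
            bucket["errors"] += 1
    return {
        key: {
            "requests": b["requests"],
            "avg_ms": b["total_ms"] / b["requests"],
            "errors": b["errors"],
        }
        for key, b in buckets.items()
    }


sample = [
    (1000, "a.example", 100, 200),
    (1079, "a.example", 300, 500),
    (1080, "a.example", 50, 200),  # falls into the next window
]
for key, metrics in sorted(aggregate(sample).items()):
    print(key, metrics)
```

Since results are only needed once per window, a batch-style query every couple of minutes does the same job as a streaming topology that updates counters per message.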

2014 Improvements

After scaling Kafka and adding a few Storm nodes, the 2014 promotion ran smoothly. The updated architecture eliminated ES, used Impala for Nginx metrics, and relied on Kafka for log transport.

Pre‑Promotion Preparations

Key steps include adding system‑level and application‑level monitoring (CPU, memory, disk I/O, Kafka throughput, Storm latency), estimating total load, modeling how cluster capacity scales with node count (the article mentions a quadratic model for this calculation), and running stress rehearsals that take part of the cluster offline so the remaining nodes carry a proportionally higher load.
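One way to read the rehearsal step: if load is spread evenly, taking nodes offline raises the load on each remaining node, so everyday traffic can stand in for promotion traffic. The sketch below works out how many nodes to leave online; the function names and the even-distribution assumption are illustrative, not from the article.

```python
import math


def per_node_load_factor(total_nodes: int, nodes_online: int) -> float:
    """How many times its usual share each online node carries,
    assuming traffic is spread evenly across online nodes."""
    if not 0 < nodes_online <= total_nodes:
        raise ValueError("nodes_online must be between 1 and total_nodes")
    return total_nodes / nodes_online


def nodes_to_keep(total_nodes: int, expected_peak_ratio: float) -> int:
    """Nodes to leave online so that normal traffic stresses each node
    at least as much as the expected peak would stress the full cluster.

    `expected_peak_ratio` is peak traffic divided by normal traffic.
    Rounding down errs on the side of a harsher rehearsal.
    """
    if expected_peak_ratio < 1:
        raise ValueError("peak ratio must be at least 1")
    return max(1, math.floor(total_nodes / expected_peak_ratio))


keep = nodes_to_keep(12, 3.0)
print(keep, per_node_load_factor(12, keep))
```

A rehearsal like this exercises per-node limits only; shared resources such as network links or downstream stores still see normal traffic and need separate testing.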

Future Directions

• Migrate from Storm to Spark Streaming for near‑real‑time processing, leveraging better resource management.

• Store raw logs in HBase and use SolrCloud (or an ES‑like engine) only for indexing, routing retrieval of full log lines to HBase to reduce pressure on the index.

• Expand HBase usage to replace MySQL for data presentation and raw‑log storage, as illustrated in the proposed future architecture diagram.
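The index-plus-raw-store split proposed above can be sketched with two in-memory structures. This is only an illustration of the routing idea: the dictionaries stand in for an HBase table and a search index, and the class, method, and row-key names are hypothetical.

```python
from collections import defaultdict


class LogStore:
    """Keeps full log lines in a key-value store and only row keys in the index."""

    def __init__(self):
        self.raw = {}                   # row key -> full line (HBase stand-in)
        self.index = defaultdict(set)   # term -> row keys (search index stand-in)

    def put(self, row_key, line, indexed_terms):
        """Store the full line once; index only the chosen terms."""
        self.raw[row_key] = line
        for term in indexed_terms:
            self.index[term].add(row_key)

    def search(self, term):
        """Resolve a term to row keys via the index, then fetch lines by key."""
        row_keys = sorted(self.index.get(term, ()))
        return [self.raw[key] for key in row_keys]


store = LogStore()
store.put("20140419-0001", "GET /cart 502 upstream timeout", ["502", "/cart"])
store.put("20140419-0002", "GET /home 200", ["200", "/home"])
print(store.search("502"))
```

The index stays small because it never holds the log text itself, which is the pressure reduction the bullet describes.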

In conclusion, handling peak traffic during large‑scale promotions provides valuable opportunities to test and improve system design, and the shared experiences aim to help practitioners facing similar challenges.
