
How Vipshop’s Three‑Tier Monitoring System Keeps Services Running Smoothly

This article explains Vipshop’s multi‑layer monitoring architecture, detailing system‑level metrics, application‑level tracing with the Mercury platform, and business‑level KPI dashboards, while describing the data pipelines that collect, process, and alert on distributed logs to ensure reliable operations.

21CTO

Business systems are typically composed of many distributed components such as web services, RPC services, caches, message queues, and databases. When a front‑end request reaches the back‑end, it traverses multiple components, leaving scattered logs that make troubleshooting difficult.

Performance monitoring systems address this challenge by collecting, aggregating, and analyzing those logs, so that developers can pinpoint and fix issues quickly.

Vipshop’s three‑tier monitoring

1) System layer: monitors server metrics (CPU, memory, disk, traffic, TCP connections) and database metrics (QPS, replication lag, processes, slow queries).

2) Application layer (Mercury platform): an in‑house application performance monitoring solution that injects probes into applications to monitor code, databases, and caches in real time.

3) Business layer: extracts key business indicators (PV, UV, product displays, login/registration, conversion rate, cart, order count, payment, shipment, warehouse data), either by embedding instrumentation on pages or by pulling data from business databases. The processed results are stored in an operations‑maintained database that feeds dashboards and customizable alerts.
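As a rough illustration of the business layer, the sketch below derives one indicator from raw counters and checks it against a floor. The article does not say how Vipshop computes any KPI, so the field names, the definition of conversion rate as orders per unique visitor, and the `check_floor` helper are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KpiSnapshot:
    """Counters for one reporting window (field names are hypothetical)."""
    page_views: int        # PV
    unique_visitors: int   # UV
    orders: int


def conversion_rate(snapshot: KpiSnapshot) -> float:
    """Assumed definition: orders per unique visitor; 0.0 when UV is zero."""
    if snapshot.unique_visitors == 0:
        return 0.0
    return snapshot.orders / snapshot.unique_visitors


def check_floor(name: str, value: float, floor: float) -> Optional[str]:
    """Return an alert message when value drops below floor, else None."""
    if value < floor:
        return f"{name} {value:.4f} is below floor {floor:.4f}"
    return None


if __name__ == "__main__":
    window = KpiSnapshot(page_views=1000, unique_visitors=400, orders=20)
    rate = conversion_rate(window)
    print(rate, check_floor("conversion_rate", rate, floor=0.08))
```

In a real deployment the snapshot would come from page instrumentation or a business database, and the result would be written to the operations database rather than printed.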

Mercury’s main functions

Locate slow calls (slow Web/RESTful services, OSP services, slow SQL).

Locate errors (4XX, 5XX, etc.).

Locate exceptions (exception dependency and topology).

Trace call chains, showing end‑to‑end calls with context, exception logs, and per‑call latency.

Raise application alerts based on predefined rules and report them to Vipshop’s central alert platform.
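The functions listed above can be illustrated with a small sketch that works on already-collected span records: it rebuilds a call chain from parent links, then picks out slow calls and error responses. Mercury's actual data model is not described in the article, so the `Span` fields, the latency threshold, and the sample service names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One call in a trace (hypothetical shape, not Mercury's schema)."""
    span_id: str
    parent_id: Optional[str]   # None marks the entry call
    name: str
    duration_ms: float
    status: int = 200


def render_chain(spans):
    """Return the call chain as indented lines, parents before children."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            indent = "  " * depth
            lines.append(f"{indent}{span.name} {span.duration_ms:.0f}ms status={span.status}")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


def slow_calls(spans, threshold_ms):
    """Spans at or above the threshold, slowest first."""
    slow = [s for s in spans if s.duration_ms >= threshold_ms]
    return sorted(slow, key=lambda s: s.duration_ms, reverse=True)


def error_calls(spans):
    """Spans whose status is a 4XX or 5XX code."""
    return [s for s in spans if s.status >= 400]


if __name__ == "__main__":
    trace = [
        Span("a", None, "GET /cart", 320),
        Span("b", "a", "cart-service.query", 290),
        Span("c", "b", "SELECT cart_items", 250),
        Span("d", "a", "cache.get", 3, status=500),
    ]
    print("\n".join(render_chain(trace)))
    print([s.name for s in slow_calls(trace, 200)])
    print([s.name for s in error_calls(trace)])
```

Keeping per-call latency on each span is what lets one record set answer both questions: where the time went and which call failed.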

Core architecture

Two data paths are used:

Raw logs (Trace/Exception) are sent through Kafka, then Flume, and finally persisted to HBase.

Log information is sent through Kafka directly to Spark Streaming, where it is analyzed and transformed into performance metrics written to OpenTSDB.
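To make the second path more concrete, here is a minimal sketch of the kind of windowed aggregation a streaming job might perform, written in plain Python as a stand-in for Spark Streaming rather than with its API. The record fields, the metric names, and the 60-second window are assumptions; the output dictionaries follow the metric/timestamp/value/tags shape that OpenTSDB data points use.

```python
from collections import defaultdict


def to_metric_points(records, window_s=60):
    """Group call records into fixed windows and emit count and average latency.

    Each record is a dict with hypothetical keys: service, ts (epoch seconds),
    and latency_ms.
    """
    buckets = defaultdict(list)
    for rec in records:
        window_start = rec["ts"] - rec["ts"] % window_s
        buckets[(rec["service"], window_start)].append(rec["latency_ms"])

    points = []
    for (service, window_start), latencies in sorted(buckets.items()):
        tags = {"service": service}
        points.append({
            "metric": "app.call.count",          # hypothetical metric name
            "timestamp": window_start,
            "value": len(latencies),
            "tags": tags,
        })
        points.append({
            "metric": "app.call.latency.avg",    # hypothetical metric name
            "timestamp": window_start,
            "value": sum(latencies) / len(latencies),
            "tags": tags,
        })
    return points


if __name__ == "__main__":
    sample = [
        {"service": "cart", "ts": 120, "latency_ms": 10.0},
        {"service": "cart", "ts": 150, "latency_ms": 30.0},
        {"service": "cart", "ts": 185, "latency_ms": 50.0},
    ]
    for point in to_metric_points(sample):
        print(point)
```

The raw-log path needs no such reduction because it persists each Trace or Exception record as-is; only the metrics path collapses many records into a few time-series points.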

The most critical requirement for the pipeline is that consumers neither lose messages nor let them back up. When the alert rules configured by operations staff are satisfied, the alert module fires and reports the fault to the central alert platform in real time.
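The two concerns in this paragraph, rule-based alerting and consumer backlog, can be sketched as follows. The article does not describe the rule format or how backlog is measured, so `AlertRule`, the metric names, and the offset-based lag calculation are assumptions for illustration only.

```python
import operator
from dataclasses import dataclass

# Comparison operators a rule may use.
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class AlertRule:
    """A threshold rule on one metric (hypothetical format)."""
    metric: str
    op: str
    threshold: float


def evaluate(rules, latest):
    """Return a message for every rule satisfied by the latest metric values."""
    fired = []
    for rule in rules:
        value = latest.get(rule.metric)
        if value is not None and _OPS[rule.op](value, rule.threshold):
            fired.append(f"{rule.metric}={value} {rule.op} {rule.threshold}")
    return fired


def consumer_lag(end_offsets, committed_offsets):
    """Per-partition backlog: newest offset minus the last committed offset.

    A partition with no committed offset is treated as starting from 0.
    """
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


if __name__ == "__main__":
    rules = [
        AlertRule("app.call.latency.avg", ">", 200.0),
        AlertRule("app.error.rate", ">=", 0.05),
    ]
    latest = {"app.call.latency.avg": 250.0, "app.error.rate": 0.01}
    print(evaluate(rules, latest))
    print(consumer_lag({0: 100, 1: 50}, {0: 90}))
```

Lag computed this way can itself be fed back in as a metric, so a growing backlog triggers the same alert path as any application fault.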
