How Vipshop’s Three‑Tier Monitoring System Keeps Services Running Smoothly

Vipshop’s three‑tier monitoring system—covering system, application (Mercury), and business layers—collects and analyzes logs from distributed components, providing real‑time metrics, slow‑call detection, error tracing, and configurable alerts to help engineers quickly pinpoint and resolve performance issues.

Java High-Performance Architecture

Mar 16, 2016

Vipshop’s distributed business system consists of many components such as web front‑ends, RPC services, caches, message queues and databases.

When a front‑end request reaches the back‑end, it passes through multiple components, leaving logs that are scattered and hard to use for troubleshooting.

A performance‑monitoring platform collects, aggregates and analyzes these logs so that developers can pinpoint and fix problems quickly.

Three‑Tier Monitoring at Vipshop

Vipshop implements monitoring at three levels:

System level – server metrics (CPU, memory, disk, network traffic, TCP connections) and database metrics (QPS, replication lag, processes, slow queries).

Application level – the Mercury platform, an in‑house APM that injects probes into applications to monitor code, databases and caches in real time.

Business level – data is extracted either by embedding instrumentation in specific pages or by pulling relevant records from business databases, then stored in an operations‑maintained database for analysis and dashboard display.

Key Features of Mercury

Locate slow calls (slow Web/RESTful services, OSP services, slow SQL).

Detect errors (4XX, 5XX, etc.).

Identify exceptions and show dependency topology.

Trace call chains with end‑to‑end timing, context and exception information.

Raise application alerts based on configurable rules and report them to Vipshop’s central alert platform.
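The call-chain tracing and slow-call detection described above can be illustrated with a minimal Python sketch. Mercury's actual log schema and probe output are not described in this article, so the `Span` fields, the 500 ms threshold and the function names are all hypothetical, chosen only to show the idea: link the records of one request into a tree, then walk it to flag slow or failed calls.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Span:
    """One call inside a request. Hypothetical schema, not Mercury's real one."""
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float
    status: int = 200
    children: List["Span"] = field(default_factory=list)


def build_chain(spans: List[Span]) -> List[Span]:
    """Link the spans of one trace into a tree and return its root spans."""
    by_id = {s.span_id: s for s in spans}
    roots = []
    for s in spans:
        parent = by_id.get(s.parent_id) if s.parent_id else None
        if parent is None:
            roots.append(s)  # no known parent: treat as an entry point
        else:
            parent.children.append(s)
    return roots


def find_problems(root: Span, slow_ms: float = 500.0) -> List[Tuple[str, str, int]]:
    """Walk the chain depth-first; return (name, kind, depth) for bad calls."""
    problems = []
    stack = [(root, 0)]
    while stack:
        span, depth = stack.pop()
        if span.status >= 400:  # 4XX/5XX responses count as errors
            problems.append((span.name, "error", depth))
        elif span.duration_ms >= slow_ms:  # assumed slow-call threshold
            problems.append((span.name, "slow", depth))
        # Push children in reverse so they are visited in their original order.
        for child in reversed(span.children):
            stack.append((child, depth + 1))
    return problems
```

In this sketch a slow SQL statement appears at the deepest level of the chain, below the service and web calls that waited on it, which is what makes end-to-end timing useful for locating the real bottleneck.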

Core Architecture

The platform processes logs through two main paths:

Raw logs (Trace/Exception) are sent to Kafka, from which Flume writes them directly to HBase.

Log data is also streamed through Kafka to Spark Streaming; Spark analyzes the stream, computes performance metrics and writes data points to OpenTSDB.
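As a rough illustration of the second path, the sketch below shows the kind of windowed aggregation a streaming job might perform. It is plain Python, not Spark Streaming code, and it does not reproduce Mercury's real metrics: the record fields, the 60-second window and the metric names (`calls.count`, `calls.avg_ms`, `calls.error_rate`) are assumptions. The output mirrors the general shape of an OpenTSDB data point (metric, timestamp, value, tags).

```python
from collections import defaultdict


def aggregate(calls, window_s=60):
    """Group call records into fixed time windows and emit data points.

    Each record is a dict with hypothetical keys:
    service, ts (epoch seconds), duration_ms, status.
    """
    buckets = defaultdict(list)
    for c in calls:
        window_start = c["ts"] - c["ts"] % window_s  # align to window boundary
        buckets[(c["service"], window_start)].append(c)

    points = []
    for (service, window_start), group in sorted(buckets.items()):
        total = len(group)
        avg_ms = sum(c["duration_ms"] for c in group) / total
        errors = sum(1 for c in group if c["status"] >= 400)
        for metric, value in (
            ("calls.count", total),
            ("calls.avg_ms", avg_ms),
            ("calls.error_rate", errors / total),
        ):
            points.append({
                "metric": metric,
                "timestamp": window_start,
                "value": value,
                "tags": {"service": service},
            })
    return points
```

Pre-aggregating into per-window data points keeps the time-series store small and makes dashboards and alert checks cheap, while the raw Trace/Exception records remain available in HBase for drill-down.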

The most critical requirement for the pipeline is that consumers keep up with the incoming data: messages must be neither lost nor allowed to back up.

When a metric matches a pre‑configured alert rule, the alert module triggers an action and reports the fault to the central alert platform in real time.
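The rule matching described above might look roughly like the following sketch. The rule format, the severity levels and the `report` callback are hypothetical stand-ins; the article does not say how Mercury expresses rules or how it talks to the central alert platform.

```python
import operator
from dataclasses import dataclass

# Comparison operators a rule may use.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule: fire when `metric` compares true against `threshold`."""
    metric: str
    op: str
    threshold: float
    level: str = "warning"


def evaluate(rules, point, report):
    """Check one data point against all rules and report every match.

    `point` is a dict with metric, timestamp, value and tags.
    `report` is any callable that forwards an alert, for example a
    client for a central alert platform (assumed, not shown here).
    """
    fired = []
    for rule in rules:
        if rule.metric != point["metric"]:
            continue
        if OPS[rule.op](point["value"], rule.threshold):
            alert = {
                "level": rule.level,
                "metric": point["metric"],
                "value": point["value"],
                "threshold": rule.threshold,
                "timestamp": point["timestamp"],
                "tags": point["tags"],
            }
            report(alert)
            fired.append(alert)
    return fired
```

Passing the reporter in as a callable keeps rule evaluation separate from delivery, so the same check can be exercised in a test with `list.append` and wired to a real alert client in production.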
