How Vipshop’s Three‑Tier Monitoring System Keeps Services Running Smoothly
Vipshop’s three‑tier monitoring system—covering system, application (Mercury), and business layers—collects and analyzes logs from distributed components, providing real‑time metrics, slow‑call detection, error tracing, and configurable alerts to help engineers quickly pinpoint and resolve performance issues.
Vipshop’s distributed business system consists of many components such as web front‑ends, RPC services, caches, message queues and databases.
When a front‑end request reaches the back‑end, it passes through multiple components, leaving logs that are scattered and hard to use for troubleshooting.
A performance‑monitoring platform collects, aggregates and analyzes these logs to pinpoint problems quickly, enabling developers to fix issues promptly.
Three‑Tier Monitoring at Vipshop
Vipshop implements monitoring at three levels:
System level – server metrics (CPU, memory, disk, network traffic, TCP connections) and database metrics (QPS, replication lag, processes, slow queries).
Application level – the Mercury platform, an in‑house APM that injects probes into applications to monitor code, databases and caches in real time.
Business level – data is extracted either by embedding instrumentation in specific pages or by pulling relevant records from business databases, then stored in an operations‑maintained database for analysis and dashboard display.
Key Features of Mercury
Locate slow calls (slow Web/RESTful services, OSP services, slow SQL).
Detect errors (4XX, 5XX, etc.).
Identify exceptions and show dependency topology.
Trace call chains with end‑to‑end timing, context and exception information.
Raise application alerts based on configurable rules and report them to Vipshop’s central alert platform.
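The first two features can be illustrated with a minimal sketch. It is not Mercury's actual code: the `Call` record, the `kind` labels and the millisecond thresholds are hypothetical, chosen only to show how slow calls and 4XX/5XX errors might be picked out of collected call records.

```python
from dataclasses import dataclass


@dataclass
class Call:
    trace_id: str       # identifies the request this call belongs to
    name: str           # URL, service method, or SQL statement
    kind: str           # "web", "osp" or "sql" (hypothetical labels)
    duration_ms: float
    status: int = 200   # HTTP-style status code


# Hypothetical per-kind slow thresholds, in milliseconds.
SLOW_THRESHOLD_MS = {"web": 500.0, "osp": 200.0, "sql": 100.0}


def slow_calls(calls):
    """Return calls whose duration exceeds the threshold for their kind."""
    return [
        c for c in calls
        if c.duration_ms > SLOW_THRESHOLD_MS.get(c.kind, float("inf"))
    ]


def error_calls(calls):
    """Return calls that ended with a 4XX or 5XX status."""
    return [c for c in calls if 400 <= c.status <= 599]


calls = [
    Call("t1", "GET /cart", "web", 820.0),
    Call("t1", "CartService.get", "osp", 150.0),
    Call("t1", "SELECT ... FROM cart", "sql", 640.0),
    Call("t2", "GET /order", "web", 35.0, status=500),
]

for c in slow_calls(calls):
    print("slow:", c.trace_id, c.name, c.duration_ms)
for c in error_calls(calls):
    print("error:", c.trace_id, c.name, c.status)
```

Keeping the trace ID on every record is what lets a slow SQL statement be tied back to the web request that triggered it.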
Core Architecture
The platform processes logs through two main paths:
Raw logs (Trace/Exception) are published to Kafka; Flume consumes them and writes them directly to HBase.
Log data is also streamed through Kafka to Spark Streaming; Spark analyzes the stream, computes performance metrics and writes data points to OpenTSDB.
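The second path can be sketched in plain Python as a windowed aggregation. This is not Spark Streaming code and not the platform's real job: the record fields, the metric names `app.call.count` and `app.call.avg_ms`, and the 60-second window are assumptions. The output dictionaries use the `metric`, `timestamp`, `value` and `tags` fields that OpenTSDB data points carry.

```python
from collections import defaultdict


def to_data_points(records, window_s=60):
    """Group call records into fixed time windows and emit, per service
    and window, a call count and an average latency data point."""
    buckets = defaultdict(list)
    for r in records:
        window_start = r["ts"] - r["ts"] % window_s  # align to window boundary
        buckets[(r["service"], window_start)].append(r["duration_ms"])

    points = []
    for (service, window_start), durations in sorted(buckets.items()):
        tags = {"service": service}
        points.append({
            "metric": "app.call.count",      # hypothetical metric name
            "timestamp": window_start,
            "value": len(durations),
            "tags": tags,
        })
        points.append({
            "metric": "app.call.avg_ms",     # hypothetical metric name
            "timestamp": window_start,
            "value": sum(durations) / len(durations),
            "tags": tags,
        })
    return points


records = [
    {"service": "cart", "ts": 1699999985, "duration_ms": 100.0},
    {"service": "cart", "ts": 1700000010, "duration_ms": 300.0},
    {"service": "cart", "ts": 1700000045, "duration_ms": 50.0},
]
points = to_data_points(records)
for p in points:
    print(p)
```

In a streaming job the same grouping would run on each micro-batch rather than on a list held in memory.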
The most critical requirement of the pipeline is that consumers neither lose messages nor fall behind and let a backlog build up.
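One common way to watch for a backlog is to track consumer lag: for each Kafka partition, the latest offset in the log minus the offset the consumer group has committed. The sketch below works on plain dictionaries rather than a real Kafka client, and the offsets and the `max_total_lag` threshold are made-up values for illustration.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition: latest offset in the log minus the committed offset.
    A partition with no committed offset is treated as fully unread
    (an assumption made for this sketch)."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


def is_backlogged(lag_by_partition, max_total_lag):
    """True when the total number of unconsumed messages exceeds the limit."""
    return sum(lag_by_partition.values()) > max_total_lag


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 1480, 2: 1510}
committed = {0: 1500, 1: 1200, 2: 1490}

lag = partition_lag(end_offsets, committed)
print(lag, "backlogged:", is_backlogged(lag, max_total_lag=100))
```

On the loss side, committing an offset only after the corresponding data has been written downstream gives at-least-once delivery: a crashed consumer re-reads some messages instead of skipping them.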
When a metric matches a pre‑configured alert rule, the alert module triggers an action and reports the fault to the central alert platform in real time.
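Rule matching of this kind might look like the following sketch. The `AlertRule` fields, the `consecutive` setting and the `report` callback (standing in for the call to the central alert platform) are all hypothetical; the source does not describe the real rule format.

```python
import operator
from dataclasses import dataclass

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass
class AlertRule:
    metric: str
    op: str               # one of the keys in OPS
    threshold: float
    consecutive: int = 1  # breaching points in a row needed to fire


class AlertModule:
    def __init__(self, rules, report):
        self.rules = rules
        self.report = report  # callback standing in for the alert platform
        self.streak = {}      # (rule index, service) -> breaches in a row

    def on_point(self, metric, value, service):
        """Check one data point against every rule for its metric."""
        for i, rule in enumerate(self.rules):
            if rule.metric != metric:
                continue
            key = (i, service)
            if OPS[rule.op](value, rule.threshold):
                self.streak[key] = self.streak.get(key, 0) + 1
                # Fire once, when the streak first reaches the required length.
                if self.streak[key] == rule.consecutive:
                    self.report({"metric": metric, "service": service, "value": value})
            else:
                self.streak[key] = 0


reported = []
module = AlertModule(
    [AlertRule("app.call.avg_ms", ">", 500.0, consecutive=2)],
    report=reported.append,
)
for value in [600.0, 700.0, 800.0, 100.0, 900.0, 950.0]:
    module.on_point("app.call.avg_ms", value, service="cart")
print(reported)
```

Requiring several consecutive breaches is a design choice that trades a short delay for fewer alerts on single-point spikes.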
Java High-Performance Architecture
Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.