How Vipshop Leverages Data Processing, Analytics, and Mining for Smarter Ops

This article summarizes Wu Xiaoguang's talk at Gdevops 2017, detailing how Vipshop integrates data processing, analysis, and mining technologies—such as Flume, Kafka, Spark, and custom scheduling—to improve operational decision‑making, performance monitoring, root‑cause analysis, and predictive modeling across its e‑commerce platform.

dbaplus Community

Jan 1, 2018

Data Processing Technology Application

The key challenges addressed are data accuracy and timeliness, real‑time computation over massive data volumes, real‑time monitoring and visualization of multidimensional data, and A/B testing support.

Data collection

Log data comes from both the client and server sides. Server logs are written locally and shipped to a Kafka cluster via Flume. Client logs are collected either through asynchronous requests to Nginx (web) or through API calls (app).

Database data is obtained either by computing metrics directly on a replica or by parsing binlogs, converting them to messages and sending them to Kafka for downstream consumption.

Data computation

Real‑time ETL and aggregation are performed with Spark. Metrics are defined so that they are additive across dimensions; non‑additive metrics (e.g., percentages) are recomputed from their additive components at presentation time. For simple metrics whose only dimension is time, a custom scheduler, Saturn, runs scripts on the replica. Detailed log queries are served by Elasticsearch.
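The additive-metric rule can be illustrated with a small sketch. The row layout, dimension names, and numbers below are hypothetical and not taken from the talk; the point is only that counts are stored and summed, while a ratio such as conversion rate is derived after the roll-up.

```python
from collections import defaultdict

# Hypothetical pre-aggregated rows: (platform, region, visits, orders).
# Only additive counts are stored; no ratios.
ROWS = [
    ("app", "north", 1000, 50),
    ("app", "south", 500, 10),
    ("web", "north", 400, 8),
    ("web", "south", 100, 2),
]


def roll_up(rows, dim_index):
    """Sum the additive metrics, grouping by the dimension at dim_index."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        key = row[dim_index]
        totals[key][0] += row[2]  # visits
        totals[key][1] += row[3]  # orders
    return {key: tuple(counts) for key, counts in totals.items()}


def conversion_rate(visits, orders):
    """Non-additive metric, derived only when results are presented."""
    return orders / visits if visits else 0.0


by_platform = roll_up(ROWS, 0)
rates = {key: conversion_rate(*counts) for key, counts in by_platform.items()}
```

Here the app conversion rate comes out as 60 / 1500 = 0.04. Averaging the two per-region rates instead (0.05 and 0.02) would give 0.035, which is why ratios are not stored and summed.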

Data storage & presentation

Aggregated results are stored in a ROLAP database for current‑day multidimensional data and in a MOLAP database for historical data. Redis caches second‑granularity data and configuration. An application server merges these sources to serve front‑end queries (≤3 s) and millisecond‑level alerting.

A/B testing implementation

Two approaches are used:

Client‑side switches in the APP, each managing a single experiment.

Backend A/B grouping service that guarantees orthogonal user distribution across concurrent experiments.
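One common way to keep concurrent experiments approximately orthogonal is to hash the user ID together with a per-experiment salt, so that a user's group in one experiment says nothing about their group in another. The sketch below assumes that approach; the talk does not describe how Vipshop's grouping service is implemented, and the function and experiment names are hypothetical.

```python
import hashlib


def assign_group(user_id, experiment, groups=("A", "B")):
    """Deterministically map a user to a group, salted by the experiment name."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # The salt changes the hash per experiment, so assignments in different
    # experiments are statistically independent of each other.
    return groups[int(digest, 16) % len(groups)]


# The same user always lands in the same group for a given experiment.
group_checkout = assign_group("user-42", "checkout_button")
group_search = assign_group("user-42", "search_ranking")
```

Note that hashing gives independence in the statistical sense rather than a strict guarantee; a production service would typically also verify the resulting group sizes.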

Data Analysis Technology Application

Performance analysis

Design of metrics and dimensions is coordinated with developers to confirm data collection, A/B testing plans, and statistical definitions. Example: a full‑site HTTPS upgrade was monitored for conversion rate, error codes, and network latency to ensure no degradation.

Root‑cause analysis

Process:

Identify the problematic metric.

Explore the data, either by simple sorting or by statistical anomaly detection that assumes a normal distribution.

Validate findings against business knowledge.

Typical cases include network connectivity drops linked to a faulty app version and CDN‑related image errors.
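The exploration step can be sketched with a z-score test, which is one simple form of anomaly detection under a normality assumption. The version numbers and failure rates below are made up for illustration, and the threshold of three standard deviations is an assumption, not a value from the talk.

```python
from statistics import mean, stdev

# Hypothetical connection-failure rate per app version.
FAILURE_RATES = {f"6.1.{i}": 0.010 + 0.001 * (i % 3) for i in range(19)}
FAILURE_RATES["6.2.0"] = 0.20  # the suspicious version


def find_outliers(values_by_key, threshold=3.0):
    """Return (key, z_score) pairs whose |z| is at least threshold, largest first."""
    values = list(values_by_key.values())
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    scored = [(key, (value - mu) / sigma) for key, value in values_by_key.items()]
    flagged = [(key, z) for key, z in scored if abs(z) >= threshold]
    return sorted(flagged, key=lambda item: abs(item[1]), reverse=True)


outliers = find_outliers(FAILURE_RATES)
```

With very few data points a single extreme value inflates the standard deviation and may not cross the threshold, so plain sorting remains a useful first look. Any flagged key still needs the validation against business knowledge described above.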

Data Mining Technology Application

Prediction

Time‑series forecasting uses the Holt‑Winters algorithm for metrics such as orders, PV, and UV. During promotions, additional features (historical values, related metrics) are fed into a machine‑learning model to improve accuracy. To reduce false alarms, an alarm triggers when cumulative predicted loss reaches a threshold (e.g., 100 orders).
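The cumulative-loss alarm can be sketched as follows. The predictions are taken as given (from Holt‑Winters or any other forecaster); the function name and the choice to reset the running total once the metric recovers are assumptions made for this example, not details from the talk.

```python
def first_alarm_index(actual, predicted, loss_threshold=100):
    """Return the first index where the cumulative shortfall reaches the
    threshold, or None if it never does."""
    cumulative = 0.0
    for i, (real, forecast) in enumerate(zip(actual, predicted)):
        shortfall = forecast - real
        if shortfall > 0:
            cumulative += shortfall
        else:
            cumulative = 0.0  # assumption: recovery resets the running loss
        if cumulative >= loss_threshold:
            return i
    return None


predicted = [50, 50, 50, 50, 50, 50]   # forecast orders per minute
sustained_drop = [50, 48, 20, 10, 5, 0]
brief_dips = [50, 45, 50, 45, 50, 45]

alarm_at = first_alarm_index(sustained_drop, predicted)
no_alarm = first_alarm_index(brief_dips, predicted)
```

The sustained drop accumulates shortfalls of 2, 30, 40, and 45 orders and crosses the 100-order threshold at index 4, while the brief dips never accumulate enough to alert.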

When a fault occurs, real values are replaced by predictions based on the previous week to avoid contaminating the model.
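A minimal sketch of this clean-up step is shown below. As a stand-in for "predictions based on the previous week", it simply copies the value from the same slot one season earlier; the real system presumably uses a proper forecast. The function name, the daily granularity, and the sample numbers are all hypothetical.

```python
def clean_history(series, fault_slots, season_length):
    """Return a copy of series in which fault slots are overwritten with the
    value from one season (e.g. one week) earlier."""
    cleaned = list(series)
    for i in sorted(fault_slots):
        if i - season_length >= 0:
            # Read from `cleaned`, so a fault one season earlier that was
            # already repaired does not leak back in.
            cleaned[i] = cleaned[i - season_length]
        # Slots with no earlier season are left unchanged.
    return cleaned


week_1 = [100, 110, 120, 130, 140, 150, 160]
week_2 = [105, 115, 0, 5, 145, 155, 165]  # outage on days 3 and 4
history = week_1 + week_2

cleaned = clean_history(history, fault_slots={9, 10}, season_length=7)
```

After cleaning, slots 9 and 10 hold 120 and 130 instead of the outage values, so the next round of model fitting does not learn the fault as normal behavior.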

Root‑cause mining

Workflow:

Sample error and normal logs.

Encode non‑numeric fields (e.g., one‑hot).

Compute feature importance using Spearman correlation and Mutual Information.

If the two measures agree, rank features directly; otherwise train a logistic regression model with L1 regularization.

When logistic regression is inconclusive, fall back to Random Forest or AdaBoost and use the model’s feature‑importance scores.
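The first three steps of the workflow above can be sketched in plain Python. The log fields, sample counts, and function names are hypothetical, and the measures are written out by hand for clarity; in practice a library such as SciPy or scikit-learn would normally be used instead. The model-based fallbacks are not shown.

```python
import math
from collections import Counter


def one_hot(records, field):
    """Expand a categorical field into one binary column per observed value."""
    values = sorted({r[field] for r in records})
    return {f"{field}={v}": [1 if r[field] == v else 0 for r in records]
            for v in values}


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


def score_features(records, fields, label):
    """Return {feature: (|spearman|, mutual_information)} against the label."""
    ys = [r[label] for r in records]
    return {name: (abs(spearman(col, ys)), mutual_information(col, ys))
            for field in fields
            for name, col in one_hot(records, field).items()}


# Hypothetical sample: errors cluster on version 6.1, independent of CDN.
LOGS = []
for version, error, count in [("6.1", 1, 9), ("6.1", 0, 1), ("6.0", 0, 9), ("6.0", 1, 1)]:
    for cdn in ("a", "b"):
        LOGS += [{"version": version, "cdn": cdn, "error": error}] * count

scores = score_features(LOGS, ["version", "cdn"], "error")
top_by_rho = max(scores, key=lambda k: scores[k][0])
top_by_mi = max(scores, key=lambda k: scores[k][1])
```

In this sample both measures point at the version columns and score the CDN columns at zero, which corresponds to the case where the results agree and features can be ranked directly.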

Application Ecosystem Construction and Planning

An independent analysis team is established to embed data‑driven decision making into the operations workflow, improving efficiency and knowledge transfer.

Data‑analysis platform: a data warehouse built on traditional big‑data technologies (Kafka, Flume, Spark, Elasticsearch, ROLAP/MOLAP, Redis).

Unified information platform: aggregates release information, ITIL events, CMDB data, and monitoring alerts into a central repository.

Both algorithmic and rule‑based approaches are combined: algorithms handle point‑level issues (e.g., anomaly detection, prediction), while a rule engine encodes business knowledge to resolve complex correlations.

Short‑term goals focus on effective alarm suppression and merging; long‑term goals aim for automated fault localization and remediation actions such as service restart, downgrade, rate‑limiting, disk management, and traffic scheduling.
