How Vipshop Leverages Data Processing, Analytics, and Mining for Smarter Ops
This article summarizes Wu Xiaoguang's talk at Gdevops 2017, detailing how Vipshop integrates data processing, analysis, and mining technologies—such as Flume, Kafka, Spark, and custom scheduling—to improve operational decision‑making, performance monitoring, root‑cause analysis, and predictive modeling across its e‑commerce platform.
Data Processing Technology Application
Key challenges addressed are data accuracy & timeliness, real‑time computation of massive data, real‑time monitoring and visualization of multidimensional data, and A/B testing support.
Data collection
Log data comes from both the client side and the server side. Server logs are written locally and shipped to a Kafka cluster via Flume. Client logs are collected either through asynchronous requests to Nginx (web) or through API calls (app).
Database data is obtained either by computing metrics directly on a replica or by parsing binlogs, converting them to messages and sending them to Kafka for downstream consumption.
Data computation
Real‑time ETL and aggregation are performed with Spark. Metrics are defined so that they are additive across dimensions; non‑additive metrics (e.g., percentages) are recomputed from the additive ones at presentation time. For simple metrics whose only dimension is time, the custom scheduler Saturn runs scripts against the replica. Detailed log queries are served by Elasticsearch.
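As a rough illustration of why metrics are kept additive, the sketch below sums counts along one dimension and derives a conversion rate only at presentation time. The rows, the dimension names, and the `roll_up` and `conversion_rate` helpers are hypothetical and are not taken from Vipshop's system.

```python
from collections import defaultdict

# Hypothetical pre-aggregated rows: (platform, region, orders, visitors).
# Counts are additive, so they can be summed along any dimension.
ROWS = [
    ("app", "north", 30, 1000),
    ("app", "south", 20, 500),
    ("web", "north", 10, 500),
    ("web", "south", 5, 500),
]


def roll_up(rows, dim_index):
    """Sum the additive counts (orders, visitors) grouped by one dimension."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        key = row[dim_index]
        totals[key][0] += row[2]
        totals[key][1] += row[3]
    return {key: tuple(counts) for key, counts in totals.items()}


def conversion_rate(orders, visitors):
    """Non-additive ratio, derived from additive counts at presentation time."""
    return orders / visitors if visitors else 0.0


by_platform = roll_up(ROWS, 0)
rates = {key: conversion_rate(*counts) for key, counts in by_platform.items()}
```

Averaging the two per‑region rates for the app (3% and 4%) would give 3.5%, whereas the rate recomputed from the summed counts is about 3.33%. That gap is the reason a ratio is derived late rather than stored and summed.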
Data storage & presentation
Aggregated results are stored in a ROLAP database for current‑day multidimensional data and a MOLAP database for historical data. Redis caches second‑level data and configuration. An application server merges these sources to satisfy front‑end queries (≤3 s) and millisecond‑level alerting.
A/B testing implementation
Two approaches are used:
Client‑side switches in the APP, each managing a single experiment.
Backend A/B grouping service that guarantees orthogonal user distribution across concurrent experiments.
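The summary does not say how the grouping service achieves orthogonality. One common technique, assumed here, is to hash the user ID together with a per‑experiment salt, so that a user's bucket in one experiment is statistically independent of the bucket in another. The function names and experiment names below are hypothetical.

```python
import hashlib


def assign_bucket(user_id: str, experiment: str, buckets: int = 2) -> int:
    """Deterministically map a user to a bucket, salted by experiment name."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def joint_counts(user_ids, exp_a, exp_b):
    """Count users in each (bucket in exp_a, bucket in exp_b) cell."""
    counts = {}
    for uid in user_ids:
        cell = (assign_bucket(uid, exp_a), assign_bucket(uid, exp_b))
        counts[cell] = counts.get(cell, 0) + 1
    return counts


users = [f"user-{i}" for i in range(10000)]
cells = joint_counts(users, "checkout_button", "search_ranking")
```

If the two experiments are close to orthogonal, each of the four cells in `cells` holds roughly a quarter of the users, so the effect of one experiment is spread evenly over both arms of the other.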
Data Analysis Technology Application
Performance analysis
Design of metrics and dimensions is coordinated with developers to confirm data collection, A/B testing plans, and statistical definitions. Example: a full‑site HTTPS upgrade was monitored for conversion rate, error codes, and network latency to ensure no degradation.
Root‑cause analysis
Process:
Identify the problematic metric.
Explore the data, either by simple sorting or by statistical anomaly detection that assumes a normal distribution.
Validate findings against business knowledge.
Typical cases include network connectivity drops linked to a faulty app version and CDN‑related image errors.
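The exploration step can be sketched as a z‑score check across the values of one dimension, which is one simple form of anomaly detection under a normality assumption. The error rates, version labels, and the threshold of 2.0 are made up for illustration.

```python
from statistics import mean, stdev


def zscore_outliers(values_by_key, threshold=2.0):
    """Return (key, z) pairs whose |z| meets the threshold, most extreme first."""
    values = list(values_by_key.values())
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all groups identical, nothing stands out
    scored = [(key, (value - mu) / sigma) for key, value in values_by_key.items()]
    flagged = [(key, z) for key, z in scored if abs(z) >= threshold]
    return sorted(flagged, key=lambda kz: -abs(kz[1]))


# Hypothetical connection-error rate per app version.
error_rate_by_version = {
    "5.8.0": 0.010, "5.8.1": 0.012, "5.9.0": 0.011, "5.9.1": 0.009,
    "6.0.0": 0.010, "6.0.1": 0.011, "6.0.2": 0.012, "6.1.0": 0.150,
}
suspects = zscore_outliers(error_rate_by_version)
```

With only a handful of groups, a single outlier inflates the standard deviation and caps the achievable z‑score, so the threshold here is lower than the usual 3. Any flagged key is a candidate to be validated against business knowledge, not a conclusion.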
Data Mining Technology Application
Prediction
Time‑series forecasting uses the Holt‑Winters algorithm for metrics such as orders, PV, and UV. During promotions, additional features (historical values, related metrics) are fed into a machine‑learning model to improve accuracy. To reduce false alarms, an alarm is raised only when the cumulative loss relative to the prediction reaches a threshold (e.g., 100 orders).
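For readers unfamiliar with Holt‑Winters, the following is a minimal additive variant written from the textbook recurrences. The initialization, the smoothing parameters, and the sample data are assumptions, and the talk does not state which variant or library Vipshop uses.

```python
def holt_winters_additive(series, season_len, alpha, beta, gamma, horizon):
    """Forecast `horizon` points ahead with additive trend and seasonality."""
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons of history")
    first = series[:season_len]
    second = series[season_len:2 * season_len]
    # Simple initialization from the first two seasons (an assumption).
    level = sum(first) / season_len
    trend = (sum(second) - sum(first)) / season_len ** 2
    seasonals = [y - level for y in first]
    for t, y in enumerate(series):
        s = seasonals[t % season_len]
        prev_level = level
        level = alpha * (y - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonals[t % season_len] = gamma * (y - level) + (1 - gamma) * s
    n = len(series)
    return [
        level + h * trend + seasonals[(n + h - 1) % season_len]
        for h in range(1, horizon + 1)
    ]


# Hypothetical metric with a period of 4 and no trend.
orders = [10, 20, 30, 20] * 6
forecast = holt_winters_additive(orders, season_len=4, alpha=0.3,
                                 beta=0.1, gamma=0.2, horizon=4)
```

On this perfectly periodic input the forecast reproduces the next cycle (10, 20, 30, 20) up to floating‑point error. Real traffic would need tuned parameters and a season length that matches the daily or weekly cycle.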
When a fault occurs, the real values for the affected period are replaced by predictions based on the previous week, so that the abnormal data does not contaminate the model.
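The alarm rule and the fault‑window substitution described above can be sketched as follows. Two details are assumptions: the running loss resets once the actual value catches up with the prediction, and the value from one season earlier stands in for "predictions based on the previous week".

```python
def cumulative_loss_alarm(predicted, actual, threshold=100):
    """Return the index at which cumulative shortfall reaches threshold, or None."""
    loss = 0.0
    for i, (p, a) in enumerate(zip(predicted, actual)):
        shortfall = p - a
        if shortfall > 0:
            loss += shortfall
        else:
            loss = 0.0  # assumption: reset once the metric recovers
        if loss >= threshold:
            return i
    return None


def patch_fault_window(history, fault_start, fault_end, season_len):
    """Replace history[fault_start:fault_end] with values one season earlier."""
    if fault_start < season_len:
        raise ValueError("no earlier season available to copy from")
    patched = list(history)
    for i in range(fault_start, fault_end):
        patched[i] = patched[i - season_len]
    return patched


predicted = [50, 50, 50, 50, 50, 50]
sustained_drop = [49, 51, 20, 15, 10, 48]   # loses 30 + 35 + 40 orders
brief_dip = [49, 51, 20, 50, 50, 50]        # loses 30 orders, then recovers

alarm_at = cumulative_loss_alarm(predicted, sustained_drop)
no_alarm = cumulative_loss_alarm(predicted, brief_dip)
cleaned = patch_fault_window([1, 2, 3, 4, 1, 2, 0, 0], 6, 8, season_len=4)
```

A single bad point does not fire the alarm, while a sustained drop does once the lost orders add up to the threshold. The patched history can then be fed back into the forecasting model in place of the faulty values.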
Root‑cause mining
Workflow:
Sample error and normal logs.
Encode non‑numeric fields (e.g., one‑hot).
Compute feature importance using Spearman correlation and Mutual Information.
If the two measures agree, rank features directly; otherwise train a logistic regression model with L1 regularization.
When logistic regression is inconclusive, fall back to Random Forest or AdaBoost and use the model’s feature‑importance scores.
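The first steps of this workflow, up to the agreement check, can be sketched with the standard library alone. The log records and field names are invented, and the agreement test (same top‑k features under both measures) is one possible reading of "results agree". The L1 logistic regression and tree‑based fallbacks are omitted because they would need a machine‑learning library.

```python
import math
from collections import Counter


def one_hot(records, fields):
    """Expand each categorical field into 0/1 columns named 'field=value'."""
    columns = {}
    for field in fields:
        for value in sorted({r[field] for r in records}):
            columns[f"{field}={value}"] = [int(r[field] == value) for r in records]
    return columns


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0


def rank_features(columns, labels, score):
    """Feature names ordered by decreasing absolute score against the labels."""
    return sorted(columns, key=lambda name: -abs(score(columns[name], labels)))


# Hypothetical sample of error (1) and normal (0) log records.
LOGS = [
    {"version": "6.1", "network": "wifi", "error": 1},
    {"version": "6.1", "network": "4g", "error": 1},
    {"version": "6.1", "network": "4g", "error": 1},
    {"version": "6.0", "network": "wifi", "error": 0},
    {"version": "6.0", "network": "wifi", "error": 0},
    {"version": "6.0", "network": "4g", "error": 0},
    {"version": "6.0", "network": "wifi", "error": 1},
    {"version": "6.1", "network": "4g", "error": 0},
]
labels = [r["error"] for r in LOGS]
columns = one_hot(LOGS, ["version", "network"])
top_by_mi = set(rank_features(columns, labels, mutual_information)[:2])
top_by_spearman = set(rank_features(columns, labels, spearman)[:2])
measures_agree = top_by_mi == top_by_spearman
```

In this toy sample both measures put the two `version` columns on top and give the `network` columns a score of zero, so the features could be ranked directly without training a model.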
Application Ecosystem Construction and Planning
An independent analysis team is established to embed data‑driven decision making into the operations workflow, improving efficiency and knowledge transfer.
Data‑analysis platform: a data warehouse built on traditional big‑data technologies (Kafka, Flume, Spark, Elasticsearch, ROLAP/MOLAP, Redis).
Unified information platform: aggregates release information, ITIL events, CMDB data, and monitoring alerts into a central repository.
Both algorithmic and rule‑based approaches are combined: algorithms handle point‑level issues (e.g., anomaly detection, prediction), while a rule engine encodes business knowledge to resolve complex correlations.
Short‑term goals focus on effective alarm suppression and merging; long‑term goals aim for automated fault localization and remediation actions such as service restart, downgrade, rate‑limiting, disk management, and traffic scheduling.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
