From ITIL to SRE: How Vipshop Transformed Its Operations

This article recounts Vipshop’s journey from a traditional ITIL‑based operations model to an SRE‑inspired, automated workflow, detailing the construction of ITIL processes, the challenges faced, the shift toward automation, and personal insights on managing people, quality, and change.

Efficient Ops

1. ITIL: Construction Methods and Bottlenecks

Vipshop began building its operations system in 2013, five years after the company was founded. By then, daily order volume had reached 130,000, the user base had hit 50 million, and the fleet had grown to more than 8,000 servers, generating roughly 20 incidents per day. To make failures controllable, the team introduced a comprehensive ITIL framework centered on strict release, change, and incident-management processes.

A typical release cycle took 3–5 days. Developers had to submit change tickets at least 24 hours in advance and obtain several levels of managerial approval. Quality was treated as the baseline requirement: without it, gains in speed or cost meant nothing.
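The ticket rule described above can be pictured as a simple gate. The following is a minimal sketch, not Vipshop's actual system: the 24-hour lead time comes from the article, while the `ChangeTicket` class, the `release_blockers` function, and the approver role names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# The 24-hour lead time is from the article. The approval levels are not
# named there, so these role names are assumptions.
MIN_LEAD_TIME = timedelta(hours=24)
REQUIRED_APPROVERS = ("team_lead", "department_manager")


@dataclass
class ChangeTicket:
    service: str
    submitted_at: datetime
    scheduled_at: datetime
    approvals: set = field(default_factory=set)


def release_blockers(ticket: ChangeTicket) -> list:
    """Return the reasons a ticket may not proceed; an empty list means go."""
    blockers = []
    if ticket.scheduled_at - ticket.submitted_at < MIN_LEAD_TIME:
        blockers.append("submitted less than 24 hours before the change")
    for role in REQUIRED_APPROVERS:
        if role not in ticket.approvals:
            blockers.append(f"missing approval: {role}")
    return blockers


# Hypothetical ticket: filed only 23 hours ahead, with one approval missing.
ticket = ChangeTicket(
    service="checkout",
    submitted_at=datetime(2014, 3, 1, 10, 0),
    scheduled_at=datetime(2014, 3, 2, 9, 0),
    approvals={"team_lead"},
)
print(release_blockers(ticket))
```

Returning every blocker at once, rather than failing on the first, lets the submitter fix the ticket in a single pass, which matters when each round trip costs a day.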

The article highlights four characteristics of ITIL: it is a management system, distinct from DevOps or SRE; it requires processes to be defined before the systems that support them; it responds to failures by refining processes; and it assigns responsibility to individuals rather than to the system.

2. Challenges and Breakthroughs

After three years of refinement (2013‑2016), the ITIL model reduced annual incidents from over 5,700 to about 2,800 while the business grew dramatically. However, the team faced diminishing returns from adding more processes, human resistance to ever‑increasing bureaucracy, and the inability of any process to cover every possible failure.

Specific dilemmas included whether to elevate a recurring network jitter alarm to disaster level, which would force a rapid response but risk alert fatigue, and how to apply resource limits fairly when services differ in how sensitive they are to those limits.

3. Automation and SRE Attempts

Starting in 2016, Vipshop experimented with SRE practices. Over the following year, the team piloted the approach on more than 30 services and defined two core SRE responsibilities: transferring operational tasks to other teams (e.g., service desk, developers) and taking ownership of service quality.

Key initiatives included building an automated operations platform that isolated command execution, enforced approval workflows, and collected production data. The platform also standardized change deployment through scripted commands and packaged RPMs to ensure reproducibility.
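One way to picture such a platform is a single entry point that every production command must pass through. The sketch below is an illustration under stated assumptions, not the platform Vipshop built: the `OpsPlatform` class, its allowlist, the rule that the approver must differ from the requester, and the audit-record fields are all hypothetical. The command uses the local Python interpreter as a stand-in for a real deployment script.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class OpsPlatform:
    # Hypothetical allowlist: only these executables may run via the platform.
    allowed_programs: frozenset
    audit_log: list = field(default_factory=list)

    def _record(self, argv, requested_by, approved_by, outcome):
        # Every attempt is recorded, including rejected ones.
        self.audit_log.append({
            "argv": list(argv),
            "requested_by": requested_by,
            "approved_by": approved_by,
            "outcome": outcome,
        })

    def run(self, argv, requested_by, approved_by=None):
        """Run an allowlisted, approved command and return its stdout."""
        if not argv or argv[0] not in self.allowed_programs:
            self._record(argv, requested_by, approved_by, "rejected: not allowlisted")
            raise PermissionError("program is not on the allowlist")
        if approved_by is None or approved_by == requested_by:
            self._record(argv, requested_by, approved_by, "rejected: needs a second approver")
            raise PermissionError("a second person must approve the command")
        # An argument list (no shell) keeps the command from being reinterpreted.
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        self._record(argv, requested_by, approved_by, f"exit {result.returncode}")
        return result.stdout


platform = OpsPlatform(allowed_programs=frozenset({sys.executable}))
out = platform.run(
    [sys.executable, "-c", "print('deployed')"],
    requested_by="alice",
    approved_by="bob",
)
print(out.strip())
print(platform.audit_log[-1]["outcome"])
```

Logging rejected attempts alongside successful runs is a deliberate choice here: the audit trail then doubles as production data about how often the process itself gets in the way.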

Automation extended to a “train release” model where weekly releases were ordered like train cars, allowing high‑priority services to be moved forward by a “train conductor” (the release manager) when necessary.
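The train metaphor maps naturally onto an ordered list with one privileged operation. This is a minimal sketch assuming that the conductor can only move a service toward the front; the `ReleaseTrain` class, its method names, the week label, and the service names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReleaseTrain:
    week: str
    cars: list = field(default_factory=list)  # services in release order

    def couple(self, service: str) -> None:
        """Add a service to the back of this week's train."""
        if service in self.cars:
            raise ValueError(f"{service} is already on the train")
        self.cars.append(service)

    def move_forward(self, service: str, position: int = 0) -> None:
        """Conductor's override: move a service to an earlier slot."""
        current = self.cars.index(service)  # ValueError if not on the train
        if position < 0 or position > current:
            raise ValueError("a car can only be moved toward the front")
        self.cars.insert(position, self.cars.pop(current))


train = ReleaseTrain(week="2016-W40")
for name in ("search", "cart", "payment", "recommendation"):
    train.couple(name)

# The release manager promotes a high-priority service to the front.
train.move_forward("payment")
print(train.cars)  # ['payment', 'search', 'cart', 'recommendation']
```

Every other service keeps its relative order after the override, so one priority change does not reshuffle the rest of the week's schedule.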

4. Personal Insights and Reflections

The author emphasizes that any operations framework ultimately manages human nature, so its policies have to start from a view of whether people are inherently good or bad. Simpler systems are preferable, and even small changes can trigger a butterfly effect.

Choosing between ITIL and DevOps/SRE depends on the organization’s maturity: ITIL offers strong control and is suitable for early‑stage companies, while DevOps/SRE brings flexibility but requires cultural readiness.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
