How Bilibili Scaled Its Operations: From Chaos to Automated DevOps
In this talk, Bilibili’s operations manager shares the journey from early firefighting to a standardized, automated DevOps pipeline, covering Ansible‑based configuration, high‑quality release processes, metric‑driven monitoring, and elastic Docker‑based scaling using Mesos, Marathon, Consul, and custom IPAM plugins.
Preface
I am honored to share my experience this afternoon. I am a bit nervous, but I hope to leave a memorable impression.
My name is Liang Xiaocong, and I am the operations development manager at Bilibili. I joined in May 2015 as the second ops engineer and have witnessed the whole evolution of Bilibili's operations.
Story Beginning: Bilibili Crashed
In 2015 a trending phrase on Weibo asked "Did Bilibili crash today?" Our ops team, then only four people, was constantly firefighting. As the business grew, we needed a breakthrough in three areas:
Standardize our processes.
Establish a high‑quality delivery pipeline.
Build an elastic compute‑resource platform.
How to Standardize
Standardization usually means writing rules and documenting them, but with a shortage of ops staff we turned documentation into executable code using Ansible.
For me, a tool that can batch‑execute commands is not true automation; the essence of Ansible lies in the playbook, which describes the desired state rather than a sequence of steps.
Ansible playbooks are result‑oriented and idempotent, so the same playbook can be run repeatedly without unintended side effects.
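To make the idea of idempotence concrete, here is a minimal Python sketch of a "desired state" operation. It is not Ansible code and not taken from the talk; the function name ensure_line and the sample setting are hypothetical. It only illustrates the pattern a playbook task follows: check the current state first, act only if it differs, and report whether anything changed.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed and False if it already
    matched the desired state, similar to the per-task "changed"
    status that Ansible reports.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # desired state already holds, so do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conf = Path(tmp) / "sysctl.conf"
        # Hypothetical setting, used only for illustration.
        print(ensure_line(conf, "net.ipv4.ip_forward = 1"))  # first run edits the file
        print(ensure_line(conf, "net.ipv4.ip_forward = 1"))  # second run is a no-op
```

A plain "append this line" command would add a duplicate on every run. Describing the end state instead is what makes repeated execution safe.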
We created a Docker deployment playbook that abstracts Docker and Mesos into roles, enabling users to deploy a standard Docker environment without knowing low‑level details.
Improving Delivery Quality
We needed a standardized release process and a fast feedback mechanism for release quality.
We built a simple release system that enforces directory structures for code, configuration, and logs, helping us solve most problems through a standardized workflow.
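The talk does not show how the release system enforces its layout, so the following is only a sketch of what such a check could look like. The directory names code, conf, and logs and both function names are assumptions chosen for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: the talk names the three categories (code,
# configuration, logs) but not the actual paths.
REQUIRED_DIRS = ("code", "conf", "logs")


def missing_dirs(app_root: Path) -> list[str]:
    """Return the required subdirectories that are absent under app_root."""
    return [name for name in REQUIRED_DIRS if not (app_root / name).is_dir()]


def check_release(app_root: Path) -> None:
    """Refuse to release an application whose layout is non-standard."""
    missing = missing_dirs(app_root)
    if missing:
        raise ValueError(f"{app_root.name}: missing directories {missing}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "demo-app"
        for name in ("code", "conf"):
            (root / name).mkdir(parents=True)
        print(missing_dirs(root))  # the logs directory has not been created yet
```

Rejecting a non-conforming application before deployment means every later tool (log collection, config rollout, rollback) can rely on the same paths.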
Testing and Monitoring
We distinguish measurement, which shows how the code is behaving so that we can keep it under control, from monitoring, which detects problems. Our measurement system uses a load‑balancer front‑end, a StatsD cluster for aggregation, and Graphite for time‑series storage, all horizontally scalable.
Developers can report custom metrics with a few lines of code, enabling them to observe latency, error rates, etc., and decide whether to roll back.
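As a rough illustration of "a few lines of code", here is a small client that emits metrics in the StatsD plain‑text format (name:value|type, with c for counters and ms for timings). The class name, the metric names, and the choice of UDP on port 8125 (the usual StatsD default) are assumptions; the talk does not describe the client library that was actually used.

```python
import socket


def format_metric(name: str, value: int, kind: str) -> bytes:
    """Encode one metric in the StatsD text format: <name>:<value>|<type>."""
    return f"{name}:{value}|{kind}".encode("ascii")


class MetricsClient:
    """Fire-and-forget UDP sender, so a lost packet never blocks a request."""

    def __init__(self, host: str = "127.0.0.1", port: int = 8125) -> None:
        # In the setup described above, this would be the load balancer
        # in front of the StatsD cluster (address assumed here).
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, name: str, count: int = 1) -> None:
        """Increment a counter, for example an error count."""
        self._sock.sendto(format_metric(name, count, "c"), self._addr)

    def timing(self, name: str, millis: int) -> None:
        """Record one duration in milliseconds, for example request latency."""
        self._sock.sendto(format_metric(name, millis, "ms"), self._addr)
```

A request handler would then call something like metrics.timing("web.video.latency", elapsed_ms) and metrics.incr("web.video.errors"), where both metric names are made up for this example.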
How to Achieve Elastic Scaling
During traffic spikes we need rapid resource expansion. We built a Docker‑based scalable web cluster.
The architecture includes:
Service discovery container registering itself to Consul.
IPAM plugin assigning planned IP addresses to containers.
Monitor agent reporting container metrics.
LVS for layer‑4 load balancing and Nginx for layer‑7.
A resource scheduler selects idle resources for scaling, while a logging center collects container logs.
We use the open‑source Upsync module to dynamically update Nginx upstreams from Consul data without reloads.
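The sketch below shows how a backend could be published so that Upsync picks it up. It assumes the Consul KV layout described in the upsync module's README (one key per backend under upstreams/<name>/<ip>:<port>, with a JSON value holding weight, max_fails, and fail_timeout). The function names, the upstream name, and the addresses are hypothetical, and the talk does not say which registration path Bilibili used.

```python
import json
import urllib.request


def upsync_kv_entry(upstream: str, ip: str, port: int, weight: int = 1) -> tuple[str, bytes]:
    """Build the Consul KV key and JSON value for one backend server."""
    key = f"upstreams/{upstream}/{ip}:{port}"
    value = json.dumps({"weight": weight, "max_fails": 2, "fail_timeout": 10})
    return key, value.encode("utf-8")


def register_backend(consul_url: str, upstream: str, ip: str, port: int) -> None:
    """PUT the entry into Consul's KV store (needs a reachable Consul agent)."""
    key, body = upsync_kv_entry(upstream, ip, port)
    req = urllib.request.Request(f"{consul_url}/v1/kv/{key}", data=body, method="PUT")
    with urllib.request.urlopen(req, timeout=3) as resp:
        resp.read()


if __name__ == "__main__":
    # Only builds the entry; nothing is sent without calling register_backend.
    key, body = upsync_kv_entry("web", "10.0.0.5", 8080)
    print(key, body.decode("utf-8"))
```

Because Nginx pulls this data itself, adding or removing a container only requires writing or deleting one key. No configuration file is regenerated and no reload is triggered.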
Our monitoring agent runs as a container, mounts the Docker daemon socket, gathers metrics, and pushes them to InfluxDB, providing the data for autoscaling decisions.
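The talk does not describe the rule that turns these metrics into a scaling decision. As one possible illustration, the sketch below uses a simple proportional rule (the same idea as the Kubernetes Horizontal Pod Autoscaler formula, named here as a stand‑in and not as Bilibili's method). The function name, the 60% CPU target, and the instance limits are all assumptions.

```python
import math
from statistics import mean


def desired_instances(current: int, cpu_percent: list[float], target: float = 60.0,
                      min_n: int = 2, max_n: int = 50) -> int:
    """Proportional rule: size the cluster so average CPU moves toward `target`.

    `cpu_percent` holds recent per-container CPU readings, such as the
    values the monitoring agent stores in InfluxDB.
    """
    if not cpu_percent:
        return current  # no data, so make no change
    wanted = math.ceil(current * mean(cpu_percent) / target)
    return max(min_n, min(max_n, wanted))  # clamp to the allowed range


if __name__ == "__main__":
    print(desired_instances(4, [90.0, 90.0, 90.0, 90.0]))
```

A production rule would normally add a cooldown period and separate thresholds for scaling up and down, so that brief spikes do not cause the cluster to oscillate.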
Service discovery is handled by Consul, chosen for its compatibility with Upsync.
Network requirements:
Each container must have an independent IP.
Business‑level network isolation.
Simplicity for troubleshooting.
We evaluated CNI and CNM and finally selected a macvlan‑based solution that maps business VLANs onto the physical switches.
We implemented a custom IPAM plugin (Python Flask API) that supports address allocation, release, pool queries, and pool release, storing data in Consul.
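The core of such a plugin is the allocation logic behind the API. The sketch below shows that logic only, under stated assumptions: it is not the actual plugin, it leaves out the Flask layer, and a plain dict stands in for Consul so the example runs on its own. The class name, method names, and subnet are hypothetical.

```python
import ipaddress
from typing import Optional


class IpamPool:
    """Allocate and release addresses from one subnet.

    `store` stands in for Consul's KV store; a dict keeps the sketch
    runnable without any external service.
    """

    def __init__(self, cidr: str, store: Optional[dict] = None) -> None:
        self.network = ipaddress.ip_network(cidr)
        self.store = {} if store is None else store  # maps ip -> container id

    def allocate(self, container_id: str) -> str:
        """Hand out the lowest free host address in the pool."""
        for host in self.network.hosts():
            ip = str(host)
            if ip not in self.store:
                self.store[ip] = container_id
                return ip
        raise RuntimeError(f"pool {self.network} is exhausted")

    def release(self, ip: str) -> bool:
        """Free an address; returns False if it was not allocated."""
        return self.store.pop(ip, None) is not None

    def usage(self) -> dict:
        """Pool query: total, used, and free host addresses."""
        total = sum(1 for _ in self.network.hosts())
        used = len(self.store)
        return {"total": total, "used": used, "free": total - used}
```

With a real Consul backend the check and the write in allocate would need to be atomic (for example through a check‑and‑set write), because two hosts may request an address at the same moment.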
Summary
The above details our journey from zero to one in building an operations system for a fast‑growing internet company.
Key takeaways:
Maintain a strong mindset; transitioning from a mature ops environment to a startup brings many challenges.
Identify a few critical entry points: standardization, delivery quality, and elastic scaling.
Adopt a fast‑iteration approach; large‑scale practices may not fit startups.
Avoid tactical diligence that masks strategic laziness; always reflect on the purpose and efficiency of your work.
Q&A
Q: What were your main entry points? A: Standardization, improving delivery quality, and building elastic scaling resources.
Q: How to shift from tactical diligence to strategic effectiveness? A: Regularly review the significance of your weekly tasks, plan improvements, and seek resources to execute them.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
