How Bilibili Scaled Its Ops: From DIY Deployments to Prometheus Monitoring
From early manual deployments to a multi-layered monitoring stack (ELK, Zabbix, StatsD, Grafana, and Prometheus), Bilibili's ops team shares the evolution, challenges, and lessons learned in building scalable, automated infrastructure for massive internet traffic.
Preface
As the internet grows, data volumes increase and operations work becomes more critical. At the GOPS2017 Global Ops Conference in Shenzhen, the author shared how Bilibili's ops monitoring system developed over the past year.
1. Automated Deployment
In 2015 the ops team was newly formed and overwhelmed. They built a simple deployment system using OpenLDAP for authentication, GitLab for code hosting, Jenkins for builds, and Ansible for scripting, all wrapped in a command‑line interface.
2. Necessity of Monitoring
The typical backend architecture includes a CDN, LVS load balancers, Tengine SLB, caches, queues, and databases, with services written in Go, Java, PHP, Python, and C/C++. With that many layers and languages, the lack of monitoring made fault diagnosis difficult.
3. First Alert: ELK
The team collected error logs from the CDN and indexed them in Elasticsearch by domain, node, user IP, and error code. Alerts sent via SMS and WeChat allowed quick fault localization.
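The grouping idea behind this kind of alert can be sketched in a few lines. This is a minimal illustration, not the team's actual pipeline: the record fields, sample values, and the `find_hotspots` helper are all assumptions, and plain Python stands in for the Elasticsearch aggregation.

```python
from collections import Counter

# Hypothetical CDN error records; field names and values are assumptions.
records = [
    {"domain": "a.example.com", "node": "cdn-sz-01", "user_ip": "203.0.113.5", "code": 502},
    {"domain": "a.example.com", "node": "cdn-sz-01", "user_ip": "203.0.113.9", "code": 502},
    {"domain": "b.example.com", "node": "cdn-bj-02", "user_ip": "198.51.100.7", "code": 404},
]


def find_hotspots(records, threshold):
    """Count errors per (domain, node, code) and keep groups at or above threshold."""
    counts = Counter((r["domain"], r["node"], r["code"]) for r in records)
    return {key: n for key, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    for (domain, node, code), n in find_hotspots(records, threshold=2).items():
        # In a real setup this is where an SMS or WeChat alert would be triggered.
        print(f"ALERT {domain} on {node}: {n} errors with code {code}")
```

Grouping by domain, node, and error code together is what narrows a spike down to a specific edge node rather than just "the CDN".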
4. Basic Monitoring with Zabbix
After CDN monitoring, the team adopted Zabbix for foundational monitoring because it is quick to deploy, offers flexible alert strategies, and has a large user base.
5. Application Monitoring with StatsD
StatsD provides lightweight metric collection via UDP, avoiding impact on the monitored program. Integrated with Graphite, it produces graphs for various metrics.
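As a rough illustration of why the UDP approach is cheap for the caller, the sketch below formats a metric in the StatsD line protocol (`name:value|type`) and sends it as a single datagram. The metric names and the `format_metric` and `send_metric` helpers are hypothetical; the host and port assume a StatsD daemon listening locally on its default port 8125.

```python
import socket


def format_metric(name, value, metric_type):
    """Build a StatsD line: <name>:<value>|<type> (c = counter, ms = timer, g = gauge)."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one datagram and return immediately; no reply is awaited."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    send_metric(format_metric("api.login.requests", 1, "c"))
    send_metric(format_metric("api.login.latency", 42, "ms"))
```

Because UDP is connectionless, a slow or unavailable collector does not block the application; the trade-off is that dropped datagrams are silently lost.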
6. Visualization with Grafana
Grafana aggregates data from Zabbix, StatsD, and other sources, offering APM dashboards and detailed response‑time charts.
7. Unified Tracking with Dapper
The team implemented an internal Dapper distributed-tracing system that propagates a TrackID from the CDN entry point all the way down to SQL queries, enabling fine-grained latency analysis of each hop.
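The propagation idea can be sketched as follows. This is a simplified, single-process illustration rather than Bilibili's implementation: the `span` context manager, the `spans` list, and the function names are assumptions, and a real system would carry the ID across processes (for example in request headers) and ship spans to a collector.

```python
import time
import uuid
from contextlib import contextmanager

spans = []  # collected (track_id, span name, duration in seconds)


@contextmanager
def span(track_id, name):
    """Time the enclosed block and record it under the given track ID."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((track_id, name, time.perf_counter() - start))


def run_sql(track_id):
    with span(track_id, "sql.query"):
        time.sleep(0.01)  # stands in for a database call


def handle_request(track_id=None):
    # Reuse the ID assigned at the entry point, or create one if none was passed in.
    track_id = track_id or uuid.uuid4().hex
    with span(track_id, "http.handler"):
        run_sql(track_id)
    return track_id


if __name__ == "__main__":
    tid = handle_request()
    for recorded_id, name, seconds in spans:
        print(recorded_id, name, f"{seconds * 1000:.1f} ms")
```

Since every span shares the same ID, the time spent in the SQL layer can be separated from the handler's total latency for a single request.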
8. User‑Facing Monitoring: Misaka
Misaka extended error monitoring to the client side, providing a UI with historical comparisons for better analysis.
9. Alert Integration
Various monitors (Redis clusters, Kafka via Databus, Docker) were integrated into a unified timeline view, reducing the need to open multiple windows during incident investigation.
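A unified timeline is essentially a merge of per-source event streams ordered by time. The sketch below assumes each source already yields events sorted by timestamp; the event tuples, source names, and messages are invented for illustration.

```python
import heapq

# Hypothetical alerts as (unix timestamp, source, message), each list sorted by time.
redis_alerts = [(1000, "redis", "memory above 90%"), (1060, "redis", "failover started")]
kafka_alerts = [(1010, "kafka", "consumer lag rising")]
docker_alerts = [(1005, "docker", "container restarted"), (1070, "docker", "OOM kill")]


def merge_timeline(*sources):
    """Merge already-sorted alert streams into one list ordered by timestamp."""
    return list(heapq.merge(*sources, key=lambda event: event[0]))


if __name__ == "__main__":
    for ts, source, message in merge_timeline(redis_alerts, kafka_alerts, docker_alerts):
        print(ts, source, message)
```

Seeing the events interleaved in time order makes it easier to spot which alert came first during an incident.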
10. Moving to Prometheus
Due to growing monitoring requirements, the team migrated to Prometheus, keeping Grafana for visualization and gradually phasing out Zabbix. Prometheus provided a complete monitoring solution, handling MySQL metrics and more.
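Prometheus collects metrics by scraping a plain-text endpoint, so it can help to see what that text looks like. The sketch below renders one gauge in the Prometheus text exposition format; the metric name `mysql_connections`, the `db` label, and the `render_gauge` helper are hypothetical and do not reflect the output of any particular exporter.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to a numeric reading.
    Label values are not escaped here, so this only suits simple strings.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    readings = {(("db", "main"),): 12, (("db", "replica"),): 3}
    print(render_gauge("mysql_connections", "Open MySQL connections", readings), end="")
```

In practice an official client library or exporter would produce this output; the point here is only that the scraped format is simple, labeled, line-oriented text.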
The work continues, and we look forward to sharing more with you.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
