How Bilibili Scaled Its Ops: From DIY Deployments to Prometheus Monitoring

From early manual deployments to a multi-layered monitoring stack that includes ELK, Zabbix, StatsD, Grafana, and Prometheus, Bilibili's ops team shares the evolution, challenges, and lessons learned in building scalable, automated infrastructure for massive internet traffic.

Efficient Ops

Preface

As internet services grow, data volumes rise and operations work becomes more critical. On the occasion of attending the GOPS2017 Global Ops Conference in Shenzhen, the author shares how Bilibili's ops monitoring system has developed over the past year.

1. Automated Deployment

In 2015 the ops team was newly formed and overwhelmed. They built a simple deployment system using OpenLDAP for authentication, GitLab for code hosting, Jenkins for builds, and Ansible for scripting, all wrapped in a command‑line interface.
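
As a rough illustration of the "command-line wrapper" idea, the sketch below builds an `ansible-playbook` invocation from a few arguments. It is not Bilibili's tool: the `deploy` program name, the `deploy.yml` playbook, the `inventories/<env>` layout, and the `service`/`version` variables are all hypothetical, and the command is only printed, never executed.

```python
import argparse
import shlex


def build_deploy_command(service, env, version, inventory_dir="inventories"):
    """Return the ansible-playbook argv for one deployment (nothing is run)."""
    return [
        "ansible-playbook",
        "-i", f"{inventory_dir}/{env}",   # hypothetical per-environment inventory
        "deploy.yml",                     # hypothetical playbook name
        "-e", f"service={service}",       # extra vars passed to the playbook
        "-e", f"version={version}",
    ]


def main(argv=None):
    parser = argparse.ArgumentParser(prog="deploy")
    parser.add_argument("service")
    parser.add_argument("--env", choices=["test", "prod"], default="test")
    parser.add_argument("--version", required=True)
    args = parser.parse_args(argv)

    cmd = build_deploy_command(args.service, args.env, args.version)
    # A real wrapper would authenticate the user first and then hand the
    # command to subprocess.run; this sketch only shows what would be run.
    print(shlex.join(cmd))
    return cmd


# Example invocation with made-up values.
main(["web-api", "--env", "prod", "--version", "1.4.2"])
```

Keeping command construction in a pure function makes the wrapper easy to test without touching any hosts.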

2. Necessity of Monitoring

The typical backend architecture spans a CDN, LVS load balancers, Tengine SLB, caches, queues, and databases, with services written in Go, Java, PHP, Python, and C/C++. Without monitoring across these layers, diagnosing faults was difficult.

3. First Alert: ELK

The team collected error logs from the CDN and indexed them in Elasticsearch by domain, node, user IP, and error code. Alerts sent via SMS and WeChat let engineers localize faults quickly.
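
To make the indexing dimensions concrete, here is a minimal sketch that turns log lines into documents carrying those four fields and flags groups that cross an alert threshold. The tab-separated log format, the sample values, and the threshold are assumptions for illustration; in the stack described above, the aggregation would be done by Elasticsearch queries rather than in application code.

```python
from collections import Counter


def to_document(line):
    """Split a tab-separated error-log line into the four indexed fields."""
    domain, node, user_ip, error_code = line.rstrip("\n").split("\t")
    return {
        "domain": domain,
        "node": node,
        "user_ip": user_ip,
        "error_code": int(error_code),
    }


def noisy_groups(docs, threshold):
    """Count errors per (domain, node, error_code) and keep groups at or above threshold."""
    counts = Counter((d["domain"], d["node"], d["error_code"]) for d in docs)
    return {key: n for key, n in counts.items() if n >= threshold}


# Made-up sample lines in the assumed format.
sample = [
    "example.com\tnode-a\t203.0.113.7\t502",
    "example.com\tnode-a\t203.0.113.8\t502",
    "example.com\tnode-b\t203.0.113.9\t404",
]
docs = [to_document(line) for line in sample]
alerts = noisy_groups(docs, threshold=2)
print(alerts)
```

Grouping by domain, node, and error code is what lets an alert point at a specific CDN node rather than just reporting that errors exist.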

4. Basic Monitoring with Zabbix

After CDN monitoring, the team adopted Zabbix for foundational monitoring because it is quick to deploy, offers flexible alert strategies, and has a large user base.

5. Application Monitoring with StatsD

StatsD provides lightweight metric collection via UDP, avoiding impact on the monitored program. Integrated with Graphite, it produces graphs for various metrics.
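
The fire-and-forget nature of UDP is what keeps the cost to the monitored program low. A minimal sketch of a client, using only the standard library and the plain StatsD line format (`name:value|type`), might look like the following. The metric names are hypothetical, and the host and port assume a StatsD daemon listening on its default port 8125 on localhost.

```python
import socket


def format_metric(name, value, metric_type):
    """Render one metric in the StatsD line format, e.g. 'api.hits:1|c'."""
    return f"{name}:{value}|{metric_type}"


class StatsDClient:
    def __init__(self, host="127.0.0.1", port=8125):
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, payload):
        try:
            # UDP: no connection and no acknowledgement, so the caller never
            # waits on the metrics backend.
            self._sock.sendto(payload.encode("ascii"), self._addr)
        except OSError:
            pass  # a metrics failure must never break the application

    def incr(self, name, count=1):
        """Increment a counter."""
        self._send(format_metric(name, count, "c"))

    def timing(self, name, milliseconds):
        """Record a duration in milliseconds."""
        self._send(format_metric(name, milliseconds, "ms"))


client = StatsDClient()
client.incr("api.video.requests")        # hypothetical metric name
client.timing("api.video.latency", 37)   # hypothetical metric name
```

If no daemon is listening, the datagrams are simply dropped, which is the trade-off being made: some lost samples in exchange for zero blocking in the request path.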

6. Visualization with Grafana

Grafana aggregates data from Zabbix, StatsD, and other sources, offering APM dashboards and detailed response‑time charts.

7. Unified Tracking with Dapper

The team implemented an internal distributed tracing system modeled on Dapper, propagating a TrackID from the CDN entry point down to SQL queries, which enables fine-grained latency analysis.
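
The propagation idea can be sketched in a few lines: an ID is fixed at the entry point, carried implicitly through every call, and attached to each timed span, including the SQL statement. This is an illustrative toy, not the internal system; the function names, the in-memory `spans` list, and the SQL comment convention are assumptions.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

# Holds the current request's track ID without passing it through every call.
_track_id = contextvars.ContextVar("track_id", default=None)
spans = []  # a real system would ship spans to a collector instead


def start_request(incoming_track_id=None):
    """Reuse the ID assigned upstream (e.g. at the CDN edge) or create one."""
    track_id = incoming_track_id or uuid.uuid4().hex
    _track_id.set(track_id)
    return track_id


@contextmanager
def span(name):
    """Time a block of work and record it under the current track ID."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        spans.append({"track_id": _track_id.get(), "name": name, "ms": elapsed_ms})


def run_query(sql):
    """Tag the SQL text with the track ID so slow queries can be traced back."""
    with span("sql"):
        return f"/* track_id={_track_id.get()} */ {sql}"


track_id = start_request("abc123")
with span("handler"):
    tagged = run_query("SELECT 1")
print(tagged)
```

Because inner spans finish first, the `sql` span is recorded before the enclosing `handler` span; joining spans on the shared ID is what reconstructs the per-request latency breakdown.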

8. User‑Facing Monitoring: Misaka

Misaka extended error monitoring to the client side, providing a UI with historical comparisons for better analysis.

9. Alert Integration

Various monitors (Redis clusters, Kafka via Databus, Docker) were integrated into a unified timeline view, reducing the need to open multiple windows during incident investigation.
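
One simple way to picture a unified timeline is as a merge of per-source event streams ordered by timestamp. The sketch below assumes each monitor already emits its events in time order; the `Event` shape, the sources' messages, and the timestamps are made up for illustration.

```python
import heapq
from typing import NamedTuple


class Event(NamedTuple):
    ts: int        # seconds since some reference point
    source: str
    message: str


def merge_timeline(*streams):
    """Merge already-sorted event streams into one list ordered by timestamp."""
    return list(heapq.merge(*streams, key=lambda event: event.ts))


redis_events = [Event(100, "redis", "latency spike"), Event(160, "redis", "recovered")]
kafka_events = [Event(120, "kafka", "consumer lag rising")]
docker_events = [Event(90, "docker", "container restarted"), Event(130, "docker", "OOM kill")]

timeline = merge_timeline(redis_events, kafka_events, docker_events)
for event in timeline:
    print(event.ts, event.source, event.message)
```

Seeing the Docker restart precede the Redis and Kafka symptoms in a single ordered view is the kind of correlation that otherwise requires flipping between several dashboards.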

10. Moving to Prometheus

Due to growing monitoring requirements, the team migrated to Prometheus, keeping Grafana for visualization and gradually phasing out Zabbix. Prometheus provided a complete monitoring solution, handling MySQL metrics and more.

The work continues, and we look forward to sharing more with you.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
