Inside Dianping’s Ops: Building Scalable Monitoring, Automation, and Self‑Service Platforms
This article describes how Dianping’s operations team of fewer than 40 people organizes its groups, runs a dual‑datacenter architecture, and builds monitoring, automation, configuration, and analysis systems (including Zabbix, Cat, the workflow system, Button, and a custom Radar platform) to achieve high availability, self‑service, and continuous improvement.
The content is compiled from the "Efficient Ops" series, featuring guest speaker Zhang Guanyu, an operations architect at Dianping, who shares the evolution of Dianping’s operations from a small team to a mature, automated platform.
1. Operations Team Structure
Dianping’s ops are divided into four groups: Application Ops, System Ops, Ops Development, and Monitoring Ops, plus DBA and Security teams, totaling fewer than 40 members.
2. Overall Architecture
Dianping runs a dual‑datacenter setup: the A‑datacenter hosts production workloads, while the B‑datacenter handles testing and big‑data jobs (Hadoop, log backup, disaster recovery). The architecture follows a multi‑layer model with DNS+CDN, F5 and Dengine load balancers, Varnish caching, web front‑ends, RPC‑based services, and distributed storage (MogileFS). Every service is deployed for high availability, with at least two instances.
Monitoring spans four dimensions: business metrics (e.g., transaction rates), application metrics (error counts, latency), system resources (CPU, memory, disk, processes), and network health (packet loss, ping, connections), using tools such as Cat and Zabbix.
3. Operations Systems
3.1 Comprehensive Monitoring System
Beyond Zabbix, Dianping relies heavily on Cat for business and application monitoring; it provides real‑time dashboards for transaction rates, error spikes, and service health. A Logscan tool scans logs according to configurable policies and complements Zabbix and Cat, particularly for security‑related detection.
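The article does not describe how Logscan is implemented. As a minimal sketch of policy-driven log scanning, assuming a policy is just a name, a regular expression, and a match-count threshold (the Policy class, the scan function, and the sample log lines are all hypothetical):

```python
import re
from dataclasses import dataclass


@dataclass
class Policy:
    """Hypothetical scan policy: a name, a regex, and an alert threshold."""
    name: str
    pattern: str
    threshold: int  # alert once at least this many lines match


def scan(lines, policies):
    """Return the names of policies whose match count reaches the threshold."""
    compiled = [(p, re.compile(p.pattern)) for p in policies]
    counts = {p.name: 0 for p in policies}
    for line in lines:
        for policy, regex in compiled:
            if regex.search(line):
                counts[policy.name] += 1
    return [p.name for p in policies if counts[p.name] >= p.threshold]


# Invented sample data, for illustration only.
sample_log = [
    "sshd: Failed password for root from 10.0.0.5",
    "sshd: Failed password for root from 10.0.0.5",
    "app: GET /index 200",
    "sshd: Failed password for admin from 10.0.0.5",
]
policies = [
    Policy("ssh-bruteforce", r"Failed password", threshold=3),
    Policy("sql-injection", r"(?i)union\s+select", threshold=1),
]
alerts = scan(sample_log, policies)
print(alerts)  # ['ssh-bruteforce']
```

Keeping policies as data rather than code is what makes such a scanner configurable: a new detection becomes a new Policy entry, not a new script.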
3.2 Automation Workflows
The workflow system standardizes change processes (initiate, audit, execute, verify) and automates tasks such as capacity expansion, deployments, memory dumps, and IP blocking. Over 98% of changes now flow through this platform, generating audit trails and enabling data‑driven optimization.
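The four-step change process can be pictured as a small state machine that records who moved a change through each stage. The sketch below is an illustration only, not Dianping's implementation; the ChangeTicket class and the rule that a requester may not audit their own change are assumptions added for the example.

```python
from enum import Enum


class Stage(Enum):
    INITIATED = "initiated"
    AUDITED = "audited"
    EXECUTED = "executed"
    VERIFIED = "verified"


# The fixed order a change must follow.
ORDER = [Stage.INITIATED, Stage.AUDITED, Stage.EXECUTED, Stage.VERIFIED]


class ChangeTicket:
    """Hypothetical change record that keeps an audit trail of every step."""

    def __init__(self, title, requester):
        self.title = title
        self.requester = requester
        self.stage = Stage.INITIATED
        self.audit_trail = [(Stage.INITIATED, requester)]

    def advance(self, actor):
        """Move to the next stage, recording who performed the step."""
        idx = ORDER.index(self.stage)
        if idx == len(ORDER) - 1:
            raise ValueError("change is already verified")
        next_stage = ORDER[idx + 1]
        # Assumed rule: the audit must come from someone other than the requester.
        if next_stage is Stage.AUDITED and actor == self.requester:
            raise PermissionError("requester cannot audit their own change")
        self.stage = next_stage
        self.audit_trail.append((next_stage, actor))


ticket = ChangeTicket("expand web pool", requester="alice")
for actor in ("bob", "workflow-bot", "alice"):
    ticket.advance(actor)
print(ticket.stage.value, len(ticket.audit_trail))  # verified 4
```

Because every transition appends to audit_trail, the same records that enforce the process also supply the audit data the article mentions.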
The Button system offers end‑to‑end code management, packaging, gray‑release, and rollback without operator intervention. A Go‑based operations platform provides one‑click actions for batch restarts, degradations, switches, and health checks, unifying disparate scripts and tools.
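The source does not say how Button sequences a gray release. One common approach, sketched here under that caveat, is to deploy in growing batches, health-check each batch, and roll back everything deployed so far if a check fails. The gray_release function, its callbacks, and the host names are hypothetical.

```python
def gray_release(hosts, deploy, healthy, rollback, batch_sizes=(1, 2)):
    """Deploy in growing batches; undo all deployments if a batch is unhealthy.

    Returns (success, hosts_on_new_version).
    """
    released = []
    remaining = list(hosts)
    sizes = list(batch_sizes)
    while remaining:
        # Use the configured batch sizes first, then release everything left.
        size = sizes.pop(0) if sizes else len(remaining)
        batch, remaining = remaining[:size], remaining[size:]
        for host in batch:
            deploy(host)
            released.append(host)
        if not all(healthy(host) for host in batch):
            for host in reversed(released):
                rollback(host)
            return False, []
    return True, released


# Toy environment: a dict stands in for the version running on each host.
versions = {h: "v1" for h in ["web01", "web02", "web03", "web04", "web05"]}


def deploy(host):
    versions[host] = "v2"


def rollback(host):
    versions[host] = "v1"


ok, on_new = gray_release(list(versions), deploy, lambda h: True, rollback)
print(ok, len(on_new))  # True 5
```

If the health check fails for any host in a batch, every host touched so far returns to the old version, which keeps the blast radius of a bad build to the first one or two batches.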
A custom job scheduler acts as a distributed crontab with a rich UI, handling periodic tasks, status reporting, and automatic retries.
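The retry and status-reporting behaviour of such a scheduler can be illustrated with a small helper. This is a sketch under assumptions, not the scheduler's actual code: run_with_retries, its parameters, and the flaky job are invented for the example, and distribution across machines is left out entirely.

```python
import time


def run_with_retries(job, max_attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Run job(); on an exception, wait and retry up to max_attempts times.

    Returns (success, history), where history holds one status record
    per attempt: (attempt_number, status, detail).
    """
    history = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = job()
        except Exception as exc:
            history.append((attempt, "failed", repr(exc)))
            if attempt < max_attempts:
                sleep(delay_seconds)
        else:
            history.append((attempt, "succeeded", result))
            return True, history
    return False, history


calls = {"n": 0}


def flaky_job():
    """Hypothetical periodic task that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "report generated"


# sleep is replaced with a no-op so the example runs instantly.
ok, history = run_with_retries(flaky_job, max_attempts=5, sleep=lambda s: None)
print(ok, [status for _, status, _ in history])
# True ['failed', 'failed', 'succeeded']
```

The per-attempt history is the kind of record a scheduler UI could display as job status.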
3.3 Configuration and Management
Dianping built a web‑based Puppet management UI to parse and standardize Puppet manifests, reducing syntax errors and improving collaboration.
The Soft Load Balancer UI parses Nginx configurations into XML for web‑based, fool‑proof editing, version control, and rollback.
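To make the idea concrete, here is a deliberately small sketch of turning Nginx-style configuration into XML. It handles only plain directives and nested blocks (no comments, quoted strings, or includes), and the element names config, block, and directive are assumptions, not the schema Dianping uses.

```python
import re
import xml.etree.ElementTree as ET

# Tokens are either one of the structural characters or a run of other text.
TOKEN = re.compile(r"[{};]|[^\s{};]+")


def nginx_to_xml(text):
    """Parse a simplified Nginx config into an ElementTree element."""
    root = ET.Element("config")
    stack = [root]   # currently open blocks, innermost last
    words = []       # words of the directive being read
    for tok in TOKEN.findall(text):
        if tok == ";":
            if words:
                item = ET.SubElement(stack[-1], "directive", name=words[0])
                item.text = " ".join(words[1:])
            words = []
        elif tok == "{":
            if not words:
                raise ValueError("block without a name")
            block = ET.SubElement(stack[-1], "block", name=words[0])
            if len(words) > 1:
                block.set("args", " ".join(words[1:]))
            stack.append(block)
            words = []
        elif tok == "}":
            if len(stack) == 1 or words:
                raise ValueError("unbalanced '}' or missing ';'")
            stack.pop()
        else:
            words.append(tok)
    if len(stack) != 1 or words:
        raise ValueError("unterminated block or directive")
    return root


sample = """
upstream web_pool {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}
"""
tree = nginx_to_xml(sample)
print(ET.tostring(tree, encoding="unicode"))
```

Rejecting unbalanced braces and unterminated directives at parse time is one way an editor built on such a structure can catch mistakes before a configuration reaches a load balancer.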
Lion, a key‑value configuration service backed by Zookeeper, stores all application settings centrally, enabling real‑time propagation of configuration changes to running services.
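Real‑time propagation in a Zookeeper‑backed service typically rests on watches: clients register interest in a key and are notified when it changes. The sketch below imitates that pattern with an in-memory store so that it runs without a Zookeeper cluster; ConfigStore, its methods, and the key name are hypothetical and are not Lion's API.

```python
class ConfigStore:
    """In-memory stand-in for a watched key-value configuration service."""

    def __init__(self):
        self._data = {}
        self._watchers = {}

    def watch(self, key, callback):
        """Register callback(key, value); fire at once if the key exists."""
        self._watchers.setdefault(key, []).append(callback)
        if key in self._data:
            callback(key, self._data[key])

    def set(self, key, value):
        """Store a value and notify watchers only when it actually changes."""
        if key in self._data and self._data[key] == value:
            return
        self._data[key] = value
        for callback in self._watchers.get(key, []):
            callback(key, value)


class Service:
    """Hypothetical running service that keeps a live copy of one setting."""

    def __init__(self, store):
        self.timeout_ms = None
        store.watch("order-service.timeout_ms", self._on_change)

    def _on_change(self, key, value):
        self.timeout_ms = value


store = ConfigStore()
store.set("order-service.timeout_ms", 500)
service = Service(store)
store.set("order-service.timeout_ms", 800)  # pushed to the service immediately
print(service.timeout_ms)  # 800
```

The service never polls: the new value arrives through the callback, which is what lets a configuration change take effect without a restart.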
3.4 Recording and Analysis
A fault‑record system logs every incident, conducts post‑mortems, and feeds insights back into platform improvements.
The DOM (Operations Quality) platform aggregates server health, application response, resource utilization, and business incident data, allowing teams to benchmark and drive performance improvements.
The Radar system classifies historical incidents, applies algorithmic strategies, and surfaces root‑cause patterns to accelerate troubleshooting.
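The article does not specify Radar's algorithms. As one very simple stand-in for the idea of mining incident history, the sketch below ranks the root causes most often recorded for a given symptom. The function names and the sample history are invented for illustration.

```python
from collections import Counter, defaultdict


def build_index(history):
    """Count how often each root cause was recorded for each symptom."""
    index = defaultdict(Counter)
    for symptom, cause in history:
        index[symptom][cause] += 1
    return index


def likely_causes(index, symptom, top=2):
    """Return the most frequently recorded causes for a symptom."""
    counts = index.get(symptom, Counter())
    return [cause for cause, _ in counts.most_common(top)]


# Invented incident history as (symptom, root cause) pairs.
history = [
    ("5xx spike", "bad deploy"),
    ("5xx spike", "bad deploy"),
    ("5xx spike", "bad deploy"),
    ("5xx spike", "db slow"),
    ("latency", "db slow"),
    ("latency", "db slow"),
]
index = build_index(history)
print(likely_causes(index, "5xx spike"))  # ['bad deploy', 'db slow']
```

A frequency ranking like this only suggests where to look first; it depends on the fault‑record system capturing a root cause for every incident.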
4. Pitfalls and Improvements
Untracked changes caused major outages, so the team introduced permission‑controlled Puppet tooling and workflow approvals.
Responsibility for incidents was often disputed between development and ops, so the team deployed DOM and Cat for deep diagnostics.
Erroneous commands impacted services, so the team migrated 90% of operations to the automated Go platform.
Incident triage was slow, so the team built Radar for rapid root‑cause visualization.
Ops engineers were overloaded, so 98% of routine tasks were shifted to self‑service platforms, freeing engineers for higher‑value work.
5. Future Directions
Dianping is exploring PaaS, containerization, and orchestration. With thousands of Docker instances, the focus is on rapid deployment (≈10 s), seamless migration (≈30 s), and intelligent policy layers for scaling, recovery, and degradation.
Overall, the journey illustrates how a small ops team can achieve platform‑level automation, self‑service, and continuous improvement through thoughtful architecture and tooling.
MaGe Linux Operations
