How Dazhong Dianping Scaled Operations: Architecture, Automation, and Lessons Learned
This article summarizes the key insights from Dazhong Dianping's operations talk, covering team organization, multi‑datacenter architecture, comprehensive monitoring, automation workflows, configuration management tools, incident analysis systems, common pitfalls, and future directions such as PaaS and Docker adoption.
Guest Introduction
Zhang Guanyu (nickname "Guan Yu"), an operations architect at Dazhong Dianping, shares the evolution of the company's operations from inception to high efficiency.
1. Operations Team Structure
Dazhong Dianping's operations are divided into four groups—Application Operations, System Operations, Operations Development, and Monitoring Operations—plus DBA and Security teams, totaling fewer than 40 members.
Application Operations: Supports online services, ensures stability, collaborates with developers, and continuously optimizes services.
Operations Development: Builds tools to improve operational efficiency and automate processes.
System Operations: Handles OS customization, IDC management, machine provisioning, and account management.
Monitoring Operations: Detects faults, notifies owners, and initiates mitigation or degradation procedures.
2. Overall Architecture
Dazhong Dianping runs a dual‑datacenter setup: A‑datacenter for production, and B‑datacenter for testing and big‑data workloads (Hadoop, log backup, disaster recovery). The infrastructure comprises roughly ten thousand physical and virtual machines.
The layered architecture includes:
Third‑party intelligent DNS + CDN at the user‑guidance layer.
F5 L4 load balancing, followed by Dengine (a custom L7 balancer) and Varnish caching before requests reach the web tier, which calls services via internal RPC.
MogileFS for distributed image storage.
High‑availability design with at least two instances for every service.
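The "at least two instances for every service" rule is easy to check mechanically. The following is a minimal sketch, assuming an inventory expressed as (service, host) pairs; the function name, the inventory format, and the host names are hypothetical and not taken from the talk.

```python
from collections import Counter


def under_replicated(instances, minimum=2):
    """Return the names of services running fewer than `minimum` instances."""
    counts = Counter(service for service, _host in instances)
    return sorted(name for name, n in counts.items() if n < minimum)


# Hypothetical inventory: (service name, host) pairs.
inventory = [
    ("web", "a-web-01"), ("web", "a-web-02"),
    ("varnish", "a-cache-01"), ("varnish", "a-cache-02"),
    ("order-rpc", "a-rpc-01"),  # only one instance: violates the rule
]

print(under_replicated(inventory))
```

A check like this could run against the machine inventory on a schedule, so that a service dropping to a single instance is flagged before it becomes an outage.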
3. Operations Systems Overview
3.1 Comprehensive Monitoring
Monitoring covers four dimensions:
Business metrics (e.g., QPS, payment rate, order creation) via cat.
Application metrics (error counts, latency, 95th percentile) via cat.
System resources (CPU, memory, swap, disk, load) via Zabbix.
Network health (packet loss, ping, traffic, TCP connections) via Zabbix and cat.
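To make the 95th-percentile latency metric concrete, here is a small sketch using the nearest-rank method. This is one common definition of a percentile and is an assumption here; the talk does not say how cat computes it, and the sample latencies are made up.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 18, 17, 19]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(p50, p95)
```

The median hides the slow request while the 95th percentile exposes it, which is why tail percentiles are tracked alongside averages.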
Key dashboards display business‑level charts, application‑level error maps, and end‑to‑end request traces, enabling rapid root‑cause identification.
3.2 Automation Workflow System
The workflow platform standardizes all online changes (e.g., scaling, deployments, memory dumps, IP blocking) into programmable processes. Users submit requests, operations review them, and the system executes automatically, sending email notifications upon completion. Over 98% of changes now flow through this platform, providing audit trails and data for continuous improvement.
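The submit, review, execute, notify sequence described above can be sketched as a small state machine with an audit trail. This is an illustrative model only: the class, state names, and callback signatures are hypothetical, and the real platform's design is not described in the talk.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    SUBMITTED = "submitted"
    APPROVED = "approved"
    REJECTED = "rejected"
    DONE = "done"


@dataclass
class ChangeRequest:
    requester: str
    action: str
    target: str
    state: State = State.SUBMITTED
    audit: list = field(default_factory=list)

    def __post_init__(self):
        # Every request starts its audit trail with the submission itself.
        self.audit.append((self.requester, State.SUBMITTED.value))

    def review(self, reviewer, approve):
        """Operations approves or rejects a submitted request."""
        if self.state is not State.SUBMITTED:
            raise RuntimeError("only submitted requests can be reviewed")
        self.state = State.APPROVED if approve else State.REJECTED
        self.audit.append((reviewer, self.state.value))

    def execute(self, runner, notify):
        """Run an approved change, record the result, notify the requester."""
        if self.state is not State.APPROVED:
            raise RuntimeError("request has not been approved")
        result = runner(self.action, self.target)
        self.state = State.DONE
        self.audit.append(("system", f"executed: {result}"))
        notify(self.requester, f"{self.action} on {self.target}: {result}")


# Usage with stand-in callbacks; a real runner would perform the change
# and a real notifier would send the email.
sent = []
req = ChangeRequest("dev-alice", "scale_out", "order-service")
req.review("ops-bob", approve=True)
req.execute(lambda action, target: "ok", lambda to, msg: sent.append((to, msg)))
print(req.state, req.audit, sent)
```

Because execution refuses to run unless the request is in the approved state, every change that reaches production has a reviewer recorded in its audit trail.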
3.3 Configuration and Management
A web‑based Puppet management tool parses Puppet syntax, enforces naming conventions, and presents modules as reusable method sets. A soft‑load‑balancer UI translates Nginx configuration into XML for web management, offering version control, rollback, and safe editing.
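Version control, rollback, and safe editing can be illustrated with a tiny versioned store. The sketch below is an assumption about how such a feature might work, not the actual tool: the class is hypothetical, and the brace-balance check is a deliberately simple stand-in for real Nginx configuration validation.

```python
class VersionedConfig:
    """Keeps every saved revision; rollback re-saves an old revision as new."""

    def __init__(self, initial):
        self._history = [initial]

    @property
    def current(self):
        return self._history[-1]

    @property
    def version(self):
        return len(self._history)  # versions are numbered from 1

    def save(self, text, validate):
        """Store a new revision only if it passes validation."""
        if not validate(text):
            raise ValueError("configuration failed validation")
        self._history.append(text)
        return self.version

    def rollback(self, version):
        """Restore an earlier revision without discarding later history."""
        if not 1 <= version <= len(self._history):
            raise ValueError("unknown version")
        self._history.append(self._history[version - 1])
        return self.version


def balanced_braces(text):
    """Toy validator: every '{' must be closed by a later '}'."""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


cfg = VersionedConfig("upstream web { server 10.0.0.1; }")
cfg.save("upstream web { server 10.0.0.1; server 10.0.0.2; }", balanced_braces)
cfg.rollback(1)  # becomes version 3, with the content of version 1
print(cfg.version, cfg.current)
```

Recording a rollback as a new revision, rather than deleting later ones, keeps the full change history available for audit.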
The Lion configuration system stores all application settings as key/value pairs in Zookeeper, propagating changes to running services in real time.
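The real-time propagation idea can be shown with an in-memory stand-in for the key/value store. This sketch does not use Zookeeper or Lion's actual API; the class, method names, and the example key are hypothetical, and only the watch-and-notify pattern is illustrated.

```python
class ConfigStore:
    """In-memory key/value store that notifies watchers on every change."""

    def __init__(self):
        self._values = {}
        self._watchers = {}

    def watch(self, key, callback):
        """Register a callback; deliver the current value immediately if set."""
        self._watchers.setdefault(key, []).append(callback)
        if key in self._values:
            callback(self._values[key])

    def set(self, key, value):
        """Update a value and push it to every watcher of that key."""
        self._values[key] = value
        for callback in self._watchers.get(key, []):
            callback(value)


# A running service would apply each pushed value without restarting.
store = ConfigStore()
seen = []
store.set("order-service.timeout_ms", "500")
store.watch("order-service.timeout_ms", seen.append)
store.set("order-service.timeout_ms", "800")
store.set("other-service.flag", "on")  # no watcher, nothing delivered
print(seen)
```

Pushing changes through callbacks is what lets a setting take effect in running processes, instead of waiting for each service to poll or redeploy.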
3.4 Record and Analysis
Incident records are captured in a fault‑analysis system, reviewed regularly, and fed into a DOM quality‑management platform that aggregates server health, response metrics, resource utilization, and business‑level incidents. A radar system under development will classify and prioritize faults using contextual algorithms.
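As a rough illustration of classifying and prioritizing faults, the sketch below ranks faults by the monitoring layer they were detected in and by how many services they affect. The weights, the scoring formula, and the sample faults are all assumptions made for the example; the talk does not describe the radar system's actual algorithms.

```python
from dataclasses import dataclass

# Hypothetical weights: faults visible at the business layer rank highest.
LAYER_WEIGHT = {"business": 3, "application": 2, "system": 1}


@dataclass(frozen=True)
class Fault:
    name: str
    layer: str              # one of the keys in LAYER_WEIGHT
    affected_services: int  # how many other services are impacted


def priority(fault):
    """Higher score means the fault should be handled sooner."""
    return LAYER_WEIGHT[fault.layer] * (1 + fault.affected_services)


def prioritize(faults):
    """Return faults ordered from most to least urgent."""
    return sorted(faults, key=priority, reverse=True)


faults = [
    Fault("disk full on a-log-03", "system", 0),
    Fault("order-service error spike", "application", 4),
    Fault("payment rate drop", "business", 1),
]

ranked = [f.name for f in prioritize(faults)]
print(ranked)
```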
4. Pitfalls and Improvements
Untracked changes caused major outages; introduced permission‑controlled tools and workflow approvals.
Ambiguous blame between dev and ops; deployed DOM and cat for deep diagnostics.
Erroneous commands affected the entire production environment; an automated Go‑based platform reduced manual intervention.
Slow fault localization; radar system provides instant context‑aware fault mapping.
Ops workload imbalance; self‑service tools freed ops to focus on platform refinement and quality monitoring.
5. Future Focus
Upcoming initiatives include PaaS development, extensive Docker adoption (thousands of containers with sub‑10‑second deployment and sub‑30‑second migration), and advanced strategy layers for rapid scaling, migration, recovery, and intelligent policy enforcement.
Conclusion
Dazhong Dianping's operations have transitioned from manual, root‑login scripts to a highly automated, platform‑driven ecosystem that emphasizes standardization, auditability, and developer self‑service.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.