How Dazhong Dianping Scaled Operations: Architecture, Automation, and Lessons Learned

This article summarizes the key insights from Dazhong Dianping's operations talk, covering team organization, multi‑datacenter architecture, comprehensive monitoring, automation workflows, configuration management tools, incident analysis systems, common pitfalls, and future directions such as PaaS and Docker adoption.

Efficient Ops

Guest Introduction

Zhang Guanyu (nickname "Guan Yu"), an operations architect at Dazhong Dianping, shares the evolution of the company's operations from inception to high efficiency.

1. Operations Team Structure

Dazhong Dianping's operations are divided into four groups—Application Operations, System Operations, Operations Development, and Monitoring Operations—plus DBA and Security teams, totaling fewer than 40 members.

Application Operations: Supports online services, ensures stability, collaborates with developers, and continuously optimizes services.

Operations Development: Builds tools to improve operational efficiency and automate processes.

System Operations: Handles OS customization, IDC management, machine provisioning, and account management.

Monitoring Operations: Detects faults, notifies owners, and initiates mitigation or degradation procedures.

2. Overall Architecture

Dazhong Dianping runs a dual‑datacenter setup: A‑datacenter for production, B‑datacenter for testing and big‑data workloads (Hadoop, log backup, disaster‑recovery). The infrastructure comprises roughly ten thousand physical and virtual machines.

The layered architecture includes:

Third‑party intelligent DNS and CDN at the entry layer, which steers users to the nearest or healthiest endpoint.

F5 L4 load balancing, followed by Dengine (a custom L7 balancer) and Varnish caching before requests reach the web tier, which calls services via internal RPC.

MogileFS for distributed image storage.

High‑availability design with at least two instances for every service.
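The "at least two instances for every service" rule is easy to check mechanically. The following is a minimal sketch, assuming a flat inventory of (service, host) pairs; the service names, host names, and function name are hypothetical and not taken from the talk.

```python
from collections import Counter

def services_below_min_instances(instances, expected_services, min_instances=2):
    """Return the expected services that run on fewer than min_instances hosts."""
    counts = Counter(service for service, _host in instances)
    # Counter returns 0 for services that have no instance at all.
    return sorted(s for s in expected_services if counts[s] < min_instances)

# Hypothetical inventory: (service name, host) pairs.
inventory = [
    ("web", "host-a1"),
    ("web", "host-a2"),
    ("varnish", "host-a3"),
    ("varnish", "host-a4"),
    ("image-store", "host-a5"),
]

print(services_below_min_instances(inventory, ["web", "varnish", "image-store", "rpc"]))
# prints ['image-store', 'rpc']
```

Passing the list of expected services separately matters: a service with zero instances never appears in the inventory, so it would otherwise go unnoticed.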

3. Operations Systems Overview

3.1 Comprehensive Monitoring

Monitoring covers four dimensions:

Business metrics (e.g., QPS, payment rate, order creation) via cat.

Application metrics (error counts and latency, including the 95th percentile) via cat.

System resources (CPU, memory, swap, disk, load) via Zabbix.

Network health (packet loss, ping, traffic, TCP connections) via Zabbix and cat.

Key dashboards display business‑level charts, application‑level error maps, and end‑to‑end request traces, enabling rapid root‑cause identification.
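To illustrate why the 95th percentile is tracked alongside plain latency, here is a small sketch using the nearest-rank definition of a percentile. The sample values are invented, and this is not a description of how cat computes its figures.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Invented latency samples in milliseconds; one request is very slow.
latencies_ms = [12, 15, 11, 14, 13, 18, 16, 17, 19, 250]

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms)                       # 38.5
print(percentile(latencies_ms, 50))  # 15
print(percentile(latencies_ms, 95))  # 250
```

The median says a typical request is fast and the mean is only mildly elevated, while the 95th percentile exposes the slow tail that users actually notice.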

3.2 Automation Workflow System

The workflow platform standardizes all online changes (e.g., scaling, deployments, memory dumps, IP blocking) into programmable processes. Users submit requests, operations review them, and the system executes automatically, sending email notifications upon completion. Over 98% of changes now flow through this platform, providing audit trails and data for continuous improvement.
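The submit, review, execute, notify cycle described above can be sketched as a small state machine with an audit trail. This is an illustrative model only: the class, state names, handler registry, and the notify callback (standing in for the email step) are all assumptions, not the platform's real design.

```python
from dataclasses import dataclass, field
from enum import Enum

class State(Enum):
    SUBMITTED = "submitted"
    APPROVED = "approved"
    REJECTED = "rejected"
    DONE = "done"
    FAILED = "failed"

@dataclass
class ChangeRequest:
    requester: str
    action: str   # e.g. "scale_out" (hypothetical action name)
    target: str
    state: State = State.SUBMITTED
    audit: list = field(default_factory=list)

    def __post_init__(self):
        # Every request starts its audit trail with the submission.
        self.audit.append((self.requester, State.SUBMITTED.value))

def review(req, reviewer, approve):
    """Operations approves or rejects a submitted request."""
    if req.state is not State.SUBMITTED:
        raise ValueError("only submitted requests can be reviewed")
    req.state = State.APPROVED if approve else State.REJECTED
    req.audit.append((reviewer, req.state.value))

def execute(req, handlers, notify):
    """Run the approved change automatically, record the outcome, then notify."""
    if req.state is not State.APPROVED:
        raise ValueError("request must be approved before execution")
    try:
        handlers[req.action](req.target)
        req.state = State.DONE
    except Exception as exc:
        req.state = State.FAILED
        req.audit.append(("system", f"error: {exc!r}"))
    req.audit.append(("system", req.state.value))
    notify(req)

# Usage with stand-in handlers and an in-memory "mailbox".
executed, sent = [], []
handlers = {"scale_out": executed.append}
req = ChangeRequest("dev-alice", "scale_out", "web-tier")
review(req, "ops-bob", approve=True)
execute(req, handlers, lambda r: sent.append(f"{r.action} on {r.target}: {r.state.value}"))
print(req.audit)
```

Because every transition appends to the audit list, the same records that enforce the approval step also supply the change history used for later review.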

3.3 Configuration and Management

A web‑based Puppet management tool parses Puppet syntax, enforces naming conventions, and presents modules as reusable method sets. A soft‑load‑balancer UI translates Nginx configuration into XML for web management, offering version control, rollback, and safe editing.
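One way to picture the soft-load-balancer UI is to treat the XML as the managed form and render Nginx text from it, keeping every saved version so a bad edit can be rolled back. The sketch below assumes a hypothetical XML schema (`<upstream>` with `<server>` children) and hypothetical class and function names; the real tool's format is not described in the talk.

```python
import xml.etree.ElementTree as ET

def render_upstream(xml_text):
    """Render a hypothetical <upstream> XML element as an Nginx upstream block."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    lines = [f"upstream {root.get('name')} {{"]
    for server in root.findall("server"):
        weight = server.get("weight")
        suffix = f" weight={weight}" if weight else ""
        lines.append(f"    server {server.get('address')}{suffix};")
    lines.append("}")
    return "\n".join(lines)

class VersionedConfig:
    """Keeps every accepted XML version so the latest change can be rolled back."""

    def __init__(self, initial_xml):
        render_upstream(initial_xml)  # validate before accepting
        self._history = [initial_xml]

    @property
    def current(self):
        return self._history[-1]

    def update(self, new_xml):
        render_upstream(new_xml)  # "safe editing": reject XML that does not parse
        self._history.append(new_xml)

    def rollback(self):
        if len(self._history) > 1:
            self._history.pop()
        return self.current

v1 = '<upstream name="web_pool"><server address="10.0.0.1:8080" weight="2"/></upstream>'
v2 = ('<upstream name="web_pool"><server address="10.0.0.1:8080" weight="2"/>'
      '<server address="10.0.0.2:8080"/></upstream>')

config = VersionedConfig(v1)
config.update(v2)
print(render_upstream(config.current))
config.rollback()
print(render_upstream(config.current))
```

Validating on every update means a malformed edit never becomes the current version, which is the property a web editor needs before it pushes anything to a live balancer.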

The Lion configuration system stores all application settings as key/value pairs in Zookeeper, propagating changes to running services in real time.
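The real-time propagation idea can be shown without a Zookeeper cluster. The sketch below is an in-memory stand-in: `ConfigStore`, its methods, and the key name are hypothetical, and the callbacks play the role that Zookeeper watches would play in a real deployment.

```python
class ConfigStore:
    """In-memory stand-in for a key/value config service with change notifications."""

    def __init__(self):
        self._data = {}
        self._watchers = {}

    def watch(self, key, callback):
        """Register callback(key, value); fire once immediately if the key already exists."""
        self._watchers.setdefault(key, []).append(callback)
        if key in self._data:
            callback(key, self._data[key])

    def set(self, key, value):
        """Store a value and notify watchers, skipping no-op writes."""
        if key in self._data and self._data[key] == value:
            return
        self._data[key] = value
        for callback in self._watchers.get(key, []):
            callback(key, value)

# A running service keeps a local copy that is refreshed on every change.
local_settings = {}
changes = []

def on_change(key, value):
    local_settings[key] = value
    changes.append((key, value))

store = ConfigStore()
store.watch("search.timeout_ms", on_change)
store.set("search.timeout_ms", "200")
store.set("search.timeout_ms", "200")  # unchanged value: no notification
store.set("search.timeout_ms", "350")
print(local_settings)  # {'search.timeout_ms': '350'}
```

The service never polls; it reacts only when a value actually changes, which is what lets a setting take effect without a restart or redeploy.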

3.4 Record and Analysis

Incident records are captured in a fault‑analysis system, reviewed regularly, and fed into a DOM quality‑management platform that aggregates server health, response metrics, resource utilization, and business‑level incidents. A radar system under development will classify and prioritize faults using contextual algorithms.
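A regular incident review usually starts from an aggregate view of the records. The sketch below groups incidents by category and ranks categories by total impact; the record fields (`category`, `minutes`) and the sample data are invented for illustration and do not describe the fault-analysis system or the radar system's algorithms.

```python
from collections import defaultdict

def summarize_incidents(incidents):
    """Return (category, incident count, total minutes) rows, largest total impact first."""
    totals = defaultdict(lambda: [0, 0])
    for incident in incidents:
        entry = totals[incident["category"]]
        entry[0] += 1
        entry[1] += incident["minutes"]
    rows = [(category, count, minutes) for category, (count, minutes) in totals.items()]
    # Sort by total minutes descending, then by name for a stable order.
    return sorted(rows, key=lambda row: (-row[2], row[0]))

# Invented incident records.
incidents = [
    {"category": "change", "minutes": 30},
    {"category": "hardware", "minutes": 10},
    {"category": "change", "minutes": 25},
    {"category": "capacity", "minutes": 10},
]

for row in summarize_incidents(incidents):
    print(row)
```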

4. Pitfalls and Improvements

Untracked changes caused major outages; introduced permission‑controlled tools and workflow approvals.

Ambiguous blame between dev and ops; deployed DOM and cat for deep diagnostics.

Erroneous commands affected the entire production environment; the automated Go‑based platform reduced manual intervention.

Slow fault localization; radar system provides instant context‑aware fault mapping.

Ops workload imbalance; self‑service tools freed ops to focus on platform refinement and quality monitoring.

5. Future Focus

Upcoming initiatives include PaaS development, extensive Docker adoption (thousands of containers with sub‑10‑second deployment and sub‑30‑second migration), and advanced strategy layers for rapid scaling, migration, recovery, and intelligent policy enforcement.

Conclusion

Dazhong Dianping's operations have transitioned from manual, root‑login scripts to a highly automated, platform‑driven ecosystem that emphasizes standardization, auditability, and developer self‑service.
