Dianping Operations Architecture Overview and Best Practices
This article presents a comprehensive overview of Dianping's operations architecture, detailing team organization, multi‑data‑center infrastructure, monitoring layers, automation tools, configuration management systems, incident analysis, lessons learned, and future directions such as Docker and PaaS adoption.
The author, a senior operations architect at Dianping, introduces the company's operations organization. It consists of four core groups (application operations, system operations, operations development, and monitoring operations) plus DBA and security teams, totaling around 40 engineers.
Dianping operates a dual‑data‑center setup: the primary A‑center runs production services, while the B‑center hosts testing environments and big‑data workloads. Nearly ten thousand physical and virtual machines are managed in a multi‑tiered architecture: third‑party DNS and CDN at the edge, then F5 and Dengine load balancers, Varnish caching, and internal RPC services.
Monitoring is performed across four dimensions: business metrics (e.g., transaction rates), application metrics (error counts, latency), system resources (CPU, memory, disk, processes), and network health (packet loss, latency), using tools such as Cat, Zabbix, and custom dashboards.
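To make the four-layer split concrete, here is a minimal sketch of threshold rules tagged by layer. It is not how Cat or Zabbix are configured; the metric names, thresholds, and the `evaluate` helper are all hypothetical and chosen only to show one rule per dimension.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    BUSINESS = "business"
    APPLICATION = "application"
    SYSTEM = "system"
    NETWORK = "network"


@dataclass(frozen=True)
class Rule:
    layer: Layer
    metric: str
    threshold: float
    alert_when_below: bool = False  # rates such as orders/min alert on a drop


# Hypothetical rules: names and thresholds are illustrative assumptions.
RULES = [
    Rule(Layer.BUSINESS, "orders_per_min", 100.0, alert_when_below=True),
    Rule(Layer.APPLICATION, "error_count", 50.0),
    Rule(Layer.SYSTEM, "cpu_percent", 90.0),
    Rule(Layer.NETWORK, "packet_loss_percent", 1.0),
]


def evaluate(samples: dict[str, float], rules: list[Rule] = RULES) -> list[tuple[Layer, str, float]]:
    """Return (layer, metric, value) for every rule whose threshold is breached."""
    alerts = []
    for rule in rules:
        value = samples.get(rule.metric)
        if value is None:
            continue  # no sample for this metric in this round
        if rule.alert_when_below:
            breached = value < rule.threshold
        else:
            breached = value > rule.threshold
        if breached:
            alerts.append((rule.layer, rule.metric, value))
    return alerts
```

Tagging each alert with its layer is what lets an on-call engineer tell at a glance whether a drop in a business metric coincides with an application, system, or network problem.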
The operations team has built several automation platforms: a full‑stack monitoring system, an automated tool suite for repetitive tasks, a workflow engine that standardizes change processes, a Button system for code packaging and deployment, a Go‑based platform for one‑click operations, and a distributed job scheduler resembling a crontab with web UI.
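The idea of a workflow engine that standardizes changes can be sketched as a small state machine. This is an assumption about the general technique, not a description of Dianping's engine; the state names, the `TRANSITIONS` table, and the `ChangeRequest` class are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical change-request states and the moves allowed between them.
TRANSITIONS = {
    "draft": {"submitted"},
    "submitted": {"approved", "rejected"},
    "approved": {"executing"},
    "executing": {"done", "rolled_back"},
}


@dataclass
class ChangeRequest:
    title: str
    owner: str
    state: str = "draft"
    history: list = field(default_factory=list)

    def advance(self, new_state: str, actor: str) -> None:
        """Move to new_state if allowed, recording who made the move."""
        allowed = TRANSITIONS.get(self.state, set())
        if new_state not in allowed:
            raise ValueError(f"cannot move from {self.state} to {new_state}")
        self.history.append((self.state, new_state, actor))
        self.state = new_state
```

Because every transition is validated and appended to `history`, a change cannot skip approval, and there is always a record of who moved it and in what order.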
Configuration management is handled through a web‑based Puppet parser, a soft‑load‑balancer manager that converts Nginx configs to XML, and the Lion system that stores key/value application settings in ZooKeeper, enabling real‑time updates across services.
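The real-time update behavior described for Lion can be illustrated with a watch-and-notify pattern. The sketch below uses an in-memory dictionary as a stand-in for ZooKeeper so that it runs without a cluster; the `ConfigStore` class, its methods, and the key name in the usage note are hypothetical and are not Lion's actual API.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable, Optional

Callback = Callable[[str, str], None]


class ConfigStore:
    """In-memory stand-in for a ZooKeeper-backed key/value config store."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}
        self._watchers: dict[str, list[Callback]] = defaultdict(list)

    def watch(self, key: str, callback: Callback) -> None:
        """Register a callback; fire it at once if the key already has a value."""
        self._watchers[key].append(callback)
        if key in self._data:
            callback(key, self._data[key])

    def set(self, key: str, value: str) -> None:
        """Store a value and notify watchers only when it actually changes."""
        if self._data.get(key) == value:
            return
        self._data[key] = value
        for callback in self._watchers[key]:
            callback(key, value)

    def get(self, key: str, default: Optional[str] = None) -> Optional[str]:
        return self._data.get(key, default)
```

A service would call `watch("order-service.timeout_ms", on_change)` at startup and apply new values inside `on_change`, so that an operator's edit takes effect without a restart.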
Incident recording and analysis are centralized in a fault‑tracking system that logs root causes and remediation steps, while the DOM platform aggregates performance and reliability metrics for cross‑team benchmarking.
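As a rough illustration of how centralized fault records support cross-team comparison, the sketch below counts incidents by root cause and sums downtime per team. The `Incident` fields, the root-cause labels, and the `summarize` function are hypothetical; the article does not describe the actual schema of the fault-tracking system or the DOM platform.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    team: str
    root_cause: str        # hypothetical label, e.g. "manual_command"
    downtime_minutes: int
    remediation: str


def summarize(incidents):
    """Return (incident count per root cause, total downtime minutes per team)."""
    by_cause = Counter(i.root_cause for i in incidents)
    downtime_by_team = defaultdict(int)
    for incident in incidents:
        downtime_by_team[incident.team] += incident.downtime_minutes
    return by_cause, dict(downtime_by_team)
```

Counting by root cause shows which class of failure deserves tooling first, while per-team downtime gives a simple basis for benchmarking.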
The article also shares common pitfalls encountered—such as undocumented changes, ambiguous ownership of failures, and manual command errors—and explains how the introduced tools mitigate these issues.
Looking ahead, Dianping is focusing on PaaS technologies, extensive Docker deployment (thousands of containers), rapid provisioning, migration, and intelligent strategy layers to further streamline operations.
In conclusion, the author emphasizes the shift from manual, inefficient operations to a platform‑centric, automated, and self‑service model that aligns closely with development teams.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
