How We Built a Three‑Layer Stability System for Massive Scale Operations
This article details the operational mindset, stability framework, and transformation journey of the Zuoyebang infrastructure team, covering service lifecycle management, standardization, cloud‑native architecture, multi‑active deployment, incident pre‑plan platforms, traffic scheduling, monitoring, capacity planning, and future directions for SRE service‑orientation.
Key Points
Operational mindset: Define the object being operated, consider the full service lifecycle, split it into stages, extract the key points of each stage, build operational scenarios around them, and consolidate those scenarios into an ops platform.
Stability system: Adopt a "three‑in‑one" approach that prioritizes fault prevention over emergency response. Extend MTBF through standardization, container deployment, service governance, regulated changes, multi‑active architecture, and fault control; when prevention fails, apply fault‑handling methods to minimize MTTR.
Transformation exploration: Cloud‑native heavily affects ops roles. The response is to improve role awareness, refine capability models, make ops service‑oriented, redefine what an ops engineer is, adopt open source deeply, and pursue common operational practices.
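The MTBF/MTTR framing in the stability point above can be made concrete with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The following is a minimal sketch; the hour values are hypothetical and are not taken from Zuoyebang's data.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, for illustration only.
baseline = availability(mtbf_hours=500, mttr_hours=1.0)
longer_mtbf = availability(mtbf_hours=1000, mttr_hours=1.0)   # prevention: failures half as often
shorter_mttr = availability(mtbf_hours=500, mttr_hours=0.5)   # response: recovery twice as fast

print(f"baseline={baseline:.6f} longer_mtbf={longer_mtbf:.6f} shorter_mttr={shorter_mttr:.6f}")
```

In this formula, doubling MTBF and halving MTTR yield the same availability. The preference for prevention is therefore an operational judgment (fewer incidents to handle at all) rather than something the formula alone implies.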
Background
Zuoyebang’s business spans tools, e‑commerce, live streaming, and smart hardware, supporting thousands of services, hundreds of namespaces, thousands of domains, and tens of thousands of nodes. The tech stack includes nearly ten languages, primarily Go and PHP. Two technical foundations support this scale: a cloud‑native architecture based on containers and service governance, and a multi‑cloud, multi‑active operational architecture built on dedicated lines.
Team division: Operations are organized by IaaS, PaaS, and SaaS layers. IaaS provides compute, network, and storage; PaaS is split into middleware (data storage, messaging), cloud‑native platform (containers, service governance), and security services. Corresponding operational roles include CloudOps/NetOps/FinOps for IaaS, MiddleOps for middleware, InfOps for cloud‑native, SecOps for security, and SRE for business applications.
Operational Practice
Google defines five service lifecycle stages: idea, architecture design, development, gradual/full rollout, and deprecation. From an ops perspective, each stage maps to an activity: standardization begins at architecture design, service admission happens during development, continuous delivery and continuous operation cover rollout, and service retirement happens at deprecation.
Standardization
1. Pre‑standardization: Enforce operational and architectural standards to achieve common ops, unified architecture, and out‑of‑the‑box solutions, preventing large‑scale stability issues.
2. Architecture reshaping: Embrace cloud‑native principles, center on business services, use containers to abstract resource differences, and service governance to smooth multi‑cloud disparities.
Operational focus
1. Ops‑centric: Consolidate services, split domains from shared public domains to independent ones for better traffic scheduling and container migration; define deployment environments (Tips, Small, Online) with consistent code, config, routing, and environment.
2. Architecture‑centric: Unify service discovery (SVC for east‑west, DNS for north‑south), migrate from BNS/ZNS to SVC, standardize PHP to ODP and Go to ZGIN frameworks, consolidate components (e.g., replace MemCache, NMQ), adopt unified communication protocols (HTTP, TCP).
Service Admission
Make service onboarding a multiple‑choice process rather than a fill‑in‑the‑blank one: teams pick from standardized options, so every admitted service can use ready‑made operational support.
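The multiple-choice idea can be sketched as an admission form whose fields only accept predefined values. The option sets below reuse the frameworks, protocols, and environments named earlier in this article; the class names, field names, and the `parse_admission` function are hypothetical and do not describe Zuoyebang's actual platform.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Framework(Enum):
    ODP = "odp"    # standard PHP framework named in the article
    ZGIN = "zgin"  # standard Go framework named in the article


class Protocol(Enum):
    HTTP = "http"
    TCP = "tcp"


class Environment(Enum):
    TIPS = "tips"
    SMALL = "small"
    ONLINE = "online"


@dataclass(frozen=True)
class AdmissionRequest:
    service_name: str
    framework: Framework
    protocol: Protocol
    environments: Tuple[Environment, ...]


def parse_admission(form: dict) -> AdmissionRequest:
    """Accept only predefined choices; any free-form value is rejected."""
    try:
        return AdmissionRequest(
            service_name=form["service_name"],
            framework=Framework(form["framework"]),
            protocol=Protocol(form["protocol"]),
            environments=tuple(Environment(e) for e in form["environments"]),
        )
    except (KeyError, ValueError) as exc:
        raise ValueError(f"admission rejected: {exc}") from exc


request = parse_admission({
    "service_name": "demo-service",  # hypothetical service
    "framework": "zgin",
    "protocol": "http",
    "environments": ["tips", "online"],
})
print(request)
```

Because every field is an enumeration, a service that asks for a non-standard framework or protocol fails at admission time instead of surfacing later as an operational exception.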
Continuous Delivery
Adopt three principles: embrace proven methodologies; restructure change responsibilities so that business‑side changes stay with the business and ops‑side changes stay with ops; and build delivery into a platform that codifies SOPs, so that SRE actions are low‑cost and any SRE can perform them.
Continuous Operation
Stability is built on three pillars: business quality, architecture unification, and ops control, each with dedicated services. Business iteration requires solid service governance; stable iteration needs disciplined ops control; ops must monitor architecture to prevent fragmentation, while architecture must consider ops impact.
Multi‑Active Architecture
Implement a multi‑active platform, collect multi‑dimensional operational data, and automate operations to reduce manual effort while keeping consumer‑facing (ToC) traffic safe.
Pre‑plan Platform
Define a pre‑plan as a highly codified set of complex operations, abstracted into a one‑click "blind switch" capability. Keep business logic out of pre‑plans, and keep the platform itself highly stable through multi‑master storage and the absence of circular dependencies.
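The "no circular dependencies" requirement can be checked mechanically. Below is a minimal sketch of depth-first cycle detection over a dependency map; the component names and the `find_cycle` helper are hypothetical, and the article does not say how the real platform verifies this property.

```python
from typing import Dict, List, Optional

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, fully explored


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if the graph is acyclic."""
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for nxt in deps.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # nxt is already on the current path, so the path loops back to it.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical components: the pre-plan platform must not depend, even
# indirectly, on something that depends on the platform itself.
bad = {
    "preplan-platform": ["config-store"],
    "config-store": ["gateway"],
    "gateway": ["preplan-platform"],
}
good = {"preplan-platform": ["config-store"], "config-store": []}

print(find_cycle(bad))
print(find_cycle(good))
```

A check like this could run whenever a dependency of the platform changes, so that a loop is caught before an incident makes the platform unreachable.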
Traffic Scheduling
Ops handles global north‑south traffic, architecture handles east‑west long‑tail traffic, and the business handles feature‑level traffic. Scheduling is performed at the domain level, using either DOH weight‑based precise routing or DNS line‑based coarse routing.
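Weight-based routing at the domain level can be illustrated with a small weighted picker: each resolution returns a cluster with probability proportional to its weight, and setting a weight to zero drains that cluster. This is a sketch under stated assumptions; the cluster names, weights, and the `pick_cluster` function are hypothetical and do not describe the actual resolver.

```python
import random
from collections import Counter
from typing import Dict


def pick_cluster(weights: Dict[str, int], rng: random.Random) -> str:
    """Pick a cluster with probability proportional to its weight."""
    candidates = [(name, w) for name, w in weights.items() if w > 0]
    if not candidates:
        raise ValueError("no cluster has a positive weight")
    names, ws = zip(*candidates)
    return rng.choices(names, weights=ws, k=1)[0]


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical 80/20 split of one domain across two clouds.
split = {"cloud-a": 80, "cloud-b": 20}
counts = Counter(pick_cluster(split, rng) for _ in range(10_000))
share_a = counts["cloud-a"] / 10_000

# Failover: set cloud-a to zero and all traffic for the domain moves to cloud-b.
drained = {"cloud-a": 0, "cloud-b": 100}
after_switch = {pick_cluster(drained, rng) for _ in range(1_000)}

print(f"cloud-a share ~ {share_a:.2f}; after drain: {after_switch}")
```

Line-based DNS routing, by contrast, can only steer traffic per carrier line or region, which is why the article calls it coarse.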
Monitoring & Fault Localization
Combine monitoring alerts (ops layer) with service observability (service layer) by standardizing logs (four scenarios, eight log types), collecting basic, service, and business metrics, and registering trace IDs via Mesh for low‑cost, out‑of‑the‑box integration.
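A standardized log line that carries a trace ID might look like the sketch below, which uses Python's standard `logging` module. The field names (`trace_id`, `log_type`) and the `JsonTraceFormatter` class are assumptions for illustration. In the setup described above the mesh registers trace IDs, whereas here the ID is passed in explicitly.

```python
import json
import logging


class JsonTraceFormatter(logging.Formatter):
    """Render every record as one JSON object with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Values passed via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "log_type": getattr(record, "log_type", "app"),
        }
        return json.dumps(payload, sort_keys=True)


handler = logging.StreamHandler()
handler.setFormatter(JsonTraceFormatter())
logger = logging.getLogger("demo-service")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The trace ID value is made up for the example.
logger.info("order created", extra={"trace_id": "4bf92f35", "log_type": "access"})
```

With a fixed schema, the logging center can index on `trace_id` and join a log line to its trace without per-service parsing rules.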
Build a logging center using a custom tiered storage solution instead of costly ELK, adopt Jaeger for tracing, and use Prometheus‑VictoriaMetrics with Grafana and OpenFalcon for monitoring.
Create dashboards that allow drilling from high‑level alerts to service details, mesh inbound/outbound calls, failure traces, capacity, and change events.
Capacity Monitoring
Monitor CPU/GPU, container host, cloud VM, subnet IP, dedicated line, and VPN capacities.
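As one example of the capacity checks listed above, subnet IP usage can be computed with the standard `ipaddress` module. This is a minimal sketch: the CIDR, the allocation counts, the 80% threshold, and the assumption of two reserved addresses per subnet are all hypothetical (cloud providers often reserve more).

```python
import ipaddress


def subnet_usage(cidr: str, allocated: int, reserved: int = 2) -> float:
    """Fraction of usable addresses in the subnet that are already allocated."""
    network = ipaddress.ip_network(cidr)
    usable = network.num_addresses - reserved
    if usable <= 0:
        raise ValueError(f"{cidr} has no usable addresses")
    if not 0 <= allocated <= usable:
        raise ValueError("allocated count is outside the usable range")
    return allocated / usable


def needs_alert(cidr: str, allocated: int, threshold: float = 0.8) -> bool:
    """True when usage reaches the (assumed) alert threshold."""
    return subnet_usage(cidr, allocated) >= threshold


# A /24 has 256 addresses, so 254 are usable under the two-reserved assumption.
print(subnet_usage("10.0.0.0/24", 127))
print(needs_alert("10.0.0.0/24", 229))
```

The same shape (used divided by usable, compared with a threshold) applies to the other resources in the list, such as dedicated-line bandwidth.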
Other Operations
Construct a global service alarm dashboard for first‑line fault detection, enable quick queries of unresolved recent alerts, monitor critical permission changes to prevent over‑privilege, and enforce change governance based on five core rules.
Environmental Changes Driving Transformation
Domain change: Cloud‑native technologies have matured.
Internal change: The tech team pragmatically reshapes architecture following cloud‑native principles.
Industry change: Internet growth slows, giving businesses time to refine core capabilities.
Organizational change: Strong leadership promotes technical belief and responsibility culture.
These shifts force SREs to rethink roles, adopt engineering‑oriented ops, and shift from passive execution to proactive, data‑driven, automated service management.
Future Outlook
Capability upgrade: Continuously enhance technical skills to support cloud‑native demands.
Ops service‑orientation: Deepen the service‑oriented ops model to maximize business empowerment and sustain team growth.
