How to Master Multi‑Cloud Operations: Lessons from a Gaming Company’s Hybrid Architecture
This talk shares a senior director’s experience building a hybrid multi‑cloud infrastructure for a game company, covering stability, efficiency, cost challenges, design‑for‑failure principles, standardization, resource automation, and the cultural and organizational factors that affect successful cloud operations.
1. Speaker Introduction and Current Situation
The speaker is a senior director at Shanghai Kaier Network Technology Co. with over ten years of experience in operations and technical support. That experience spans video streaming, online transaction systems, and game platforms, including IaaS and PaaS operations at massive scale.
Kaier Network is a ten‑year‑old game company with a hybrid infrastructure: a mix of private IDC and multiple public clouds forming an elastic computing platform.
2. Problems
The hybrid environment carries historical baggage: five physical IDC locations, six public clouds, and seven CDN providers. The result is no clear role for each platform, high management cost, and operational disorder.
Key issues include:
Stability: major cloud outages and IDC power failures cause long‑lasting service interruptions.
Efficiency: the complex underlying environment hampers rapid deployment and release.
Cost: fragmented resources increase waste and weaken the company's position when negotiating prices with providers.
Additional concerns include inconsistent configuration management, unreliable permission revocation, and difficulty tracking who created a resource, how it is used, and where its cost should be allocated.
3. Multi‑Cloud Approach
Three guiding principles:
Design for failure – assume any cloud can crash and build resilience.
Simplicity – reduce the number of clouds and IDC locations to what is truly needed.
Standardization – enable automation, platformization, and data‑driven AIOps.
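As a minimal illustration of the design-for-failure principle, the sketch below walks a preference-ordered list of providers and returns the first one that passes a health check. The provider names, the `status` table, and the `pick_provider` helper are assumptions made for this example; the talk does not describe how failover is actually implemented.

```python
from typing import Callable, Iterable, Optional


def pick_provider(
    preferred: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first provider that passes its health check, or None."""
    for name in preferred:
        try:
            if is_healthy(name):
                return name
        except Exception:
            # A health check that itself fails counts as "unhealthy".
            continue
    return None


# Hypothetical provider names with a stubbed health status.
status = {"cloud-a": False, "cloud-b": True, "idc-1": True}
chosen = pick_provider(["cloud-a", "cloud-b", "idc-1"], lambda n: status[n])
print(chosen)  # cloud-b
```

Returning `None` instead of raising leaves the caller to decide what "no provider available" means, for example serving a degraded page.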
The architecture separates workloads into edge nodes (game zones) and core nodes (platform services). Edge nodes run on isolated VMs hosting nginx, C++ services, a cache, and a database, and support 3‑5k concurrent players. Core nodes host services that must be highly available, such as registration, login, and payment.
A resource mapping table aligns internal server models with equivalent configurations across clouds, hiding provider differences.
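One way to picture such a mapping table is a lookup keyed by internal model and provider. The sketch below is illustrative only: the model names, sizes, provider names, and instance-type strings are all invented for the example and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InternalModel:
    """Provider-neutral server model (names and sizes are illustrative)."""
    name: str
    vcpus: int
    memory_gib: int


# Hypothetical internal server models.
MODELS = {
    "game-zone-m": InternalModel("game-zone-m", vcpus=8, memory_gib=32),
    "platform-l": InternalModel("platform-l", vcpus=16, memory_gib=64),
}

# Hypothetical mapping: (internal model, provider) -> provider instance type.
INSTANCE_MAP = {
    ("game-zone-m", "cloud-a"): "a.8c32g",
    ("game-zone-m", "cloud-b"): "b-standard-8-32",
    ("platform-l", "cloud-a"): "a.16c64g",
}


def resolve_instance_type(model: str, provider: str) -> str:
    """Translate an internal model into the provider-specific instance type."""
    if model not in MODELS:
        raise KeyError(f"unknown internal model: {model}")
    try:
        return INSTANCE_MAP[(model, provider)]
    except KeyError:
        raise LookupError(f"no mapping for {model} on {provider}") from None


print(resolve_instance_type("game-zone-m", "cloud-b"))  # b-standard-8-32
```

Callers ask only for an internal model, so adding or retiring a provider changes the table rather than the calling code.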
The Resource Gateway acts as the central automation layer: it abstracts cloud APIs, creates resources, registers them in the CMDB, and ensures consistent initialization (monitoring agents, kernel tuning, etc.).
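The gateway's responsibilities can be sketched as a thin layer over per-provider drivers. Everything below is a hypothetical outline, not the company's implementation: `CloudDriver`, `FakeDriver`, the in-memory `cmdb` dictionary, and the two initialization step names are assumptions chosen to mirror the description above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Server:
    server_id: str
    provider: str
    model: str
    owner: str
    init_steps_done: List[str] = field(default_factory=list)


class CloudDriver(ABC):
    """Adapter that hides one provider's API behind a common interface."""

    @abstractmethod
    def create_server(self, model: str) -> str:
        """Create a server and return the provider's ID for it."""


class FakeDriver(CloudDriver):
    """Stand-in driver; a real one would call the provider's SDK."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self.count = 0

    def create_server(self, model: str) -> str:
        self.count += 1
        return f"{self.prefix}-{self.count}"


class ResourceGateway:
    # Assumed initialization steps, named after the examples in the text.
    INIT_STEPS = ("install_monitoring_agent", "apply_kernel_tuning")

    def __init__(self, drivers: Dict[str, CloudDriver]) -> None:
        self.drivers = drivers
        self.cmdb: Dict[str, Server] = {}  # in-memory stand-in for a CMDB

    def provision(self, provider: str, model: str, owner: str) -> Server:
        server_id = self.drivers[provider].create_server(model)
        server = Server(server_id, provider, model, owner)
        self.cmdb[server_id] = server  # every created resource is registered
        for step in self.INIT_STEPS:
            server.init_steps_done.append(step)  # placeholder for real work
        return server


gateway = ResourceGateway({"cloud-a": FakeDriver("a"), "cloud-b": FakeDriver("b")})
server = gateway.provision("cloud-b", "game-zone-m", owner="game-x")
print(server.server_id, server.init_steps_done)
```

Because all creation goes through `provision`, each resource ends up in the CMDB with an owner, which is what makes later usage and cost tracking possible.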
Monitoring, logging, and cost dashboards provide visibility across providers, business units, and resource types, supporting data‑driven decisions.
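A cost dashboard of this kind boils down to grouping the same billing records along different dimensions. The following sketch assumes a simple record shape of (provider, business unit, resource type, cost); the records and names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical billing records: (provider, business_unit, resource_type, cost).
records = [
    ("cloud-a", "game-x", "vm", 120.0),
    ("cloud-a", "game-x", "cdn", 30.0),
    ("cloud-b", "game-y", "vm", 80.0),
    ("idc-1", "platform", "vm", 50.0),
]

# Position of each grouping dimension inside a record.
DIMENSIONS = {"provider": 0, "business_unit": 1, "resource_type": 2}


def cost_by(
    dimension: str, rows: Iterable[Tuple[str, str, str, float]]
) -> Dict[str, float]:
    """Sum cost per distinct value of the chosen dimension."""
    idx = DIMENSIONS[dimension]
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row[idx]] += row[3]
    return dict(totals)


print(cost_by("provider", records))
print(cost_by("resource_type", records))
```

The grouping only works if every resource carries consistent provider, owner, and type labels, which is why registration at creation time matters.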
4. Challenges and Responses
Technical and business priorities often clash; stability, efficiency, and cost cannot be solved by multi‑cloud alone.
Organizational structure influences architecture (Conway’s Law). Balancing technical debt with business growth requires pragmatic trade‑offs.
Key operational practices include:
Defensive programming: distinguish critical and non‑critical paths, apply degradation and circuit‑breaker patterns.
Rate limiting and graceful degradation during traffic spikes.
Robust monitoring and alerting (e.g., Prometheus, Zabbix) with intelligent aggregation.
Chaos engineering and regular disaster‑recovery drills.
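The degradation and circuit-breaker ideas from the list above can be sketched in a few lines. This is a simplified, single-threaded illustration under assumed parameters (`failure_threshold`, `reset_timeout`) and a hypothetical non-critical call, `fetch_leaderboard`; production systems would normally rely on a tested library rather than this sketch.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the timeout has passed, calls are let through again as a trial.
        return self.clock() - self.opened_at < self.reset_timeout

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # degrade without touching the dependency
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def fetch_leaderboard() -> list:
    """Hypothetical non-critical dependency that is currently failing."""
    raise TimeoutError("leaderboard service is down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
for _ in range(3):
    print(breaker.call(fetch_leaderboard, fallback=lambda: []))
print(breaker.is_open())
```

Only non-critical paths should get a fallback like the empty list here; critical paths such as login or payment need real redundancy rather than silent degradation.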
Security must be addressed through top‑down policies, incident response, and coordination with regulators.
5. Insights
The speaker expects cloud services to become commodity infrastructure, like water and electricity, and the role of individual IaaS/PaaS providers to diminish. Engineers should focus on data, quantifiable metrics, and aligning technical solutions with business goals.
Design for failure, respect Conway’s Law, and maintain clear ownership and responsibility across teams to sustain reliable, cost‑effective multi‑cloud operations.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
