How 58 Daojia Built a Cloud‑Native Ops Platform to Streamline Migration and Cut Costs
This article recounts 58 Daojia’s four‑year journey of migrating its IDC infrastructure to the public cloud, the challenges it encountered along the way, and how the team designed and evolved two generations of an operations platform that centralizes asset, cost, domain, and monitoring management to improve efficiency and reduce expenses.
Background
Public cloud has become a mature, stable and cost‑effective option for many small‑to‑medium internet companies. In early 2016, 58 Daojia decided to migrate all of its workloads from a traditional IDC to a public‑cloud environment in order to reduce capital expenditure, simplify maintenance and improve reliability.
Migration Process ("Lingyun" Project)
The migration lasted 114 days, moving more than 2 TB of data, over 160 services and 70 databases. Traffic was shifted gradually using nginx upstream configuration, which allowed a smooth cut‑over without DNS‑level changes and provided an instant rollback path. Major incidents encountered during the migration included a prolonged public‑cloud backbone outage, HAVIP database failures lasting over two hours, and a rapid increase in monthly cloud spend (costs doubled within a few months).
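The gradual shift described above can be illustrated with a small generator for a weighted nginx upstream block. This is a minimal sketch, not the team's actual tooling: the function name `render_upstream`, the server addresses and the percentage-based weighting are all assumptions made for illustration. Only the `server ... weight=N;` and `server ... down;` directives are standard nginx syntax.

```python
def render_upstream(name, idc_servers, cloud_servers, cloud_percent):
    """Render an nginx upstream block that sends about cloud_percent of
    requests to the cloud pool and the rest to the IDC pool.

    Hypothetical helper: per-server weights are scaled by the size of the
    other pool so that the pool totals keep the requested ratio.
    """
    if not 0 <= cloud_percent <= 100:
        raise ValueError("cloud_percent must be between 0 and 100")
    idc_weight = (100 - cloud_percent) * max(len(cloud_servers), 1)
    cloud_weight = cloud_percent * max(len(idc_servers), 1)
    lines = [f"upstream {name} {{"]
    for servers, weight in ((idc_servers, idc_weight), (cloud_servers, cloud_weight)):
        for addr in servers:
            if weight == 0:
                # nginx does not accept weight=0, so the server is marked down.
                lines.append(f"    server {addr} down;")
            else:
                lines.append(f"    server {addr} weight={weight};")
    lines.append("}")
    return "\n".join(lines)


# Send 10 % of traffic to the cloud; rendering with 0 is the rollback.
print(render_upstream("order_service", ["10.0.0.1:8080"], ["172.16.0.1:8080"], 10))
```

Because the split lives entirely in the proxy configuration, raising the percentage step by step and reloading nginx moves traffic without touching DNS, and rendering the block again with `cloud_percent=0` restores the original routing.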
First‑Generation Ops Platform (Oct 2016)
To address asset ownership, cost accounting, NAT permission and domain‑query problems, a first‑generation platform was built. It provided a centralized view of servers, databases and DNS entries and automated many manual processes that previously relied on spreadsheets.
Second‑Generation Ops Platform (Apr 2019)
With Python developers joining the ops team, a second‑generation platform was launched, adding a suite of functional modules:
Cost Center – Exports department‑level asset and expense data, enabling transparent cost visibility and policy enforcement.
Asset Management (Servers) – Tracks ownership and utilization (CPU, memory), and provides deployment suggestions such as “de‑provision if CPU < 40 %”.
CDN File Refresh – Self‑service static‑file refresh via cloud CDN API with role‑based permission control.
Domain Management – Unified UI for internal DNS, public‑cloud DNS and commercial DNS providers.
Monitoring Integration – Embeds Grafana dashboards for real‑time server metrics.
Cluster Domain Management – Keyword/port/IP queries and HTTP APIs for adding/removing domains and clusters.
User & System Configuration – Role‑based access control for each module.
Site Navigation – Quick links to request forms, bastion host, job tickets and internal commands.
Key Technical Modules
Cost Center
Aggregates cloud‑provider billing data with Zabbix usage metrics. Policies such as “de‑provision servers with CPU < 40 %” are enforced automatically, and periodic expense reports are generated per department.
Asset Management
Maintains a database of server ownership, tags, and utilization statistics. Provides queries like “which department is generating traffic from IP X.X.X.X?” and suggests capacity adjustments.
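An ownership query of this kind reduces to an index from IP address to asset record. The sketch below assumes a hypothetical `Asset` record and in-memory index; the real platform presumably backs this with a database whose schema the article does not describe.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    # Hypothetical fields; the article does not list the real schema.
    hostname: str
    ip: str
    department: str


def build_ip_index(assets):
    """Index assets by parsed IP so lookups ignore formatting differences."""
    return {ipaddress.ip_address(a.ip): a for a in assets}


def department_for_ip(index, ip):
    """Return the owning department, or None if the IP is not a known asset."""
    asset = index.get(ipaddress.ip_address(ip))
    return asset.department if asset else None


index = build_ip_index([
    Asset("web-01", "10.0.0.11", "search"),
    Asset("db-01", "10.0.0.21", "orders"),
])
print(department_for_ip(index, "10.0.0.21"))
```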
CDN Refresh Service
Calls the cloud provider’s CDN purge API from the platform UI. Permissions are checked against the user’s role; abusive refresh attempts are logged and can be blocked.
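One way to combine the role check with abuse protection is a small gate in front of the purge call. The following is a sketch under assumptions: the class name `RefreshGate`, the sliding-window limit and its default values are illustrative, and `purge` stands in for whatever client function wraps the cloud provider's CDN purge API.

```python
import logging
import time
from collections import defaultdict, deque

logger = logging.getLogger("cdn_refresh")


class RefreshGate:
    """Checks the caller's role and rate-limits refreshes per user (hypothetical)."""

    def __init__(self, allowed_roles, max_per_window=10, window_seconds=60,
                 clock=time.monotonic):
        self.allowed_roles = set(allowed_roles)
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.clock = clock
        self._history = defaultdict(deque)  # user -> timestamps of accepted calls

    def refresh(self, user, role, urls, purge):
        """Call purge(urls) if permitted; return True when the purge was issued."""
        if role not in self.allowed_roles:
            logger.warning("refresh denied: user=%s role=%s", user, role)
            return False
        now = self.clock()
        history = self._history[user]
        # Drop timestamps that have left the sliding window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_per_window:
            logger.warning("refresh blocked (rate limit): user=%s urls=%s", user, urls)
            return False
        history.append(now)
        purge(urls)
        return True
```

Passing the purge function in as an argument keeps the gate independent of any particular CDN SDK and makes it easy to test with a stub.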
Domain Management
Consolidates internal DNS, public‑cloud DNS and third‑party DNS (e.g., DNSPod) into a single interface. Adding, updating or deleting a domain updates all underlying providers via API calls.
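The single-interface idea can be sketched as a manager that fans each change out to a list of provider adapters. Everything named here is hypothetical: `DomainManager`, the adapter methods `upsert_record` and `delete_record`, and the in-memory provider used in place of real DNS APIs. The article does not say how partial failures are handled, so this sketch simply reports a per-provider result.

```python
class InMemoryProvider:
    """Stand-in for an adapter around one DNS backend (internal, cloud, DNSPod)."""

    def __init__(self, name):
        self.name = name
        self.records = {}

    def upsert_record(self, fqdn, rtype, value):
        self.records[(fqdn, rtype)] = value

    def delete_record(self, fqdn, rtype):
        self.records.pop((fqdn, rtype), None)


class DomainManager:
    """Applies each change to every configured provider."""

    def __init__(self, providers):
        self.providers = list(providers)

    def _fan_out(self, method, *args):
        # Returns {provider name: None on success, error message on failure}.
        results = {}
        for provider in self.providers:
            try:
                getattr(provider, method)(*args)
                results[provider.name] = None
            except Exception as exc:  # one failing backend must not hide the others
                results[provider.name] = str(exc)
        return results

    def upsert(self, fqdn, rtype, value):
        return self._fan_out("upsert_record", fqdn, rtype, value)

    def delete(self, fqdn, rtype):
        return self._fan_out("delete_record", fqdn, rtype)
```

Returning the per-provider outcome lets the UI show which backends accepted a change, which matters when one provider's API is temporarily unavailable.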
Monitoring Integration
Grafana dashboards are embedded directly in the platform, allowing engineers to view server‑level metrics (CPU, memory, network) without leaving the ops portal.
Cluster Domain Management
Provides a searchable catalogue of domain‑to‑cluster mappings. HTTP APIs enable automated addition/removal of domains when clusters scale up or down.
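A minimal in-memory version of such a catalogue might look like the sketch below. The names `Catalogue`, `add_backend`, `remove_backend` and `search` are assumptions; in the real platform these operations would sit behind the HTTP APIs the article mentions, whose routes and payloads are not given.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterDomain:
    domain: str
    cluster: str
    backends: set = field(default_factory=set)  # "ip:port" strings


class Catalogue:
    """Domain-to-cluster mappings with keyword, IP and port search (hypothetical)."""

    def __init__(self):
        self._entries = {}

    def add_backend(self, domain, cluster, ip, port):
        """Called when a cluster scales up."""
        entry = self._entries.setdefault(domain, ClusterDomain(domain, cluster))
        entry.backends.add(f"{ip}:{port}")

    def remove_backend(self, domain, ip, port):
        """Called when a cluster scales down; drops the domain once it is empty."""
        entry = self._entries.get(domain)
        if entry is None:
            return
        entry.backends.discard(f"{ip}:{port}")
        if not entry.backends:
            del self._entries[domain]

    def search(self, keyword=None, ip=None, port=None):
        """Return sorted domain names matching every filter that was supplied."""
        hits = []
        for entry in self._entries.values():
            if keyword and keyword not in entry.domain and keyword not in entry.cluster:
                continue
            pairs = [backend.rsplit(":", 1) for backend in entry.backends]
            if ip and all(host != ip for host, _ in pairs):
                continue
            if port and all(p != str(port) for _, p in pairs):
                continue
            hits.append(entry.domain)
        return sorted(hits)
```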
User & System Configuration
Implements role‑based access control (RBAC) at the module level, ensuring that only authorized teams can modify NAT rules, DNS records or cost‑center data.
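Module-level RBAC can be expressed as a mapping from role to module to permitted actions. The roles, module names and actions below are invented for illustration; the article only states that access is controlled per module.

```python
# Hypothetical role table: role -> module -> allowed actions.
ROLE_PERMISSIONS = {
    "ops_admin": {
        "nat": {"read", "write"},
        "dns": {"read", "write"},
        "cost_center": {"read", "write"},
    },
    "developer": {
        "dns": {"read"},
        "cdn_refresh": {"read", "write"},
    },
    "finance": {
        "cost_center": {"read"},
    },
}


def is_allowed(roles, module, action):
    """True if any of the user's roles grants the action on the module."""
    return any(
        action in ROLE_PERMISSIONS.get(role, {}).get(module, set())
        for role in roles
    )


print(is_allowed(["developer"], "dns", "write"))
print(is_allowed(["developer", "ops_admin"], "nat", "write"))
```

Unknown roles and modules fall through to an empty permission set, so the check denies by default.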
Site Navigation
Centralizes links to request forms, bastion host, job ticket system and common command‑line utilities, reducing context‑switching for both developers and ops engineers.
Cost Management Implementation
The cost‑center module pulls billing data from the cloud provider’s cost‑center API and merges it with real‑time utilization metrics collected by Zabbix. A policy engine evaluates thresholds (e.g., CPU < 40 % for > 7 days) and automatically creates de‑provision tickets. Exported CSV/Excel reports are sent to each department on a quarterly basis.
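The threshold rule and the per-department export can be sketched in a few lines. This is an illustration under assumptions, not the platform's policy engine: the input shape (one dict per server with a monthly cost and a list of daily CPU averages), the function names, and the reading of the rule as "every one of the most recent 7 daily averages is below 40 %" are all hypothetical. Ticket creation is left out; the sketch only lists candidates.

```python
import csv
import io
from collections import defaultdict


def underused(daily_cpu, threshold=40.0, min_days=7):
    """True if the last min_days daily CPU averages exist and are all below threshold."""
    recent = daily_cpu[-min_days:]
    return len(recent) == min_days and all(value < threshold for value in recent)


def department_report(servers):
    """Build a CSV with total cost and de-provision candidates per department.

    servers: iterable of dicts with keys host, department, monthly_cost, daily_cpu.
    """
    totals = defaultdict(lambda: {"cost": 0.0, "candidates": []})
    for server in servers:
        entry = totals[server["department"]]
        entry["cost"] += server["monthly_cost"]
        if underused(server["daily_cpu"]):
            entry["candidates"].append(server["host"])
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["department", "monthly_cost", "deprovision_candidates"])
    for department in sorted(totals):
        entry = totals[department]
        writer.writerow([department, f"{entry['cost']:.2f}", ";".join(entry["candidates"])])
    return buffer.getvalue()
```

Requiring a full window of samples means a newly provisioned server with only a few days of data is never flagged, which avoids ticketing machines that have not yet taken traffic.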
Multi‑Cloud Connectivity Guidance
For hybrid or multi‑cloud environments, the authors recommend using third‑party interconnect services or dedicated IDC‑to‑cloud leased lines. If an on‑premise IDC exists, separate dedicated links to each cloud provider can be aggregated at the IDC to form a private backbone.
Monitoring Strategy
Infrastructure (servers, network devices) is monitored with Zabbix or Open‑Falcon. Container and Kubernetes workloads will be monitored with Prometheus. For Java services, Meituan’s open‑source CAT framework is suggested. The platform plans to expose a “monitoring‑center” module that lets product owners add custom monitors via a UI.
Future Roadmap
Planned enhancements include deeper automation for container and Kubernetes workloads, refined cost‑optimization rules, and expanded self‑service capabilities for additional business teams.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
