How Meituan’s Phoenix SDK Enables Client‑Side CDN Disaster Recovery
This article explains Meituan's Phoenix solution that moves CDN disaster recovery to the client side, detailing its goals, architecture, dynamic calculation service, monitoring platform, implementation for web and native apps, and the measurable improvements in availability and operational efficiency.
1. Introduction
CDNs have become essential Internet infrastructure, and the stability of the many services that rely on them directly affects business availability. At Meituan, CDN disaster recovery has traditionally been handled by the SRE team, with no equivalent mechanism on the client side.
2. Background
A CDN accelerates delivery of static resources such as JS, CSS, images, video, and audio, so a CDN failure can cause blank (white‑screen) pages, broken layouts, and images that fail to load. Monitoring a CDN from the SRE side is difficult because edge nodes are widely distributed, and low‑traffic or regional issues are often hidden in aggregated dashboards.
3. Goals and Scenarios
3.1 Core Goals
Client‑side CDN domain auto‑switch: Detect CDN problems instantly on the client and retry with alternative domains without manual intervention.
Domain isolation: Ensure equivalent CDN domains are isolated by region while providing the same service.
Precise CDN monitoring: Build fine‑grained monitoring per project to reduce alert latency and adjust disaster‑recovery strategies dynamically.
Continuous hot‑standby: Keep each CDN domain warm to avoid back‑origin traffic spikes during switches.
3.2 Applicable Scenarios
All client‑side contexts that depend on CDN—Web, SSR Web, and Native—can benefit from this approach.
4. Phoenix Solution
The Phoenix client‑side CDN disaster‑recovery scheme consists of five parts: a client‑side SDK, a dynamic calculation service, a monitoring platform, CDN services with isolated domains, and a configuration platform.
4.1 Overall Design
The SDK senses resource loading results, performs automatic CDN domain switching, and reports metrics. The dynamic calculation service periodically polls SDK reports, evaluates domain availability per city, project, and time slice, and reorders domains to direct traffic to the most reliable CDN. The monitoring platform visualizes CDN health at project, region, and ISP levels.
4.2 Disaster‑Recovery Flow
When a resource fails to load from a primary CDN domain, the SDK retries using a list of backup domains until success or exhaustion, reducing reliance on manual SRE switches.
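The retry flow can be sketched as follows. This is a minimal illustration, not the Phoenix SDK's actual API: the names `candidateUrls`, `loadWithFallback`, and the `Fetcher` type are hypothetical, and the network call is injected so the logic stays independent of any particular transport.

```typescript
// Hypothetical types and names; the article does not publish the SDK's API.
type Fetcher = (url: string) => Promise<{ ok: boolean; status: number }>;

interface LoadResult {
  url: string | null; // URL that finally succeeded, or null if every domain failed
  attempts: string[]; // every URL tried, in order
}

// Build one candidate URL per domain, primary domain first.
function candidateUrls(path: string, domains: string[]): string[] {
  const normalized = path.startsWith("/") ? path : "/" + path;
  return domains.map((domain) => `https://${domain}${normalized}`);
}

// Try each domain in order until one succeeds or the list is exhausted.
async function loadWithFallback(
  path: string,
  domains: string[],
  fetcher: Fetcher
): Promise<LoadResult> {
  const attempts: string[] = [];
  for (const url of candidateUrls(path, domains)) {
    attempts.push(url);
    try {
      const res = await fetcher(url);
      if (res.ok) return { url, attempts };
    } catch {
      // Network-level failure: fall through to the next domain.
    }
  }
  return { url: null, attempts };
}
```

In a browser, the injected fetcher could simply wrap `fetch`; the attempts list is the kind of data an SDK might report for monitoring.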
4.3 Implementation Details
4.3.1 Client‑Side SDK (Web)
Static resources are loaded via XHR instead of traditional tags, allowing status‑code based success detection. Webpack extracts synchronous resources and loads them through a custom PhoenixLoader, while asynchronous resources are intercepted and re‑routed similarly. The SDK is packaged as a Webpack plugin to ensure broad compatibility, ease of integration, stability, and low intrusiveness.
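A small sketch of the status‑code check and the re‑routing step, under stated assumptions: the helper names are hypothetical, the "2xx means success" rule is an assumption rather than the SDK's documented policy, and the two host names are the example domains used elsewhere in this article.

```typescript
// Assumption: any 2xx status counts as a successful load.
// XHR reports status 0 for network-level errors, which this treats as failure.
function isLoadSuccess(status: number): boolean {
  return status >= 200 && status < 300;
}

// Given the URL that just failed and the ordered domain pool, return the same
// resource on the next domain, or null when the pool is exhausted.
// A host that is not in the pool falls back to the first pool domain.
function nextUrlAfterFailure(failedUrl: string, domains: string[]): string | null {
  const url = new URL(failedUrl);
  const next = domains[domains.indexOf(url.host) + 1];
  if (next === undefined) return null;
  url.host = next; // path and query string are preserved
  return url.toString();
}
```

For example, a failed request to `https://cdn1.meituan.net/static/app.js?v=1` with the pool `["cdn1.meituan.net", "cdn2.meituan.net"]` is retried against the same path on `cdn2.meituan.net`.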
4.3.2 Dynamic Calculation Service
The service links domain pools with project AppKeys, aggregates loading results within five‑minute windows, and computes availability per city and province. It then redistributes traffic based on success‑rate differentials (e.g., transferring a portion of traffic from lower‑success domains to higher‑success ones) to achieve a smooth, optimal domain ordering.
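The article does not give the exact formula, so the following is only one plausible reading of the idea: compute per‑domain success rates for one city inside a five‑minute window, order domains by that rate, and move a share of the weakest domain's traffic to the strongest in proportion to the gap. All type and function names, and the `step` damping factor, are assumptions made for illustration.

```typescript
interface LoadReport {
  domain: string;
  city: string;
  timestampMs: number;
  success: boolean;
}

const WINDOW_MS = 5 * 60 * 1000; // five-minute aggregation window

// Success rate per domain for one city, within the window containing nowMs.
function successRates(reports: LoadReport[], city: string, nowMs: number): Map<string, number> {
  const start = Math.floor(nowMs / WINDOW_MS) * WINDOW_MS;
  const counts = new Map<string, { ok: number; total: number }>();
  for (const r of reports) {
    if (r.city !== city || r.timestampMs < start || r.timestampMs >= start + WINDOW_MS) continue;
    const c = counts.get(r.domain) ?? { ok: 0, total: 0 };
    c.total += 1;
    if (r.success) c.ok += 1;
    counts.set(r.domain, c);
  }
  const rates = new Map<string, number>();
  counts.forEach((c, domain) => rates.set(domain, c.ok / c.total));
  return rates;
}

// Domains sorted from highest to lowest success rate.
function orderDomains(rates: Map<string, number>): string[] {
  return Array.from(rates.entries())
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}

// Move part of the worst domain's traffic weight to the best one.
// The amount moved grows with the success-rate gap; step damps the change
// so that the ordering shifts smoothly rather than flipping all at once.
function shiftTraffic(
  weights: Map<string, number>,
  rates: Map<string, number>,
  step = 0.5
): Map<string, number> {
  const ordered = orderDomains(rates);
  const next = new Map(weights);
  if (ordered.length < 2) return next;
  const best = ordered[0];
  const worst = ordered[ordered.length - 1];
  const gap = (rates.get(best) ?? 0) - (rates.get(worst) ?? 0);
  const moved = (next.get(worst) ?? 0) * gap * step;
  next.set(worst, (next.get(worst) ?? 0) - moved);
  next.set(best, (next.get(best) ?? 0) + moved);
  return next;
}
```

Shifting only part of the traffic on each pass keeps the weaker domain receiving some requests, which fits the hot‑standby goal described earlier.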
4.3.3 Monitoring
Metrics are collected by project, app, resource, and domain, forming a CDN‑availability dashboard that provides minute‑level alerts and detailed diagnostics (region, ISP, response code). This granularity enables faster detection of localized CDN issues compared to traditional SRE dashboards.
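As a rough illustration of minute‑level alerting over these dimensions, the sketch below buckets reported loads by minute and by project, domain, region, and ISP, then flags any bucket whose success rate drops below a threshold. The field names, the key format, and the threshold rule are assumptions; the article does not describe the platform's actual data model.

```typescript
interface MetricPoint {
  project: string;
  domain: string;
  region: string;
  isp: string;
  statusCode: number;
  timestampMs: number;
}

interface Alert {
  minute: number; // minutes since the Unix epoch
  key: string; // "project|domain|region|isp"
  successRate: number;
}

// Group points into one-minute buckets per (project, domain, region, ISP)
// and return an alert for each bucket below the success-rate threshold.
function minuteAlerts(points: MetricPoint[], threshold: number): Alert[] {
  const buckets = new Map<string, { minute: number; key: string; ok: number; total: number }>();
  for (const p of points) {
    const minute = Math.floor(p.timestampMs / 60000);
    const key = [p.project, p.domain, p.region, p.isp].join("|");
    const id = `${minute}#${key}`;
    const b = buckets.get(id) ?? { minute, key, ok: 0, total: 0 };
    b.total += 1;
    if (p.statusCode >= 200 && p.statusCode < 300) b.ok += 1;
    buckets.set(id, b);
  }
  const alerts: Alert[] = [];
  buckets.forEach((b) => {
    const successRate = b.ok / b.total;
    if (successRate < threshold) {
      alerts.push({ minute: b.minute, key: b.key, successRate });
    }
  });
  return alerts;
}
```

Because the bucket key includes region and ISP, a failure confined to one carrier in one city produces its own alert instead of being averaged away in a global success rate.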
4.3.4 CDN Service Enhancements
CDN services now support domain isolation and provide equivalent domains (e.g., cdn1.meituan.net and cdn2.meituan.net) that return identical content, ensuring that client‑side switches remain effective without causing back‑origin overload.
5. Results and Outlook
After one year, Phoenix handles over 30 million disaster‑recovery requests per day, recovering failed resource loads for more than 350,000 users across Meituan’s food‑delivery, travel, and other businesses. It is integrated into over 200 projects, including the Meituan and Dianping apps. The solution provides minute‑level, project‑specific alerts, reducing manual SRE interventions and improving overall CDN availability.
Future work includes expanding SDK compatibility with more front‑end frameworks, open‑sourcing the dynamic calculation service, and enhancing resource verification, intelligent switching, and performance optimizations.
21CTO