How Meituan’s Phoenix SDK Enables Automatic Client‑Side CDN Failover
Meituan’s Phoenix solution equips web and native clients with an automatic CDN failover SDK, dynamic domain selection, and fine‑grained monitoring, dramatically improving resource loading success rates, reducing SRE workload, and ensuring high availability across millions of daily users.
Introduction
CDN has become critical infrastructure for modern web services, yet CDN outages still cause image loading failures, page white‑screens, and layout glitches. Traditional CDN disaster recovery is handled by SRE teams, which often cannot react quickly enough for localized failures. Meituan’s Phoenix project moves part of the disaster‑recovery logic to the client side.
Background
Meituan’s food‑delivery platform serves billions of requests daily, with more than 70% of static assets (JS, CSS, images, video) hosted on CDN. CDN incidents—such as edge‑node failures or domain bans—directly impact user experience and generate heavy SRE workload. Existing SRE‑centric monitoring aggregates data at a coarse granularity, masking small‑scale or regional issues.
Goals and Applicable Scenarios
Client‑side CDN domain auto‑switch: Detect CDN failures on the client as they occur and retry with an alternate domain, without manual intervention.
Domain isolation: Provide equivalent domains that can be switched transparently while preserving service semantics.
Fine‑grained CDN monitoring: Track CDN availability per project, region, and time slice, enabling dynamic adjustment of disaster‑recovery policies.
Continuous domain pre‑warming: Keep backup domains warm so that traffic shifts do not cause back‑to‑origin spikes.
The solution targets any client‑side scenario that relies on CDN—Web, SSR Web, and Native applications.
Phoenix Architecture Overview
The Phoenix system consists of five major components:
Client‑side Disaster‑Recovery SDK: Senses resource load results, performs automatic CDN retries, and reports metrics.
Dynamic Calculation Service: Periodically evaluates CDN health per city, project, and time window, and reorders domain lists accordingly.
Disaster‑Recovery Monitoring Platform: Provides project‑level and global dashboards, minute‑level alerts, and detailed diagnostic data.
CDN Service: Supplies equivalent domains and enforces domain isolation at the network layer.
Disaster‑Recovery Configuration Platform: Manages domain configurations, reporting policies, and manual traffic interventions.
Client‑Side SDK Implementation
Web Side
The SDK focuses on JS, CSS, and image assets. It replaces traditional tag‑based loading with XHR requests, wired in by a custom PhoenixLoader Webpack plugin. Unlike the generic error event a tag fires, an XHR response exposes the HTTP status code, so the SDK can detect a failure and trigger a retry.
Design considerations include:
Generality: Must work across Meituan’s diverse front‑end stacks.
Ease of use: Low integration cost to encourage adoption.
Stability: Remain functional even when CDN availability fluctuates.
Non‑intrusiveness: Plug‑and‑play without breaking existing business logic.
For asynchronous chunks, the plugin rewrites Webpack’s async loading mechanism so that PhoenixLoader handles the request and fallback logic.
CSS assets are treated similarly by overriding the async loader of mini-css-extract-plugin.
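The retry flow described above can be illustrated with a minimal sketch. It is not the Phoenix SDK's code: the names candidateUrls, loadWithFailover, Fetcher, and the report callback are all hypothetical, and the network call is injected as a function so the logic stays independent of XHR or any particular runtime.

```typescript
// Hypothetical sketch; none of these names come from the Phoenix SDK.
type FetchResult = { status: number; body: string };
type Fetcher = (url: string) => Promise<FetchResult>;

/** Build the ordered list of URLs to try: original host first, then backups. */
function candidateUrls(url: string, domains: string[]): string[] {
  const match = /^(https?:\/\/)([^\/]+)(.*)$/.exec(url);
  if (match === null) return [url]; // not an absolute HTTP(S) URL: nothing to swap
  const scheme = match[1];
  const host = match[2];
  const rest = match[3]; // path and query are kept unchanged
  const hosts = [host].concat(domains.filter((d) => d !== host));
  return hosts.map((h) => scheme + h + rest);
}

/** Try each equivalent domain in turn; the caller sees one final result. */
async function loadWithFailover(
  url: string,
  domains: string[],
  fetcher: Fetcher,
  report: (url: string, ok: boolean) => void = () => {},
): Promise<FetchResult> {
  let lastError: unknown = new Error("no candidate URLs");
  for (const candidate of candidateUrls(url, domains)) {
    try {
      const result = await fetcher(candidate);
      const ok = result.status >= 200 && result.status < 300;
      report(candidate, ok); // every attempt is reported, success or failure
      if (ok) return result;
      lastError = new Error("HTTP " + result.status + " from " + candidate);
    } catch (err) {
      report(candidate, false); // network-level failure (timeout, DNS, ...)
      lastError = err;
    }
  }
  throw lastError; // every domain failed
}
```

Injecting the fetcher means the same loop could sit behind an XHR wrapper in the browser or a stub in tests. Treating any non-2xx status as a failure is an assumption of this sketch; a real SDK would likely distinguish retryable errors from ones such as 404 that a domain switch cannot fix.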
Native Side
Native clients (Android/iOS) primarily handle images, audio/video, and bundled resources. The SDK intercepts failed requests, then transparently re‑issues the request to a backup CDN domain. The process is invisible to the business layer: only one request is observed by the caller, and the final response is returned after a successful retry.
Key design points:
Convenience: Simple integration and clean removal.
Compatibility: Works with Retrofit, OkHttp3, URLConnection on Android and with NSURLProtocol on iOS.
Extensibility: Allows custom monitoring data to be reported.
Dynamic Calculation Service
The service associates a domain pool with each project's AppKey and recomputes per‑region availability every five minutes from recent load‑result reports. It reorders domains by success rate and traffic share, applying a transfer‑baseline algorithm that shifts traffic toward higher‑success domains while keeping a small random share on lower‑success domains.
Example: Domains A (99% success, 90% traffic), B (98% success, 6% traffic), C (97.8% success, 4% traffic) are re‑balanced to A 95%, B 3.5%, C 1.5% after calculation.
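The source does not spell out the transfer‑baseline algorithm, so the following is only a sketch of one simple rule that happens to reproduce the numbers in the example: move a fixed number of percentage points from each lower‑success domain to the best one, without pushing any domain below a minimum share. The names rebalance and pickDomain and the shift and floor parameters are assumptions made for illustration.

```typescript
// Hypothetical sketch; not the actual transfer-baseline algorithm.
interface DomainStat {
  domain: string;
  successRate: number;  // percent, e.g. 99
  trafficShare: number; // percent, shares sum to 100
}

/** Shift up to `shift` points from each lower-success domain to the best one. */
function rebalance(stats: DomainStat[], shift: number, floor: number): DomainStat[] {
  if (stats.length === 0) return [];
  // Rank by success rate, best first.
  const ranked = stats.slice().sort((a, b) => b.successRate - a.successRate);
  let moved = 0;
  const rest = ranked.slice(1).map((s) => {
    // Never take a domain below `floor`, so it keeps receiving some traffic.
    const take = Math.max(0, Math.min(shift, s.trafficShare - floor));
    moved += take;
    return { domain: s.domain, successRate: s.successRate, trafficShare: s.trafficShare - take };
  });
  const top = ranked[0];
  const best = { domain: top.domain, successRate: top.successRate, trafficShare: top.trafficShare + moved };
  return [best].concat(rest);
}

/** Weighted random choice; `rand` is a number in [0, 1), e.g. Math.random(). */
function pickDomain(stats: DomainStat[], rand: number): string {
  if (stats.length === 0) throw new Error("empty domain list");
  const total = stats.reduce((sum, s) => sum + s.trafficShare, 0);
  let remaining = rand * total;
  for (const s of stats) {
    remaining -= s.trafficShare;
    if (remaining < 0) return s.domain;
  }
  return stats[stats.length - 1].domain; // guard against rounding at the upper edge
}
```

With shift set to 2.5 and floor set to 1, the input A 90%, B 6%, C 4% becomes A 95%, B 3.5%, C 1.5%. The floor is what keeps a small share of traffic on lower‑success domains, and pickDomain shows one way a client could honor those shares when choosing a domain.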
Monitoring Platform
Metrics are reported per project, app, resource, and domain. The platform aggregates data into a minute‑level CDN availability dashboard, exposing dimensions such as region, ISP, network condition, and HTTP status code. Alerts trigger within minutes of an anomaly, offering richer context than traditional SRE dashboards.
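As a rough illustration of the minute‑level aggregation, the sketch below groups client reports by project, domain, region, and minute, then computes a success rate per bucket. The LoadReport shape, the function names, and the rule that only 2xx responses count as successes are assumptions; the source does not describe the platform's data model.

```typescript
// Hypothetical sketch; the report fields and function names are assumptions.
interface LoadReport {
  project: string;
  domain: string;
  region: string;
  timestampMs: number; // client-side time of the load attempt
  status: number;      // HTTP status code; 0 stands for "no response"
}

/** One bucket per project, domain, region, and minute. */
function bucketKey(r: LoadReport): string {
  const minute = Math.floor(r.timestampMs / 60000);
  return [r.project, r.domain, r.region, minute].join("|");
}

/** Success rate (0..1) for every bucket that received at least one report. */
function availabilityByMinute(reports: LoadReport[]): Record<string, number> {
  const counts: Record<string, { ok: number; total: number }> = {};
  for (const r of reports) {
    const key = bucketKey(r);
    if (counts[key] === undefined) counts[key] = { ok: 0, total: 0 };
    counts[key].total += 1;
    if (r.status >= 200 && r.status < 300) counts[key].ok += 1;
  }
  const rates: Record<string, number> = {};
  for (const key of Object.keys(counts)) {
    rates[key] = counts[key].ok / counts[key].total;
  }
  return rates;
}

/** Buckets whose availability fell below `threshold`, as alert candidates. */
function bucketsBelow(rates: Record<string, number>, threshold: number): string[] {
  return Object.keys(rates).filter((k) => rates[k] < threshold).sort();
}
```

Keeping region and domain in the bucket key is what lets a regional failure stand out instead of being averaged away in a global figure. Further dimensions such as ISP or network condition could be appended to the key in the same way.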
Results
After one year of deployment, Phoenix handles over 30 million daily disaster‑recovery requests, saving more than 350,000 user sessions. Business success rates for image loading rose from 99.7% to 99.9% on average. In A/B tests, Android and iOS apps that enabled Phoenix showed a noticeable reduction in failure‑induced white‑screens and layout glitches.
Conclusion and Outlook
Phoenix has become Meituan’s sole public CDN disaster‑recovery service, supporting over 200 projects across food‑delivery, travel, and e‑commerce lines. It has dramatically reduced manual CDN switch‑overs, lowered SRE operational pressure, and improved end‑user experience. Future work includes expanding SDK coverage to more front‑end frameworks, enhancing resource‑signature verification, and open‑sourcing the dynamic calculation service to encourage broader adoption.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
ITPUB
