How Nanguo Film Migrated 30+ Services to Alibaba Cloud Serverless in Just 7 Days
In a seven‑day sprint, Nanguo Film transformed its entire streaming platform by moving over 30 systems to Alibaba Cloud's Serverless Application Engine, cutting operational effort by 70%, reducing costs by more than 40%, and achieving ten‑fold faster scaling while maintaining zero downtime.
Pain Points
The original architecture ran entirely on Alibaba Cloud ECS instances. Operational bottlenecks included:
Slow elastic scaling – during traffic spikes a new ECS had to be purchased and manually provisioned, causing SLA violations.
Lengthy, error‑prone release cycles – hundreds of servers required manual updates for each deployment.
High maintenance overhead – operations required expertise in Lua/Ansible scripts, cloud networking, and monitoring.
Poor resource utilization – capacity was sized for peak load, leaving most of the fleet idle during off‑peak periods.
Complex permission management – RAM policies were applied at the machine level, making multi‑tenant access cumbersome.
Selection Process
Three migration paths were evaluated:
Deep script optimization: could automate some tasks but still depended on skilled ops personnel and manual ECS procurement.
Self‑built Kubernetes: offered high density and auto‑scaling but required a steep learning curve and a dedicated ops team.
Alibaba Cloud Serverless Application Engine (SAE): provided direct WAR/JAR deployment, on‑demand elastic resources, and minimal operational overhead.
SAE was chosen as the final solution.
Implementation Rounds
Round 1 – CI/CD Pipeline
The team integrated Travis CI with SAE to replace the ECS deployment workflow. The pipeline runs three steps:
Run unit tests on each commit.
Upload build artifacts to a private OSS bucket.
Deploy the artifact to SAE using the same deploy step that previously targeted ECS.
SAE supports single‑batch and canary releases, plus rollback to a previous version, which keeps releases fast and reliable.
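To make the difference between a single‑batch and a canary release concrete, here is a minimal sketch that only plans which instances are updated in each step. It does not call SAE, and `plan_release` with its `batches` and `canary` parameters is a hypothetical helper invented for illustration.

```python
from typing import List


def plan_release(instances: List[str], batches: int = 1, canary: int = 0) -> List[List[str]]:
    """Split instances into ordered release steps (hypothetical helper).

    canary:  number of instances updated first as a trial group.
    batches: number of roughly equal groups for the remaining instances.
    """
    if batches < 1:
        raise ValueError("batches must be at least 1")
    if not 0 <= canary <= len(instances):
        raise ValueError("canary must be between 0 and the instance count")

    steps: List[List[str]] = []
    if canary:
        steps.append(instances[:canary])

    rest = instances[canary:]
    # Spread the remainder so that batch sizes differ by at most one.
    size, extra = divmod(len(rest), batches)
    start = 0
    for i in range(batches):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            steps.append(rest[start:end])
        start = end
    return steps
```

With the defaults every instance lands in one step, which corresponds to a single‑batch release. Passing `canary=1` puts one instance in front of the rest, so a bad build can be rolled back before it reaches the whole fleet.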
Round 2 – First Application Migration (API Gateway)
The API gateway, the highest‑traffic service, was selected first because it already spanned multiple regions and could run in parallel on ECS and SAE. Traffic was gradually shifted to SAE while keeping the ECS instances as a hot standby.
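The article does not say which mechanism was used to shift traffic. As one possible illustration, the sketch below routes a fixed percentage of requests to SAE by hashing a request identifier. The function `route`, the backend labels, and the use of a request ID are all assumptions, not details from the migration.

```python
import hashlib


def route(request_id: str, sae_percent: int) -> str:
    """Return "sae" or "ecs" for a request, given the share sent to SAE.

    Hashing keeps the decision stable per request ID, so raising
    sae_percent only moves additional requests to SAE and never
    moves one back to ECS.
    """
    if not 0 <= sae_percent <= 100:
        raise ValueError("sae_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return "sae" if bucket < sae_percent else "ecs"
```

Stepping `sae_percent` through values such as 1, 10, 50, and 100 gives a gradual cutover, and setting it back to 0 returns all traffic to the ECS standby.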
Round 3 – Auto‑Scaling Under Surge
The team ran a stress test at five times the traffic of a blockbuster release. SAE auto‑scaling rules were configured with thresholds for CPU, memory, QPS, and response time. SAE scaled out within seconds and scaled back in when load dropped, delivering roughly 40% hardware cost savings compared with a permanently provisioned ECS fleet.
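A threshold‑based rule over these four metrics can be sketched as follows. This is not SAE's scaling algorithm; it is a simple proportional rule assumed for illustration, and the names `Metrics` and `desired_instances` are hypothetical. Treating response time as proportional to load is a rough simplification.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Per-instance averages; also used to hold the threshold values."""
    cpu_percent: float
    memory_percent: float
    qps: float
    response_time_ms: float


def desired_instances(current: int, observed: Metrics, thresholds: Metrics,
                      min_instances: int, max_instances: int) -> int:
    """Scale so that the most loaded metric returns to its threshold."""
    pressure = max(
        observed.cpu_percent / thresholds.cpu_percent,
        observed.memory_percent / thresholds.memory_percent,
        observed.qps / thresholds.qps,
        observed.response_time_ms / thresholds.response_time_ms,
    )
    target = math.ceil(current * pressure)
    # Clamp to the configured floor and ceiling.
    return max(min_instances, min(max_instances, target))
```

Because the rule takes the maximum ratio, any single metric above its threshold is enough to scale out, while scale‑in only happens when all four are below their thresholds.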
Round 4 – Full‑Link Monitoring & Diagnosis
SAE’s built‑in ARMS monitoring provides:
Topology maps of service calls.
Slow‑SQL and slow‑service detection.
Method‑level call‑stack traces.
Top‑N application reports for quick prioritization.
These features reduced troubleshooting time dramatically.
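As an illustration of what a Top‑N report computes, the sketch below ranks services by average call duration. It does not use the ARMS API; the span format (service name, duration in milliseconds) and the function `top_n_slowest` are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def top_n_slowest(spans: Iterable[Tuple[str, float]], n: int = 3) -> List[Tuple[str, float]]:
    """Return the n services with the highest average duration."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for service, duration_ms in spans:
        totals[service] += duration_ms
        counts[service] += 1
    averages = {service: totals[service] / counts[service] for service in totals}
    # Sort by average duration, slowest first.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]
```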
Round 5 – Enterprise‑Grade Permission Isolation & Approval
Permission management shifted from machine‑level RAM policies to application‑level roles. A single grant per application is sufficient. SAE also enforces a main‑account approval workflow for any sub‑account operation, preventing unauthorized changes.
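The combination of an application‑level grant and a main‑account approval can be modeled in a few lines. The sketch below is a toy model, not the RAM or SAE API; the class `AppAccess` and its methods are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class AppAccess:
    """Toy model: one grant per application plus per-operation approval."""
    grants: Dict[str, Set[str]] = field(default_factory=dict)
    approvals: Set[Tuple[str, str, str]] = field(default_factory=set)

    def grant(self, sub_account: str, application: str) -> None:
        # A single grant covers the whole application, not each machine.
        self.grants.setdefault(sub_account, set()).add(application)

    def approve(self, sub_account: str, application: str, operation: str) -> None:
        # Called on behalf of the main account.
        self.approvals.add((sub_account, application, operation))

    def can_run(self, sub_account: str, application: str, operation: str) -> bool:
        has_grant = application in self.grants.get(sub_account, set())
        return has_grant and (sub_account, application, operation) in self.approvals
```

An operation proceeds only when both conditions hold, so a sub‑account with a grant still cannot deploy until the main account approves that specific operation.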
Round 6 – Completion
Within seven days all 30+ services (hundreds of servers) were fully migrated to SAE. The migration required only 1–2 developers and incurred zero incidents.
Results & Benefits
Scaling time dropped from hours to seconds; capacity now tracks demand, avoiding both over‑provisioning and under‑provisioning.
Release cycles accelerated via CI/CD and one‑click CloudToolkit deployments.
Operations became largely hands‑off; alerts trigger automatic remediation.
Integrated monitoring shortened problem‑diagnosis time.
Overall development efficiency improved by ~70%, cost fell by more than 40%, and scaling efficiency grew more than 10×.
Key Takeaways
Deploy applications across multiple availability zones for resilience.
Use batch, canary, or gray‑release strategies for multi‑instance services.
Implement health‑check scripts and run them before deployment to avoid start‑up failures.
Derive scaling thresholds from thorough load‑testing; prefer conservative (lower) thresholds so that scale‑out starts before instances saturate.
Configure SLS logging and ARMS alerts to enable effective post‑incident analysis.
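The health‑check takeaway can be sketched as a small gate that retries a probe before a release step continues. This is an illustrative helper, not an SAE feature; `wait_until_healthy` and the injected `probe` callable are hypothetical, and in practice the probe might issue an HTTP request to the application's health endpoint.

```python
import time
from typing import Callable


def wait_until_healthy(probe: Callable[[], bool], attempts: int = 5,
                       delay_seconds: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True as soon as probe() succeeds, False after all attempts fail."""
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return True
        except Exception:
            pass  # A probe that raises counts as an unhealthy result.
        if attempt < attempts:
            sleep(delay_seconds)  # Wait before the next attempt.
    return False
```

Injecting `sleep` keeps the helper easy to test and lets a pipeline swap in its own waiting strategy.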
Alibaba Cloud Native