How We Scaled a Live‑Streaming Platform from 10K to 1M Concurrent Users in 3 Days

This article recounts how a pandemic‑era live‑streaming service grew from ten thousand to one million concurrent viewers within three days. It covers the pre‑deployment assessment, container‑based scaling, monitoring, emergency response plans, and post‑launch optimizations.

Efficient Ops

Mar 8, 2020

Background

On the afternoon of January 27, the operations team received a request to move a daily morning meeting for insurance agents from on‑site to an online live broadcast. This required rapidly raising capacity from ten thousand to one million concurrent users.

Pre‑deployment Preparation – ① System Status Review

Business scenario analysis: Ensure login, stream push/pull, and core live‑stream functions are stable; secondary features like rewards, chat, and forums are less critical.

Application architecture review: Map upstream/downstream component calls, assess whether high‑frequency DB access is suitable, consider asynchronous or cache‑based designs, and identify critical versus value‑added services for possible degradation.

Configuration review: Document cluster topology, call strategies, URLs, network bandwidth, firewall rules, and I/O settings to enrich the CMDB and feed monitoring.
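
The architecture review mentions cache‑based designs as an alternative to high‑frequency DB access. The article does not say which cache the team used, so the following is only a minimal cache‑aside sketch with a time‑to‑live. `TTLCache`, `get_or_load`, and `load_room_from_db` are hypothetical names introduced for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-process cache-aside helper with per-entry expiry (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[str], Any]) -> Any:
        """Return the cached value if still fresh, otherwise call loader and cache it."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh entry: the database is not touched
        value = loader(key)  # miss or expired: fall through to the backing store
        self._store[key] = (now + self.ttl, value)
        return value


def load_room_from_db(room_id: str) -> dict:
    """Hypothetical stand-in for a real database query."""
    return {"room_id": room_id, "title": "Morning meeting"}


# Repeated reads of the same room within 5 seconds reach the loader only once.
room_cache = TTLCache(ttl_seconds=5.0)
room = room_cache.get_or_load("room-1", load_room_from_db)
```

A short TTL trades a little staleness for a large reduction in read load, which tends to suit data such as room metadata that many viewers request at once.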

Pre‑deployment Preparation – ② Expansion Plan & Load Testing

Because the core components had been containerized since 2019, dynamic scaling was possible. An expansion plan compared current capacity with projected load, and resources such as network, storage, and CDN were scaled end‑to‑end. Load‑testing and monitoring validated the plan before the January 31 launch.
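
Comparing current capacity with projected load can be reduced to simple arithmetic per component. The sketch below assumes each component has a measured per‑replica capacity from load testing and that a fixed headroom percentage is added on top of the target; the component names and all numbers are hypothetical and not taken from the article.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Component:
    name: str
    per_replica_capacity: int  # concurrent users one replica sustained in load tests
    current_replicas: int


def replicas_needed(target_users: int, per_replica_capacity: int, headroom_pct: int = 30) -> int:
    """Replicas required to serve target_users plus headroom, rounded up.

    Integer arithmetic is used so the result does not depend on float rounding.
    """
    if per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must be positive")
    total = target_users * (100 + headroom_pct)
    denominator = 100 * per_replica_capacity
    return (total + denominator - 1) // denominator  # ceiling division


def expansion_plan(components: List[Component], target_users: int,
                   headroom_pct: int = 30) -> Dict[str, int]:
    """Map each component name to the number of replicas that must be added."""
    plan = {}
    for c in components:
        needed = replicas_needed(target_users, c.per_replica_capacity, headroom_pct)
        plan[c.name] = max(0, needed - c.current_replicas)
    return plan


# Hypothetical inventory, for illustration only.
inventory = [
    Component("gateway", per_replica_capacity=5000, current_replicas=10),
    Component("stream-pull", per_replica_capacity=3000, current_replicas=20),
]
plan = expansion_plan(inventory, target_users=1_000_000)
```

The same calculation applies to non‑container resources such as bandwidth or CDN quota if they are expressed as capacity per unit.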

Pre‑deployment Preparation – ③ Monitoring Solution Review & Deployment

Monitoring covered user experience, service chain, application health, infrastructure, and business trends, following four basic principles: existence, liveness, availability, and efficiency.
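
The article does not define the four principles in detail. One plausible reading, assumed here, is a ladder of checks: the process exists, it is responsive, it returns correct results, and it does so within a latency budget. The sketch below reports the first level that fails; the `Probe` fields and the 500 ms budget are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Probe:
    """Raw observations for one service instance (hypothetical fields)."""
    process_found: bool           # existence: is the process running at all?
    port_open: bool               # liveness: does it accept connections?
    request_ok: bool              # availability: does a test request succeed?
    latency_ms: Optional[float]   # efficiency: how long did the test request take?


def first_failed_level(probe: Probe, latency_budget_ms: float = 500.0) -> str:
    """Return the first failing check level, or 'healthy' if all four pass."""
    if not probe.process_found:
        return "existence"
    if not probe.port_open:
        return "liveness"
    if not probe.request_ok:
        return "availability"
    if probe.latency_ms is None or probe.latency_ms > latency_budget_ms:
        return "efficiency"
    return "healthy"


status = first_failed_level(Probe(True, True, True, latency_ms=820.0))
```

Ordering the checks this way means an alert names the most basic failure first, which narrows down where an operator should start looking.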

Pre‑deployment Preparation – ④ Emergency Plan Consolidation

Two‑level emergency plans were defined: business‑level (traffic shaping, degradation, rate‑limiting) and IT‑component level (isolation, restart, failover, feature‑toggle). All actions were broken down to executable command granularity.
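
Breaking actions down to executable command granularity suggests a runbook in which every action carries the exact commands an operator would run. The sketch below is one way to model that, with a dry‑run listing for review before an event. The class names and the script paths in the example are hypothetical placeholders, not the team's actual tooling.

```python
from dataclasses import dataclass, field
from typing import List

LEVELS = ("business", "component")


@dataclass
class Action:
    name: str
    level: str            # "business" or "component"
    commands: List[str]   # exact commands, in execution order


@dataclass
class Runbook:
    actions: List[Action] = field(default_factory=list)

    def add(self, action: Action) -> None:
        """Register an action, rejecting ones that are not concrete enough to execute."""
        if action.level not in LEVELS:
            raise ValueError(f"unknown level: {action.level}")
        if not action.commands:
            raise ValueError("an action needs at least one executable command")
        self.actions.append(action)

    def dry_run(self, name: str) -> List[str]:
        """List the numbered steps for an action without executing anything."""
        for action in self.actions:
            if action.name == name:
                return [f"[{action.level}] step {i}: {cmd}"
                        for i, cmd in enumerate(action.commands, start=1)]
        raise KeyError(name)


runbook = Runbook()
# Placeholder commands: these scripts are hypothetical.
runbook.add(Action("degrade-short-video", "business",
                   ["./ops/toggle_feature.sh short-video off"]))
runbook.add(Action("restart-gateway", "component",
                   ["./ops/drain.sh gateway-1", "./ops/restart.sh gateway-1"]))
steps = runbook.dry_run("restart-gateway")
```

Rejecting actions that have no commands keeps vague entries such as "restart if needed" out of the plan.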

During the Event – Vigilant Monitoring

Operations performed daily system checks at 05:00 and held morning stand‑ups at 07:00 with product, development, and monitoring teams. Real‑time monitoring of resources, performance spikes, logs, and third‑party services enabled immediate anomaly detection and response.

Pre‑defined traffic‑shaping, rate‑limiting, and degradation plans proved essential when a surge of user‑generated short videos strained resources; the emergency plan was triggered to keep the broadcast stable.
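
The article does not describe how rate‑limiting and degradation were implemented. As a sketch under that caveat, a token bucket is one common way to cap a secondary feature such as short‑video uploads while core functions are always admitted. The feature names, rate, and burst values below are assumptions for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket rate limiter: refills at rate_per_sec, holds at most burst tokens."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when the caller should be rejected."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical feature names; the article lists login and stream push/pull as core.
CORE_FEATURES = {"login", "push_stream", "pull_stream"}


def admit(feature: str, bucket: TokenBucket, degraded: bool) -> bool:
    """Core features always pass; secondary ones are dropped when degraded, else rate-limited."""
    if feature in CORE_FEATURES:
        return True
    if degraded:
        return False
    return bucket.allow()


upload_bucket = TokenBucket(rate_per_sec=100.0, burst=200)
ok = admit("short_video_upload", upload_bucket, degraded=False)
```

Setting `degraded=True` models the degradation switch: secondary traffic is shed entirely, leaving capacity for the broadcast itself.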

Daily post‑mortems at 16:00 and frequent sync calls ensured issues were addressed promptly, with work often extending overnight.

Post‑event Continuous Optimization

After two weeks of targeted optimizations, daily live sessions stabilized at around 840,000 concurrent users, with total daily participation reaching three million.

Key improvements included strengthening platform security, building a health‑metric system, and accelerating response times through architectural refinements.

Transparent communication and well‑defined role responsibilities were crucial for rapid coordination, while automated scripts and tools replaced manual operations for a server fleet exceeding 1,000 nodes.

Lessons learned highlighted both the advantages of container‑based elastic scaling and the need to address side effects such as request‑forwarding anomalies and tuning gaps.

Conclusion

The team thanks all heroes behind the “cloud battle‑against‑epidemic” effort and invites further discussion on large‑scale event assurance.
