What Went Wrong with Our Multi‑User Live Audio Feature? A Post‑Mortem and Lessons Learned
This post‑mortem reviews a chaotic multi‑person live‑audio incident, detailing its background, symptoms, timeline, root causes, and the improvement plan, offering practical insights for better release planning, gray‑release strategies, and risk awareness in software operations.
Background
The live‑audio ("连麦", literally "linking mics") feature is widely used by instructors to boost classroom interaction. After supporting single‑person audio, the team explored a cost‑effective multi‑person solution in July, designing an architecture that could save over 99% of the usual Agora (声网) costs. The new design required users to install the latest client version to participate, while users on older versions could only watch. No plan was made for how the two groups would coexist, which set the stage for the problems that followed.
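The version rule described above (latest client may participate, older clients may only watch) can be sketched as a small gate. This is an illustration only: the threshold `MIN_MULTI_AUDIO_VERSION`, the role names, and the dotted version format are assumptions, not details from the incident.

```python
from typing import Tuple

# Hypothetical threshold: the first client build that supports multi-person audio.
MIN_MULTI_AUDIO_VERSION: Tuple[int, int, int] = (5, 2, 0)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '5.2.0' into (5, 2, 0) so versions compare numerically, not as strings.

    Short versions such as '5.2' are padded with zeros to three components.
    """
    parts = [int(part) for part in text.strip().split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def audio_role(client_version: str) -> str:
    """Return 'speaker' if the client may join the audio, otherwise 'listener'."""
    if parse_version(client_version) >= MIN_MULTI_AUDIO_VERSION:
        return "speaker"
    return "listener"
```

Comparing parsed tuples rather than raw strings matters here: as strings, "5.10.0" sorts before "5.2.0", which would silently demote newer clients to listeners.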
Development accelerated in August, with a preview released on August 22 and the official launch on August 24. The business side had advised against non‑essential releases in September, a critical month for user renewals, but the product team proceeded, and the first version of multi‑person audio went live on August 25.
Incident Analysis
Symptoms
During the incident, users with poor network conditions experienced echoing of the teacher's voice. This problem was not observed during testing and proved difficult to reproduce even after user reports.
Timeline
From August 26 to September 1, the team worked through the issue. It quickly identified a likely cause and applied a fix based on the documentation; QA could not reproduce the echo, so the fix was released without being verified against the actual failure. Apart from a small "private" class on August 31, no classes were held between August 26 and 31, so no feedback arrived in that window. On August 31 (in that private class) and September 1, users again reported the echo. The team then discovered a problem with the Agora SDK interface, contacted Agora, and began investigating.
At this point, discussions about rolling back the code began, but the architectural changes had been extensive and the original single‑person audio implementation had been removed, making a server‑side rollback costly and impractical. There was also no effective gray‑release mechanism: the live user base mixed new and old client versions, and the existing gray‑release process relied on manual Excel tracking, which could not scale. The incident also exposed shortcomings in the testing methodology; only later did the team find relevant metrics in Agora’s monitoring dashboard.
Root Causes
During design, the team failed to devise a clear compatibility plan for old and new client versions.
Testing could not simulate the problematic environment, forcing reliance on online testing.
Ambiguous documentation from the Agora SDK required repeated clarification with the vendor.
Lack of proper project planning and an effective gray‑release strategy amplified the impact.
Inability to roll back code left all clients in a dead‑end.
Insufficient risk awareness among involved personnel.
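One way to narrow the testing gap named in the root causes is to make "poor network" a repeatable test input. The sketch below is a toy model with assumed parameters (`loss_rate`, `base_delay_ms`, `jitter_ms`); the source does not say which network conditions triggered the echo, and a real test setup would more likely shape traffic at the OS level, for example with Linux `tc` and its `netem` queueing discipline.

```python
import random
from typing import List, Tuple

Packet = Tuple[int, float]  # (sequence number, send time in ms)


def impair(
    packets: List[Packet],
    loss_rate: float,
    base_delay_ms: float,
    jitter_ms: float,
    seed: int = 0,
) -> List[Packet]:
    """Simulate a lossy, jittery link.

    Returns (sequence number, arrival time in ms) pairs in arrival order.
    A fixed seed makes a failing scenario reproducible.
    """
    rng = random.Random(seed)
    delivered = []
    for seq, send_ms in packets:
        if rng.random() < loss_rate:
            continue  # packet dropped
        arrival_ms = send_ms + base_delay_ms + rng.uniform(0, jitter_ms)
        delivered.append((arrival_ms, seq))
    delivered.sort()  # the receiver sees packets in arrival order
    return [(seq, arrival_ms) for arrival_ms, seq in delivered]


# Usage: 50 audio frames sent 20 ms apart over a link with 30% loss.
frames = [(i, i * 20.0) for i in range(50)]
received = impair(frames, loss_rate=0.3, base_delay_ms=100, jitter_ms=50, seed=7)
```

When the jitter exceeds the spacing between packets, arrival order can differ from send order, which is the kind of condition a clean office network rarely produces.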
Improvement Plan and Reflections
Plan
Establish client hot‑update capabilities.
Enhance server‑side gray‑release mechanisms.
Implement rigorous project planning and enforce gray‑release procedures for every version.
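As an illustration of what a server‑side gray release could look like in place of manual Excel tracking, here is a minimal sketch of percentage‑based rollout with a kill switch. It is not the team's actual mechanism: the function names, the feature name, and the choice of SHA‑256 bucketing are all assumptions made for the example.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given feature.

    Hashing feature and user together keeps a user's bucket stable across
    requests while decorrelating buckets between different features.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, percent: int, kill_switch: bool = False) -> bool:
    """Decide whether a user gets the new code path.

    percent: share of users (0-100) exposed to the feature.
    kill_switch: when True, everyone falls back to the old path at once.
    """
    if kill_switch:
        return False
    return rollout_bucket(user_id, feature) < percent


# Usage: expose the hypothetical "multi_audio" feature to 5% of users.
use_new_path = is_enabled("user-1024", "multi_audio", percent=5)
```

Because each user's bucket is fixed, raising the percentage only adds users and never reshuffles those already exposed. The kill switch helps only if the old code path still exists, which is the condition the rollback discussion above found missing.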
Reflections
Applying the 80/20 principle, the author notes that 20% of the problem stemmed from personnel skill gaps and process deficiencies, while 80% resulted from a lack of stability awareness and improper fault‑handling. For critical periods like September renewals, the team should avoid full‑scale rollouts, adopt comprehensive gray‑release plans, and execute immediate rollbacks when severe issues arise.
