Zero‑Downtime Secrets: TT Voice’s Multi‑Cloud, AIOps & Resource Optimization
During the 2022 TT Voice Annual Summit, the technical team tackled stability, real‑time risk control, and resource utilization challenges by implementing strict change management, multi‑cloud high‑availability networking, AIOps‑driven monitoring, big‑data processing, and cloud‑native scaling strategies, ultimately delivering zero‑fault operation.
1. Strong Stability Buff
Three main challenges: handling failures caused by release changes (over 70% of incidents), ensuring continuous availability of cross‑cloud network links, and responding rapidly to faults.
1.1 Strengthen Change Management
Adopt systematic change control (schedule releases away from peak traffic, use gray‑release, define acceptance and rollback procedures) and enforce strict review, notification, and high‑risk command auditing, which eliminated change‑induced faults during the event.
1.2 Ensure Cross‑Cloud Link High Availability
Design multiple dedicated lines and VPN backups; when a dedicated line fails, traffic automatically switches to VPN, and reverts when restored. Features include BGP‑based sub‑second failover, VPN load‑balancing, and automatic traffic restoration.
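The prefer-dedicated-line / fall-back-to-VPN / revert-on-recovery policy can be sketched as priority-based path selection. This is a minimal illustration, not the actual BGP implementation: the `Link` class and its `healthy` flag are hypothetical stand-ins for BGP route preference and dedicated-line health checks.

```python
from dataclasses import dataclass

@dataclass
class Link:
    name: str
    priority: int       # lower value = preferred (dedicated line before VPN)
    healthy: bool = True

def select_active_link(links):
    """Pick the highest-priority healthy path. When the dedicated line
    fails, the VPN wins; when it recovers, traffic reverts automatically."""
    candidates = [l for l in links if l.healthy]
    if not candidates:
        raise RuntimeError("no healthy inter-cloud path")
    return min(candidates, key=lambda l: l.priority)
```

In the real system the "health" signal comes from BGP session state and the switch happens in sub-second time; the sketch only captures the selection logic.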
Result: high‑availability inter‑cloud connectivity, a leading solution in the industry.
1.3 Improve Operations Assurance
Introduce AIOps monitoring with text recognition and NLP to detect user‑side issues early, enhancing fault detection efficiency.
Promote fault‑handling awareness with “communicate and synchronize, stay calm” and “recover quickly, then root‑cause” strategies.
Conduct cloud‑vendor risk identification, discovering 41 risks and mitigating 36 before the event.
2. Real‑Time Risk Control Buff
Compliance requires 99.5% of requests to respond within 200 ms. The big‑data team ensures this via compute‑storage separation, data tiering, and service degradation.
2.1 Compute‑Storage Separation
Deploy compute workloads in Docker containers on Kubernetes, decouple compute logic from resources, and store state in object storage, enabling seamless migration and scaling.
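The compute-storage split can be sketched as a worker that keeps no local state: every job checkpoint lives in external storage, so any replica on any node can resume it. The `ObjectStore` class below is a dict-backed stand-in for S3-style object storage; the interface is hypothetical.

```python
import json

class ObjectStore:
    """Dict-backed stand-in for object storage (hypothetical interface)."""
    def __init__(self):
        self._blobs = {}
    def put(self, key, data):
        self._blobs[key] = data
    def get(self, key, default=None):
        return self._blobs.get(key, default)

def process_batch(store, job_id, batch):
    """Stateless compute step: load state, update it, write it back.
    Because state never lives in the pod, migration and scaling are seamless."""
    state = json.loads(store.get(job_id, "{}"))
    state["count"] = state.get("count", 0) + len(batch)
    store.put(job_id, json.dumps(state))
    return state["count"]
```

Two different pods calling `process_batch` with the same `job_id` see a single consistent state, which is what makes rescheduling safe.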
2.2 Data Tiering
Hot user features reside in in‑memory KV stores; cold data in distributed NoSQL KV stores. Real‑time updates use micro‑batch pipelines to memory KV, while less‑time‑critical data updates in batch to NoSQL, improving query performance.
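The tiering can be sketched as a two-level lookup: hot features served from memory, with the NoSQL store as fallback, and separate micro-batch versus batch update paths. Dicts stand in for both stores and the method names are hypothetical.

```python
class TieredFeatureStore:
    def __init__(self, cold_store):
        self.hot = {}            # in-memory KV for hot user features
        self.cold = cold_store   # distributed NoSQL KV (dict stand-in)

    def get(self, user_id):
        """Serve from the hot tier when possible; fall back to cold storage."""
        if user_id in self.hot:
            return self.hot[user_id]
        return self.cold.get(user_id)

    def micro_batch_update(self, updates):
        """Real-time pipeline: small, frequent writes into the hot tier."""
        self.hot.update(updates)

    def batch_update_cold(self, updates):
        """Less time-critical data lands in the cold tier in large batches."""
        self.cold.update(updates)
```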
2.3 Service Degradation
Decouple feature calculation, update, and risk‑identification services. Use asynchronous architecture, fallback features, and distributed processing to maintain availability even if KV stores fail.
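The fallback behaviour can be illustrated in a few lines: if the KV store errors out or returns nothing, the risk-identification path degrades to a precomputed default feature vector instead of failing. The helper name is hypothetical, and the real system is asynchronous and distributed rather than a single call.

```python
def get_features(user_id, kv_get, fallback):
    """kv_get: callable that queries the KV store.
    fallback: precomputed default features used when the store is unavailable,
    so risk identification stays available even if the KV tier fails."""
    try:
        feats = kv_get(user_id)
        return feats if feats is not None else fallback
    except Exception:
        return fallback
```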
During the summit, zero incidents occurred and 99.9% of services responded within 50 ms.
3. High Resource Utilization Buff
Increase pod and node utilization while maintaining stability.
3.1 Application Resource Utilization
3.1.1 Elastic Scaling
95% of applications support horizontal pod autoscaling. Adjust HPA thresholds and resource requests to avoid excessive scaling.
Case: one workload with a misconfigured HPA scaled up and down frequently; after modeling its actual usage and right-sizing its resource requests, the flapping stopped and CPU utilization reached 57% during the summit.
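For reference, the scaling rule the Kubernetes HPA applies is `desired = ceil(current × metric / target)`, with a tolerance band that suppresses small oscillations. A sketch (the 10% tolerance value here is illustrative; the cluster-wide default is configurable):

```python
import math

def desired_replicas(current, metric_value, target_value, tolerance=0.1):
    """Kubernetes HPA scaling rule: ceil(current * metric / target).
    Within the tolerance band around ratio 1.0, no scaling occurs,
    which is what prevents replica-count flapping."""
    ratio = metric_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current
    return math.ceil(current * ratio)
```

This is why both the threshold (`target_value`) and the resource request (which determines the utilization metric) matter: a request that is far below actual usage inflates the ratio and drives excessive scaling.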
3.1.2 Scheduled Scaling
Implement timed scaling before traffic spikes, reducing manual provisioning and improving resource efficiency. Graphs illustrate load before and after optimization.
Figure 1: Application load trend before optimization.
Figure 2: Load after scheduled scaling.
Figure 3: Interface metric after scaling.
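Scheduled scaling can be sketched as a simple schedule lookup: known peak windows map to a pre-warmed replica count, everything else runs at baseline. The window format and function name are invented for illustration.

```python
def replicas_for_hour(hour, schedule, baseline):
    """schedule: list of (start_hour, end_hour, replicas) windows covering
    known traffic peaks; hours are in [0, 24) and windows may wrap midnight.
    Returns the replica count to pre-provision for the given hour."""
    for start, end, n in schedule:
        in_window = (start <= hour < end) or (start > end and (hour >= start or hour < end))
        if in_window:
            return n
    return baseline
```

A controller evaluating this every few minutes (or a CronJob patching the deployment) replaces the manual provisioning the text mentions.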
3.2 Node Allocation Rate
3.2.1 Cluster Node Elastic Scaling (Scheduled)
Enable cloud provider auto‑scaling; however, node provisioning can lag behind traffic peaks. Scheduled scaling at 20:00 and 00:00 pre‑emptively adds nodes, eliminating transient failures.
Figure 4: Node scaling lag during peak.
Figure 5: Service impact due to delayed scaling.
Figure 6: Node allocation trend after scheduled scaling.
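One way to sketch the pre-emptive trigger: shift each known peak earlier by the observed node-provisioning lag, so nodes have joined the cluster before traffic arrives. The 10-minute lag value is an illustrative assumption, not from the text.

```python
def scale_trigger_times(peaks, provision_lag_min=10):
    """peaks: list of 'HH:MM' strings for known traffic spikes
    (20:00 and 00:00 in the text). Returns the times at which node
    scale-up should fire, moved earlier by the provisioning lag."""
    out = []
    for p in peaks:
        h, m = map(int, p.split(":"))
        total = (h * 60 + m - provision_lag_min) % (24 * 60)  # wraps past midnight
        out.append(f"{total // 60:02d}:{total % 60:02d}")
    return out
```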
3.2.2 Optimizing Instance Types
Match pod CPU/MEM request ratios to node capacities and actual usage to reduce fragmentation, raising overall node allocation to over 85%.
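The fragmentation argument can be made concrete: given a node shape and a pod request shape, count how many pods fit and how much capacity is stranded. This is a simplified single-pod-shape bin-packing view with hypothetical names; real clusters mix shapes, but the ratio-matching intuition is the same.

```python
def node_fit(node_cpu, node_mem, pod_cpu, pod_mem):
    """Returns (pods_that_fit, stranded_fraction) for one pod shape on one
    node shape. When the pod's CPU:MEM ratio matches the node's, both
    dimensions fill together and nothing is stranded."""
    pods = min(node_cpu // pod_cpu, node_mem // pod_mem)
    used_cpu, used_mem = pods * pod_cpu, pods * pod_mem
    stranded = 1 - (used_cpu / node_cpu + used_mem / node_mem) / 2
    return pods, stranded
```

For a 16-core / 64 GB node, a 2-core / 8 GB pod (ratio 1:4, matching the node) strands nothing, while a 2-core / 4 GB pod (1:2) exhausts CPU and leaves half the memory idle.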
3.3 Offline‑Online Co‑Location
Separate offline and online workloads in time: a custom scheduler dynamically lends idle node resources to offline jobs while preserving the online SLA. Average node CPU utilization exceeded 40% without resorting to large-scale co-location, approaching the practical limit.
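The co-location quota idea reduces to "capacity minus online usage minus a safety headroom": offline jobs may consume whatever the online workload leaves idle, with a reserved buffer so online SLAs are never squeezed. The 20% headroom below is an assumed illustrative value, not from the text.

```python
def offline_quota(node_capacity, online_usage, headroom=0.2):
    """CPU cores the custom scheduler may hand to offline jobs on this node.
    headroom reserves a fraction of total capacity for online bursts,
    so lending to offline work never threatens the online SLA."""
    reserved = node_capacity * headroom
    return max(0.0, node_capacity - online_usage - reserved)
```

Re-evaluating this quota on each scheduling cycle lets offline jobs shrink immediately when online traffic ramps up.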
4. Summary and Outlook
With strong stability, real‑time risk control, and high resource utilization buffs, the team delivered zero‑fault operation during the summit, though opportunities remain in end‑to‑end delivery smoothness, fault quantity control, metric analysis, and data‑driven improvement.
Future work will focus on incremental innovation, faster iteration, and cross‑team collaboration to continuously enhance technical capabilities.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.