
Zero‑Downtime Secrets: TT Voice’s Multi‑Cloud, AIOps & Resource Optimization

During the 2022 TT Voice Annual Summit, the technical team tackled stability, real‑time risk control, and resource‑utilization challenges. Through strict change management, multi‑cloud high‑availability networking, AIOps‑driven monitoring, big‑data processing, and cloud‑native scaling strategies, it delivered zero‑fault operation throughout the event.


1. Strong Stability Buff

The team faced three main challenges: preventing failures caused by release changes (the source of over 70% of incidents), keeping cross‑cloud network links continuously available, and responding to faults rapidly.

1.1 Strengthen Change Management

Adopt systematic change control: schedule releases away from peak traffic, use gray (canary) releases, and define acceptance and rollback procedures. Enforce strict review, notification, and auditing of high‑risk commands. These measures eliminated change‑induced faults during the event.

1.2 Ensure Cross‑Cloud Link High Availability

Design multiple dedicated lines and VPN backups; when a dedicated line fails, traffic automatically switches to VPN, and reverts when restored. Features include BGP‑based sub‑second failover, VPN load‑balancing, and automatic traffic restoration.
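In practice the failover described above is handled by BGP route preference rather than application code, but the selection logic can be sketched as a priority list of links. The link names below are hypothetical:

```python
# Hypothetical link names; real failover is driven by BGP route
# preference and BFD detection -- this only sketches the selection rule.
LINKS = ["dedicated-line-1", "dedicated-line-2", "vpn-backup"]

def pick_active_link(health: dict) -> str:
    """Return the highest-priority healthy link (dedicated lines first,
    VPN as last resort), mirroring automatic traffic restoration."""
    for link in LINKS:
        if health.get(link, False):
            return link
    raise RuntimeError("no healthy cross-cloud link available")

# Both dedicated lines fail: traffic shifts to the VPN backup.
print(pick_active_link({"dedicated-line-1": False,
                        "dedicated-line-2": False,
                        "vpn-backup": True}))        # vpn-backup
# Lines restored: traffic reverts automatically.
print(pick_active_link({"dedicated-line-1": True,
                        "dedicated-line-2": True,
                        "vpn-backup": True}))        # dedicated-line-1
```

Because the priority order is static, restoration is implicit: as soon as a dedicated line reports healthy again, it wins the selection and traffic moves back off the VPN.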

Result: highly available inter‑cloud connectivity.

1.3 Improve Operations Assurance

Introduce AIOps monitoring with text recognition and NLP to detect user‑side issues early, enhancing fault detection efficiency.
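The production system uses NLP models, but the core idea of spotting user‑side issues from report text can be illustrated with a toy keyword‑signature detector (the keyword set and threshold are assumptions):

```python
from collections import Counter

# Toy stand-in for the NLP pipeline: cluster user-reported messages by
# a crude keyword signature and alert when one signature spikes.
ERROR_KEYWORDS = {"timeout", "disconnect", "lag", "echo", "noise"}

def signature(message: str) -> frozenset:
    """Reduce a free-text report to the set of error keywords it mentions."""
    return frozenset(set(message.lower().split()) & ERROR_KEYWORDS)

def detect_spike(messages, threshold=3):
    """Return keyword signatures reported by at least `threshold` users,
    i.e. candidate user-side incidents worth alerting on early."""
    counts = Counter(sig for m in messages if (sig := signature(m)))
    return {sig: n for sig, n in counts.items() if n >= threshold}

reports = ["voice lag again", "huge lag today", "lag in the room", "all good"]
print(detect_spike(reports))   # {frozenset({'lag'}): 3}
```

A real deployment would replace the keyword match with text classification and cluster semantically similar reports, but the alerting contract is the same: many users, same symptom, short window.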

Promote fault‑handling awareness with “communicate and synchronize, stay calm” and “recover quickly, then root‑cause” strategies.

Conduct cloud‑vendor risk identification, discovering 41 risks and mitigating 36 before the event.

2. Real‑Time Risk Control Buff

Compliance requires 99.5% of requests to respond within 200 ms. The big‑data team ensures this via compute‑storage separation, data tiering, and service degradation.

2.1 Compute‑Storage Separation

Deploy compute workloads in Docker containers on Kubernetes, decouple compute logic from resources, and store state in object storage, enabling seamless migration and scaling.

2.2 Data Tiering

Hot user features reside in in‑memory KV stores; cold data in distributed NoSQL KV stores. Real‑time updates use micro‑batch pipelines to memory KV, while less‑time‑critical data updates in batch to NoSQL, improving query performance.
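The read path for this tiering can be sketched as a two‑level lookup: hot first, cold as fallback. Plain dicts stand in for the in‑memory KV and NoSQL stores:

```python
class TieredFeatureStore:
    """Sketch of hot/cold tiering: hot features in an in-memory dict
    (standing in for a memory KV store), cold in a slower store
    (standing in for distributed NoSQL)."""

    def __init__(self):
        self.hot = {}    # micro-batch pipelines update this tier
        self.cold = {}   # batch jobs update this tier

    def get(self, user_id, feature):
        key = (user_id, feature)
        if key in self.hot:          # fast path for active users
            return self.hot[key]
        return self.cold.get(key)    # fall back to cold storage

    def microbatch_update(self, updates):
        self.hot.update(updates)

    def batch_update(self, updates):
        self.cold.update(updates)
```

The point of the split is that query latency is dominated by the hot tier for active users, while the cold tier keeps the full feature history without paying memory cost for it.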

2.3 Service Degradation

Decouple feature calculation, update, and risk‑identification services. Use asynchronous architecture, fallback features, and distributed processing to maintain availability even if KV stores fail.
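The degradation rule can be sketched as a fallback wrapper around the feature lookup: if the KV store fails, the risk service answers with safe default features instead of failing the request (the function shape and default values are illustrative, not the team's actual API):

```python
def risk_features(user_id, kv_lookup, fallback_defaults):
    """Fetch features for risk identification; degrade to safe defaults
    rather than fail the request when the KV store is unavailable.
    Returns (features, degraded) so callers can emit metrics on fallback."""
    try:
        feats = kv_lookup(user_id)
        if feats is not None:
            return feats, False
    except ConnectionError:
        pass                          # KV outage: do not block the request
    return fallback_defaults, True    # degraded=True for observability
```

Combined with asynchronous updates, this keeps the risk‑identification path available even when a whole storage tier is down, at the cost of temporarily coarser decisions.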

During the summit, zero incidents occurred and 99.9% of services responded within 50 ms.

3. High Resource Utilization Buff

Increase pod and node utilization while maintaining stability.

3.1 Application Resource Utilization

3.1.1 Elastic Scaling

95% of applications support horizontal pod autoscaling. Adjust HPA thresholds and resource requests to avoid excessive scaling.

Case: a pod with misconfigured HPA caused frequent scaling; after modeling resource requests, CPU utilization reached 57% during the summit.
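Why a mis‑set target causes flapping follows directly from the HPA scaling rule, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The utilization numbers below are illustrative:

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization):
    """Kubernetes HPA scaling rule:
    desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

# A target set far below real usage triggers aggressive scale-out:
print(desired_replicas(10, 57, 30))   # 19 pods
# Raising the target (or right-sizing requests) keeps the fleet steady:
print(desired_replicas(10, 57, 60))   # 10 pods
```

Modeling resource requests against observed usage effectively moves the ratio currentMetric/targetMetric close to 1 at peak, which is what let CPU utilization sit at 57% without oscillation.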

3.1.2 Scheduled Scaling

Implement timed scaling before traffic spikes, reducing manual provisioning and improving resource efficiency. Graphs illustrate load before and after optimization.

Figure 1: Application load trend before optimization.

Figure 2: Load after scheduled scaling.

Figure 3: Interface metric after scaling.
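A scheduled (CronHPA‑style) policy reduces to a time‑indexed minimum‑replica table. The times and replica counts below are hypothetical, not the team's actual schedule:

```python
from datetime import datetime

# Hypothetical schedule: pre-scale ahead of the evening traffic spike,
# scale back down late at night.
SCHEDULE = [
    (19, 30, 40),   # at 19:30, raise minimum replicas to 40
    (23, 30, 10),   # at 23:30, drop back to 10
]

def min_replicas_for(now: datetime, baseline: int = 10) -> int:
    """Return the minimum replica count in effect at `now`:
    the last schedule entry whose time has passed today wins."""
    minutes = now.hour * 60 + now.minute
    current = baseline
    for hour, minute, replicas in SCHEDULE:
        if minutes >= hour * 60 + minute:
            current = replicas
    return current

print(min_replicas_for(datetime(2022, 1, 1, 20, 0)))   # 40
print(min_replicas_for(datetime(2022, 1, 1, 8, 0)))    # 10
```

HPA still handles unexpected load; the schedule only raises the floor before a predictable spike so pods are warm when traffic arrives.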

3.2 Node Allocation Rate

3.2.1 Cluster Node Elastic Scaling (Scheduled)

Enable cloud provider auto‑scaling; however, node provisioning can lag behind traffic peaks. Scheduled scaling at 20:00 and 00:00 pre‑emptively adds nodes, eliminating transient failures.

Figure 4: Node scaling lag during peak.

Figure 5: Service impact due to delayed scaling.

Figure 6: Node allocation trend after scheduled scaling.

3.2.2 Optimizing Instance Types

Match pod CPU/MEM request ratios to node capacities and actual usage to reduce fragmentation, raising overall node allocation to over 85%.
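Fragmentation comes from a mismatch between the pod's CPU:memory request ratio and the node's shape: whichever resource runs out first strands the other. A minimal sketch, with hypothetical instance types and capacities:

```python
# Hypothetical node shapes: (CPU cores, memory GiB).
NODE_TYPES = {
    "c6.4xlarge": (16, 32),    # CPU-heavy, 1:2 ratio
    "m6.4xlarge": (16, 64),    # balanced, 1:4 ratio
    "r6.4xlarge": (16, 128),   # memory-heavy, 1:8 ratio
}

def allocation_rate(node_cpu, node_mem, pod_cpu, pod_mem):
    """Average CPU/memory allocation when packing identical pods onto
    one node; whichever resource is exhausted first caps the pod count."""
    pods = min(node_cpu // pod_cpu, node_mem // pod_mem)
    return (pods * pod_cpu / node_cpu + pods * pod_mem / node_mem) / 2

def best_instance_type(pod_cpu, pod_mem):
    """Pick the node shape that strands the least capacity for this pod."""
    return max(NODE_TYPES, key=lambda t: allocation_rate(*NODE_TYPES[t], pod_cpu, pod_mem))

# A 1 CPU / 4 GiB pod has a 1:4 ratio, matching the balanced shape:
print(best_instance_type(1, 4))   # m6.4xlarge
```

Real fleets mix pod shapes, so the team's optimization is closer to bin packing across the whole workload, but the driver is the same ratio‑matching argument.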

3.3 Offline‑Online Co‑Location

Separate offline and online workloads in time: a custom scheduler dynamically allocates idle node resources to offline jobs while preserving the online SLA. Without large‑scale co‑location, average node CPU utilization exceeded 40%, approaching the practical limit for this approach.
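The scheduler's core rule can be sketched as: offline jobs may use whatever the online workload leaves idle, minus a safety reserve that protects online latency (the reserve ratio here is an assumption, not the team's actual parameter):

```python
def offline_quota(node_capacity, online_usage, reserve_ratio=0.2):
    """CPU cores an offline job may borrow on this node right now.
    reserve_ratio holds back headroom so online pods can burst
    without waiting for offline work to be evicted."""
    reserve = node_capacity * reserve_ratio
    return max(0.0, node_capacity - online_usage - reserve)

# 32-core node, online pods using 10 cores: offline work may take up to
print(offline_quota(32, 10))   # 15.6 cores
# When online usage climbs, the offline quota shrinks to zero first:
print(offline_quota(32, 30))   # 0.0
```

Recomputing this quota continuously is what lets the offline tier soak up idle capacity at night while being squeezed out automatically during the evening peak.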

4. Summary and Outlook

With strong stability, real‑time risk control, and high resource utilization buffs, the team delivered zero‑fault operation during the summit, though opportunities remain in end‑to‑end delivery smoothness, fault quantity control, metric analysis, and data‑driven improvement.

Future work will focus on incremental innovation, faster iteration, and cross‑team collaboration to continuously enhance technical capabilities.

Tags: Cloud Native · Operations · Multi‑Cloud · Resource Optimization · AIOps
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany readers throughout their operations careers.
