Migrating QQ Image Service to Tencent Cloud Native (TKE): Architecture, Optimization, and Lessons Learned
The QQ image storage platform was fully migrated from VM‑based servers to Tencent Cloud’s Kubernetes Engine, consolidating services into containers, adding health checks, anti‑affinity, and autoscaling, which cut costs by 26%, reduced ops effort 30%, and improved scalability and reliability.
The QQ image storage platform, a self‑developed solution for handling rich media (images, videos, audio, files) in QQ chats, was migrated to Tencent Cloud to improve stability, scalability, and cost efficiency. The migration used QQ chat images as a representative workload to illustrate the entire process and subsequent optimizations.
In the first phase, the existing TVM instances were replaced with Tencent Cloud CVM, and external services were upgraded to public CLB. Later, the team moved the core modules to the TKE (Tencent Kubernetes Engine) platform, consolidating all critical components into containers for a full‑link cloud solution.
Social media workloads exhibit strong traffic peaks in the evenings and on holidays. Sudden spikes (e.g., a viral video doubling upload/download traffic) exposed the limits of CVM‑based scaling, which could not expand quickly enough; the platform had to over‑provision capacity to absorb high concurrency, which in turn kept CPU utilization low.
To standardize the container environment, a unified base image was built on the tlinux team's standard image. The Dockerfile embeds a docker_run.sh script that sequentially starts all agents, pulls configuration, installs the business program, and keeps the container alive with tail -f /dev/null. The script content is:
#!/bin/bash
sh /usr/local/all-agent-start.sh
sh /etc/rainbow-agent/rainbow-agent-pull.sh
project_name="http"
project_path="/usr/local/storage/${project_name}"
chmod -R 755 ${project_path}
cd ${project_path}/tools/op && ./install.sh
printenv >> /etc/environment
tail -f /dev/null
During the migration, the auto‑generated YAML templates accumulated a templatePool section that could grow unwieldy. The team added the following configuration to clean up unused templates automatically:
autoDeleteExceededMapping: true
autoDeleteUnusedTemplate: true
The image pull policy was tuned to avoid unnecessary re‑pulls. For version‑pinned images (e.g., xx:v_20230711) the policy IfNotPresent was used, while generic :latest tags kept the default Always. The relevant snippet:
imagePullPolicy: IfNotPresent
Health checks were added to ensure zero downtime during pod recreation: readiness probes guarantee that only ready pods receive traffic, liveness probes restart containers that have become unhealthy, and a startup probe was configured for modules with long initialization times.
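As a sketch, probe settings along these lines could express that setup (the container name, port, and thresholds are illustrative, not from the original):

```yaml
containers:
- name: http
  # Startup probe: gives slow-initializing modules up to ~5 minutes
  # before the readiness and liveness probes begin running.
  startupProbe:
    tcpSocket:
      port: 8080        # illustrative port
    failureThreshold: 30
    periodSeconds: 10
  # Readiness probe: only pods that pass receive Service/CLB traffic.
  readinessProbe:
    tcpSocket:
      port: 8080
    periodSeconds: 5
  # Liveness probe: restarts the container if it stops responding.
  livenessProbe:
    tcpSocket:
      port: 8080
    periodSeconds: 10
    failureThreshold: 3
```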
Container placement cost was high because some compression modules ran on low‑cost shared clusters, leading to stability and operational overhead. The solution was to converge all workloads onto TKE, eliminating fragmented clusters.
CPU utilization of the image module was low (≈20% during normal operation, ≤50% at peak). By moving to TKE, the team aimed to improve utilization and reduce waste.
Pod anti‑affinity was introduced to disperse pods across different nodes, preventing CLB stream‑blocking caused by many pods sharing the same host. The anti‑affinity rule used:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: k8s-app
operator: In
values:
- http
topologyKey: kubernetes.io/hostname
weight: 100
Horizontal Pod Autoscaler (HPA) policies were tuned per workload, scaling on CPU load or inbound/outbound traffic, so that peak CPU utilization stayed above 50% while replicas scaled down during off‑peak hours to cut costs.
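A CPU‑based policy of this kind might look like the following autoscaling/v2 manifest (the workload name and replica bounds are illustrative assumptions):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: http-hpa              # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: http
  minReplicas: 4              # floor for off-peak hours
  maxReplicas: 40             # headroom for evening/holiday spikes
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # keeps peak CPU near the 50% goal
```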
To better fit the shared‑cluster model, the team adapted workloads to small‑core instances. Resource limits/requests were aligned (requests = limits) to avoid pre‑emptive eviction, and several workloads were down‑scaled to 4‑core, 2‑core, or even 1‑core configurations. Example resource spec:
resources:
limits:
cpu: "4"
memory: 8Gi
networkbandwidth.tkex.woa.com/size: 1500Mi
teg.tkex.oa.com/amd-cpu: 4k
tke.cloud.tencent.com/eni-ip: "1"
requests:
cpu: "4"
memory: 8Gi
networkbandwidth.tkex.woa.com/size: 1500Mi
teg.tkex.oa.com/amd-cpu: 4k
tke.cloud.tencent.com/eni-ip: "1"
The team also explored AVIF (AV1 Image File Format) to reduce bandwidth. AVIF offers roughly 20‑30% higher compression than H.265‑based image formats at comparable quality, but conversion to AVIF increased CPU time about sixfold. The trade‑off was accepted deliberately: the higher CPU cost roughly halves image size and saves bandwidth.
Graceful termination was standardized with terminationGracePeriodSeconds: 75 to allow sufficient time for pods to finish in‑flight requests before being killed.
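In a pod spec this looks like the snippet below; the preStop sleep is an illustrative addition commonly paired with the grace period so the load balancer can drain connections before the business process receives SIGTERM:

```yaml
spec:
  terminationGracePeriodSeconds: 75
  containers:
  - name: http
    lifecycle:
      preStop:
        exec:
          # Illustrative: hold the pod briefly so the CLB deregisters
          # it before termination proceeds.
          command: ["sh", "-c", "sleep 10"]
```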
Overall, the migration achieved zero incidents, a 26% reduction in resource cost (replacing fixed‑size CVM with fine‑grained containers), a 30% reduction in operational labor, and improved scalability through multi‑AZ, multi‑workload disaster recovery, pod anti‑affinity, and HPA‑driven elasticity.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.