Migrating QQ Image Service to Tencent Cloud Native (TKE): Architecture, Optimization, and Lessons Learned
The QQ image storage platform was fully migrated from VM‑based servers to Tencent Cloud’s Kubernetes Engine, consolidating services into containers, adding health checks, anti‑affinity, and autoscaling, which cut costs by 26%, reduced ops effort 30%, and improved scalability and reliability.
The QQ image storage platform, a self‑developed solution for handling rich media (images, videos, audio, files) in QQ chats, was migrated to Tencent Cloud to improve stability, scalability, and cost efficiency. The migration used QQ chat images as a representative workload to illustrate the entire process and subsequent optimizations.
In the first phase, the existing TVM instances were replaced with Tencent Cloud CVM, and external services were upgraded to public CLB. Later, the team moved the core modules to the TKE (Tencent Kubernetes Engine) platform, consolidating all critical components into containers for a full‑link cloud solution.
Social media workloads exhibit strong traffic peaks in the evenings and on holidays. Sudden spikes (e.g., a viral video doubling upload/download traffic) exposed the limits of CVM‑based scaling, which could not expand quickly enough; the platform had to over‑provision capacity to absorb high concurrency, which in turn kept CPU utilization low.
To standardize the container environment, a unified base image was built on the tlinux team's standard image. The Dockerfile embeds a docker_run.sh script that sequentially starts all agents, pulls configuration, installs the business program, and keeps the container alive with tail -f /dev/null. The script content is:
#!/bin/bash
sh /usr/local/all-agent-start.sh
sh /etc/rainbow-agent/rainbow-agent-pull.sh
project_name="http"
project_path="/usr/local/storage/${project_name}"
chmod -R 755 ${project_path}
cd ${project_path}/tools/op && ./install.sh
printenv >> /etc/environment
tail -f /dev/null
During the migration, the auto‑generated YAML templates accumulated a templatePool section that could grow unwieldy. The team added the following configuration to clean up unused templates automatically:
autoDeleteExceededMapping: true
autoDeleteUnusedTemplate: true
The image pull policy was tuned to avoid unnecessary re‑pulls. For version‑pinned images (e.g., xx:v_20230711) the policy IfNotPresent was used, while generic :latest tags kept the default Always. The relevant snippet:
imagePullPolicy: IfNotPresent
Health checks were added to ensure zero downtime during pod recreation: readiness probes guarantee that only ready pods receive traffic, liveness probes restart containers that have become unhealthy, and a startup probe was configured for modules with long initialization times.
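As a sketch, probe settings along these lines could express that setup (the container name, port, and thresholds are illustrative, not from the original):

```yaml
containers:
- name: http
  # Startup probe: gives slow-initializing modules up to ~5 minutes
  # before the readiness and liveness probes begin running.
  startupProbe:
    tcpSocket:
      port: 8080        # illustrative port
    failureThreshold: 30
    periodSeconds: 10
  # Readiness probe: only pods that pass receive Service/CLB traffic.
  readinessProbe:
    tcpSocket:
      port: 8080
    periodSeconds: 5
  # Liveness probe: restarts the container if it stops responding.
  livenessProbe:
    tcpSocket:
      port: 8080
    periodSeconds: 10
    failureThreshold: 3
```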
Container placement cost was high because some compression modules ran on low‑cost shared clusters, leading to stability and operational overhead. The solution was to converge all workloads onto TKE, eliminating fragmented clusters.
CPU utilization of the image module was low (≈20% during normal operation, ≤50% at peak). By moving to TKE, the team aimed to improve utilization and reduce waste.
Pod anti‑affinity was introduced to disperse pods across different nodes, preventing CLB stream‑blocking caused by many pods sharing the same host. The anti‑affinity rule used:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: k8s-app
operator: In
values:
- http
topologyKey: kubernetes.io/hostname
weight: 100
Horizontal Pod Autoscaler (HPA) policies were tuned per workload, scaling on CPU load or inbound/outbound traffic, so that peak CPU utilization stayed above 50% while replicas scaled down during off‑peak hours to cut costs.
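A CPU‑based policy of this kind might look like the following autoscaling/v2 manifest (the workload name and replica bounds are illustrative assumptions):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: http-hpa              # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: http
  minReplicas: 4              # floor for off-peak hours
  maxReplicas: 40             # headroom for evening/holiday spikes
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # keeps peak CPU near the 50% goal
```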
To better fit the shared‑cluster model, the team adapted workloads to small‑core instances. Resource limits/requests were aligned (requests = limits) to avoid pre‑emptive eviction, and several workloads were down‑scaled to 4‑core, 2‑core, or even 1‑core configurations. Example resource spec:
resources:
limits:
cpu: "4"
memory: 8Gi
networkbandwidth.tkex.woa.com/size: 1500Mi
teg.tkex.oa.com/amd-cpu: 4k
tke.cloud.tencent.com/eni-ip: "1"
requests:
cpu: "4"
memory: 8Gi
networkbandwidth.tkex.woa.com/size: 1500Mi
teg.tkex.oa.com/amd-cpu: 4k
tke.cloud.tencent.com/eni-ip: "1"
The team also explored AVIF (AV1 Image File Format) to reduce bandwidth. AVIF offers roughly 20‑30% higher compression than H.265‑based image formats at comparable quality, but conversion to AVIF increased CPU time about sixfold. The trade‑off was accepted deliberately: the higher CPU cost roughly halves image size and saves bandwidth.
Graceful termination was standardized with terminationGracePeriodSeconds: 75 to allow sufficient time for pods to finish in‑flight requests before being killed.
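In a pod spec this looks like the snippet below; the preStop sleep is an illustrative addition commonly paired with the grace period so the load balancer can drain connections before the business process receives SIGTERM:

```yaml
spec:
  terminationGracePeriodSeconds: 75
  containers:
  - name: http
    lifecycle:
      preStop:
        exec:
          # Illustrative: hold the pod briefly so the CLB deregisters
          # it before termination proceeds.
          command: ["sh", "-c", "sleep 10"]
```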
Overall, the migration achieved zero incidents, a 26% reduction in resource cost (replacing fixed‑size CVM with fine‑grained containers), a 30% reduction in operational labor, and improved scalability through multi‑AZ, multi‑workload disaster recovery, pod anti‑affinity, and HPA‑driven elasticity.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.