
Elastic Computing Platform for Massive Image Compression and Multi‑Workload Services

The article describes how an elastic container‑based computing platform replaces tens of thousands of physical servers to deliver billions of daily image‑compression operations, while also supporting video transcoding, Spark jobs, and AI workloads through resource isolation, named services, dynamic scheduling, and load‑balancing techniques.

Architecture Digest

The original image‑compression service relied on about 24,000 physical machines; the new elastic platform now uses only 6,000 containers to handle hundreds of billions of compressions per day, while also serving video transcoding, Spark, and AI (Go game, Honor of Kings) workloads. By reusing low‑load resources across the network, the platform aggregates up to 700,000 CPU cores with an average utilization of 56%.

Background: QQ Album, WeChat image sharing, and Moments generate nearly 100 billion images daily, creating a massive compression demand that runs on the TCS elastic computing platform.

Analysis of the previous approach, in which compression was deployed mixed with storage machines, highlighted three major issues:

- Low resource utilization: static provisioning leaves capacity idle during off-peak periods.
- High operations cost: frequent hardware provisioning and manual scaling.
- Business interference: CPU and memory contention between the compression and storage services.

Platform solutions:

Resource Isolation – Containers use Docker and cgroup quotas (quota, share, period) for CPU and memory isolation, complemented by a dynamic CPU‑binding strategy that schedules containers onto low‑load cores.
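As a minimal sketch of the isolation mechanics above (the helper names and load values are illustrative, not the platform's actual code), a CPU limit maps onto the cgroup CFS quota/period pair, and dynamic binding amounts to pinning a container to the currently least-loaded cores:

```python
# Sketch: translating a fractional CPU limit into cgroup v1 CFS
# parameters (cpu.cfs_quota_us / cpu.cfs_period_us), as a container
# runtime does when enforcing an isolation quota.

CFS_PERIOD_US = 100_000  # default CFS scheduling period (100 ms)

def cfs_quota_for(cpu_limit: float, period_us: int = CFS_PERIOD_US) -> int:
    """Return the cfs_quota_us value granting `cpu_limit` cores per period."""
    if cpu_limit <= 0:
        raise ValueError("cpu_limit must be positive")
    return int(cpu_limit * period_us)

def pick_low_load_cores(core_loads: dict[int, float], n: int) -> list[int]:
    """Dynamic binding sketch: choose the n least-loaded cores to pin to."""
    return sorted(core_loads, key=core_loads.get)[:n]

# A 2.5-core container pinned onto the two least-loaded cores:
print(cfs_quota_for(2.5))                                        # 250000
print(pick_low_load_cores({0: 0.9, 1: 0.2, 2: 0.4, 3: 0.1}, 2))  # [3, 1]
```

In practice the runtime writes these values into the container's cgroup files; the sketch only shows the arithmetic and the core-selection policy.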

Named Services – Instead of managing individual master modules for each compression pool, the platform provides a name‑based service that automatically attaches resources, offers load‑balancing, and handles fault removal, reducing operational overhead and speeding up developer onboarding.
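The named-service idea can be sketched as a small registry (class and endpoint names here are invented for illustration): containers attach under a service name, requests are balanced across healthy endpoints, and failed endpoints are removed from rotation:

```python
# Hypothetical sketch of a name-based service registry with
# round-robin load balancing and automatic fault removal.

class NamedService:
    def __init__(self, name: str):
        self.name = name
        self.endpoints: list[str] = []
        self._rr = 0  # round-robin cursor

    def attach(self, endpoint: str) -> None:
        """Attach a new container endpoint to the service name."""
        self.endpoints.append(endpoint)

    def remove_faulty(self, endpoint: str) -> None:
        """Fault removal: stop routing to a failed container."""
        if endpoint in self.endpoints:
            self.endpoints.remove(endpoint)

    def pick(self) -> str:
        """Round-robin load balancing across attached endpoints."""
        if not self.endpoints:
            raise RuntimeError(f"no endpoints for {self.name}")
        ep = self.endpoints[self._rr % len(self.endpoints)]
        self._rr += 1
        return ep

svc = NamedService("image-compress")
svc.attach("10.0.0.1:8080")
svc.attach("10.0.0.2:8080")
print(svc.pick())  # 10.0.0.1:8080
svc.remove_faulty("10.0.0.1:8080")
print(svc.pick())  # 10.0.0.2:8080
```

The point for developers is that callers only ever hold the service name; attachment, balancing, and fault handling happen behind it.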

Automatic Scheduling – The platform implements three scheduling modes based on real‑time metrics collected from containers (CPU, memory, disk I/O, network I/O):

1. Dynamic Scheduling – Expands resources within seconds when load exceeds a high threshold and contracts within minutes when load falls below a low threshold.

2. Abnormal Scheduling – Monitors CPI (Cycles Per Instruction) to detect abnormal CPU behavior; when CPI deviates from the model, the platform removes or replaces the affected containers.

3. Perception Scheduling – Uses business‑level indicators such as compression latency and failure rate; if these degrade while resource metrics remain normal, the platform demotes or replaces the problematic containers.
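The three modes above can be condensed into one per-container decision function. This is a hedged sketch: the thresholds, CPI baseline, and metric field names are illustrative placeholders, not the platform's actual values:

```python
# Sketch of the three scheduling checks on one container's metrics:
# 1) dynamic scheduling on load thresholds, 2) abnormal scheduling on
# CPI deviation, 3) perception scheduling on business-level indicators.

HIGH_LOAD, LOW_LOAD = 0.80, 0.30         # dynamic-scheduling thresholds
CPI_BASELINE, CPI_TOLERANCE = 1.2, 0.5   # modeled CPI and allowed drift
MAX_LATENCY_MS, MAX_FAIL_RATE = 200, 0.01

def schedule_decision(m: dict) -> str:
    """Return the scheduling action for one container's metrics dict."""
    # 1. Dynamic scheduling: expand/contract on resource load.
    if m["cpu_load"] > HIGH_LOAD:
        return "expand"
    if m["cpu_load"] < LOW_LOAD:
        return "contract"
    # 2. Abnormal scheduling: CPI deviating from the model signals
    #    abnormal CPU behavior -> replace the container.
    if abs(m["cpi"] - CPI_BASELINE) > CPI_TOLERANCE:
        return "replace:abnormal-cpi"
    # 3. Perception scheduling: business metrics degrade while resource
    #    metrics look normal -> demote/replace the container.
    if m["latency_ms"] > MAX_LATENCY_MS or m["fail_rate"] > MAX_FAIL_RATE:
        return "replace:degraded-service"
    return "keep"

print(schedule_decision({"cpu_load": 0.9, "cpi": 1.2, "latency_ms": 50, "fail_rate": 0.0}))
# expand
print(schedule_decision({"cpu_load": 0.5, "cpi": 2.0, "latency_ms": 50, "fail_rate": 0.0}))
# replace:abnormal-cpi
```

The ordering matters: load thresholds run first because expansion is time-critical (seconds), while CPI and business-metric checks catch the slower, subtler failure modes.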

Load Balancing – Benchmarks are run on heterogeneous CPU models to compute a performance coefficient for each type; this coefficient drives resource weighting, with fine‑tuning based on observed latency and failure rates.
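A minimal sketch of that weighting scheme (CPU model names, scores, and the tuning rule are made up for illustration): each model's benchmark score is normalized against a reference model to get its coefficient, which is then shrunk when observed latency exceeds the target:

```python
# Sketch: per-CPU-model weights derived from benchmark scores,
# fine-tuned by observed latency against a latency target.

benchmark_scores = {"cpu_model_a": 100.0, "cpu_model_b": 130.0, "cpu_model_c": 85.0}
reference = "cpu_model_a"

def performance_coefficients(scores: dict, ref: str) -> dict:
    """Coefficient = benchmark score relative to the reference model."""
    base = scores[ref]
    return {model: s / base for model, s in scores.items()}

def tuned_weight(coeff: float, observed_latency_ms: float, target_ms: float) -> float:
    """Fine-tune: shrink a model's weight when it runs hotter than target."""
    return coeff * min(1.0, target_ms / observed_latency_ms)

coeffs = performance_coefficients(benchmark_scores, reference)
print(coeffs["cpu_model_b"])  # 1.3
print(tuned_weight(coeffs["cpu_model_b"], observed_latency_ms=260, target_ms=200))
```

Traffic split in proportion to these weights sends a faster CPU model correspondingly more work, while the latency factor pulls back any model whose real-world behavior lags its benchmark.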

Summary and Outlook: By the end of the year the platform aims to schedule up to 1 million CPU cores, continuously mining idle capacity to provide low‑cost, high‑throughput compute for various services, aligning with the company’s AI initiatives and cost‑optimization goals.

Tags: image compression, Cloud Platform, Dynamic Scaling, Resource Isolation, elastic computing, container scheduling
Written by

Architecture Digest

Focused on Java backend development, covering application architecture at top-tier internet companies (high availability, high performance, high stability), big data, machine learning, and other popular fields.
