
How GPU, VPU, and CPU Accelerate Cloud Video Transcoding: Architecture and Best Practices

This article traces the rapid growth of video traffic and explains why transcoding is essential; compares CPU, GPU, and VPU hardware for video processing; details the FFmpeg software stack; describes the design of a cloud‑native transcoding cluster, including its scheduling and shard‑transcoding techniques; and presents performance test results.

Qingyun Technology Community

Introduction

With video content now accounting for over 80% of internet traffic, transcoding has become a fundamental requirement for audio‑video services. This article shares QingCloud's experience in accelerating video transcoding using heterogeneous hardware such as GPUs and VPUs.

Why Transcoding Is a Must

Production of audio‑video streams often generates a single format, while consumption must adapt to diverse network conditions, device capabilities, and user preferences (e.g., multi‑screen, watermarking, up‑scaling). Rapid evolution of codecs (H.265, VP9, AV1) further drives the need for flexible transcoding pipelines.

Typical Transcoding Scenarios

Enterprise or personal transcoding software (e.g., video editors, format converters).

Cloud SaaS transcoding services embedded in VOD, live‑streaming, or standalone PaaS offerings.

Hardware Options

CPU

The CPU offers the highest compatibility: software decoding and encoding cover every codec. Its limited parallelism, however, leads to low throughput and higher latency for video‑intensive tasks.

GPU

The GPU provides massive parallel compute units plus dedicated hardware codecs, delivering high transcoding concurrency, but it consumes more power and requires careful cooling and PCIe slot planning.

VPU

The VPU is a dedicated video processing chip with built‑in codec acceleration and optional AI engines for quality enhancement. It offers low power consumption and high concurrency, though it typically supports only H.264/H.265 and lacks the GPU's flexibility for general AI workloads.

Software Stack

QingCloud builds on the FFmpeg multimedia framework. The stack includes:

Tools: ffmpeg (command‑line transcoder), ffprobe (media metadata inspection), ffplay (player).

Core libraries: libavcodec (codec operations), libavfilter (filters such as crop, scale, watermark), libavformat (container handling).

Hardware‑acceleration SDKs: vendor‑provided plugins for Intel/NVIDIA GPUs and VPU chips, enabling FFmpeg to invoke heterogeneous accelerators uniformly.
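As a concrete illustration of how the same FFmpeg front end can target either a software or a hardware codec path, the sketch below builds an ffmpeg argument list for a simple H.264 transcode. The NVENC flags (`-hwaccel cuda`, `-c:v h264_nvenc`) are standard FFmpeg options for NVIDIA GPUs; the file names and the helper itself are illustrative, not part of QingCloud's actual tooling.

```python
# Sketch: choosing FFmpeg arguments for a software (CPU) or NVIDIA GPU
# (NVENC) encode path. File names are placeholders.

def build_ffmpeg_args(src, dst, use_gpu=False):
    """Return an ffmpeg argument list for a simple H.264 transcode."""
    args = ["ffmpeg", "-y"]
    if use_gpu:
        # Decode on the GPU via CUDA and encode with the NVENC hardware encoder.
        args += ["-hwaccel", "cuda", "-i", src, "-c:v", "h264_nvenc"]
    else:
        # Pure software path with libx264.
        args += ["-i", src, "-c:v", "libx264"]
    args += ["-c:a", "copy", dst]  # pass the audio stream through unchanged
    return args

print(build_ffmpeg_args("in.mp4", "out.mp4", use_gpu=True))
```

A VPU path would look the same from the caller's perspective, swapping in the vendor plugin's encoder name, which is exactly the uniformity the SDK layer provides.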

Transcoding Task Classification

Tasks are divided into video‑centric (e.g., video transcoding, watermark, adaptive bitrate) and non‑video (audio transcoding, clipping, frame extraction, split/merge). Video‑centric tasks prefer GPU/VPU for speed, while non‑video tasks run efficiently on CPU.
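The classification above maps naturally onto a small routing function. The sketch below is a minimal illustration, assuming the two task families described in the text; the task names, hardware labels, and the preference order (VPU first for power efficiency, then GPU, then CPU fallback) are illustrative assumptions, not QingCloud's actual policy.

```python
# Hypothetical task-to-hardware routing based on the classification above.

VIDEO_TASKS = {"transcode", "watermark", "adaptive_bitrate"}
NON_VIDEO_TASKS = {"audio_transcode", "clip", "frame_extract", "split", "merge"}

def select_hardware(task, gpu_free=True, vpu_free=True):
    """Pick a hardware class for a task; video-centric tasks prefer accelerators."""
    if task in VIDEO_TASKS:
        if vpu_free:
            return "VPU"   # best performance-per-watt for H.264/H.265
        if gpu_free:
            return "GPU"
        return "CPU"       # fall back to software encoding
    return "CPU"           # non-video tasks run efficiently on CPU

print(select_hardware("watermark"))      # VPU when a slot is free
print(select_hardware("frame_extract"))  # CPU
```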

Cluster Architecture

The transcoding cluster consists of a gateway, a control plane, and multiple workers. Users submit jobs via the cloud‑VOD console, SDK, or API; media files are stored in QingCloud Object Storage. The controller receives jobs, selects appropriate hardware based on task type and worker load, and dispatches work to workers.

Concurrency Scheduling

Workers periodically report hardware load (CPU/GPU/VPU utilization, encoder/decoder slots) and system metrics (CPU, memory, bandwidth). The controller assigns tasks to the least‑loaded workers, using a high‑priority queue for split‑transcoding jobs.
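A least‑loaded assignment like the one described can be sketched in a few lines. The report structure below (a `load` map of per‑hardware utilization keyed by worker ID) is an invented stand‑in for whatever the real heartbeat payload contains.

```python
# Minimal least-loaded scheduler sketch; report field names are assumptions.

def pick_worker(reports, hw="GPU"):
    """Choose the worker with the lowest utilization for the given hardware type."""
    candidates = [(r["load"][hw], r["worker_id"]) for r in reports if hw in r["load"]]
    if not candidates:
        return None  # no worker exposes this hardware type
    return min(candidates)[1]

reports = [
    {"worker_id": "w1", "load": {"GPU": 0.8, "CPU": 0.3}},
    {"worker_id": "w2", "load": {"GPU": 0.2, "CPU": 0.9}},
]
print(pick_worker(reports, "GPU"))  # w2 (lowest GPU utilization)
```

A production scheduler would also weigh encoder/decoder slot counts, memory, and bandwidth from the same reports, and consult the high‑priority queue first.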

Shard (Fragment) Transcoding

Long‑duration videos are split into multiple fragments based on key‑frames. Each fragment is transcoded in parallel on different workers, then merged into a final output. This horizontal scaling reduces total processing time without requiring more powerful single nodes.

Technical Principles of Shard Transcoding

Split the source file into time‑based fragments.

Transcode each fragment with identical codec parameters, possibly on different workers.

Merge all fragment results into a single file.

Optionally segment the merged file for HLS/DASH delivery.
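The four steps above can be sketched as FFmpeg command construction. The segment muxer (`-f segment`) performs a stream‑copy split at keyframe boundaries, and the concat demuxer merges fragments without re‑encoding; both are standard FFmpeg features. Fragment length and file names are placeholders.

```python
# Sketches of the split, per-fragment transcode, and merge commands.

def split_cmd(src, seconds=60):
    # Stream-copy split at keyframe boundaries into numbered fragments.
    return ["ffmpeg", "-i", src, "-c", "copy", "-f", "segment",
            "-segment_time", str(seconds), "-reset_timestamps", "1",
            "frag_%03d.mp4"]

def transcode_cmd(frag, out):
    # Every fragment uses identical codec parameters so results concatenate cleanly.
    return ["ffmpeg", "-i", frag, "-c:v", "libx264", "-preset", "fast", out]

def merge_cmd(list_file, dst):
    # Concat demuxer merges the fragment list without re-encoding.
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_file,
            "-c", "copy", dst]

print(merge_cmd("frags.txt", "final.mp4"))
```

Splitting at keyframes matters: each fragment must start on an IDR frame so it can be decoded independently of its neighbors.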

Execution Flow

User initiates a transcoding request.

Gateway validates the request and forwards it to the controller.

Controller determines task type, selects hardware, and decides whether to shard.

If sharding is needed, a worker performs the split (CPU‑only).

Split fragments are distributed to workers for parallel transcoding.

Workers report progress; the controller may replace lagging workers.

When all fragments finish, a designated MainWorker merges them (CPU) and uploads the result.

Distributed Coordination

The controller monitors the progress of each worker. If a worker falls behind, a redundant worker is launched on the same fragment; whichever copy finishes first wins, and the slower worker is then cancelled.
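Detecting a lagging worker can be as simple as comparing each fragment's progress against the average across fragments. The sketch below is one possible detection rule, with an invented threshold; the real controller's criteria are not described in detail here.

```python
# Hypothetical straggler detection: flag fragments whose progress falls
# well below the mean, so a redundant worker can be launched for them.

def find_stragglers(progress, mean_factor=0.5):
    """Return fragment IDs progressing at less than mean_factor of the average."""
    if not progress:
        return []
    avg = sum(progress.values()) / len(progress)
    return [fid for fid, p in progress.items() if p < avg * mean_factor]

progress = {"frag0": 0.9, "frag1": 0.85, "frag2": 0.1}
print(find_stragglers(progress))  # ['frag2']
```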

Parallel Pipeline Design

Three main stages—download, transcode, upload—are pipelined. Multiple downloads can run concurrently, feeding a queue of ready‑to‑transcode files; transcoding runs in parallel on available hardware; uploads are handled by a dedicated process, ensuring full utilization of network bandwidth and compute resources.
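The staged pipeline can be sketched with one thread per stage and bounded queues between them, so a slow stage applies backpressure instead of stalling the others. The stage bodies below are stubs standing in for the real download, transcode, and upload steps.

```python
# Three-stage download → transcode → upload pipeline sketch with bounded
# queues between stages. The per-stage work functions are placeholders.

import queue
import threading

def stage(inq, outq, work):
    while True:
        item = inq.get()
        if item is None:          # sentinel: propagate shutdown downstream
            if outq is not None:
                outq.put(None)
            break
        result = work(item)
        if outq is not None:
            outq.put(result)

dl_q, tx_q, up_q = (queue.Queue(maxsize=4) for _ in range(3))
uploaded = []

threads = [
    threading.Thread(target=stage, args=(dl_q, tx_q, lambda f: f)),           # download stub
    threading.Thread(target=stage, args=(tx_q, up_q, lambda f: f + ".mp4")),  # transcode stub
    threading.Thread(target=stage, args=(up_q, None, uploaded.append)),       # upload stub
]
for t in threads:
    t.start()
for job in ["a", "b", "c"]:
    dl_q.put(job)
dl_q.put(None)
for t in threads:
    t.join()
print(uploaded)  # ['a.mp4', 'b.mp4', 'c.mp4']
```

In the real cluster each stage runs with its own parallelism (multiple concurrent downloads, hardware‑bound transcodes, a dedicated upload process), but the queue‑coupled structure is the same.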

Logging and Alerting

Extensive use of public‑cloud components (message queues, databases, object storage) is complemented by ELK logging and alerting. JSON log fields are indexed for fast queries (e.g., by transcoding request ID). Kibana dashboards display worker load, request counts, and lag metrics.
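A structured log line suitable for ELK indexing might look like the sketch below. The field names (`request_id`, `worker`, `event`, and so on) are assumptions chosen to match the queries described above, not QingCloud's actual schema.

```python
# Illustrative JSON log entry; field names are hypothetical but chosen so
# Elasticsearch can index and filter by transcoding request ID.

import json
import time

def log_event(request_id, worker, event, **fields):
    entry = {"ts": time.time(), "request_id": request_id,
             "worker": worker, "event": event, **fields}
    return json.dumps(entry)

line = log_event("req-42", "w2", "fragment_done", fragment=3, lag_ms=120)
print(line)
```

Emitting one JSON object per line keeps ingestion trivial for Filebeat/Logstash and lets Kibana dashboards aggregate on any field.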

Performance Tests

Test environment: Ubuntu 20.04, FFmpeg 4.4, 2×Intel Xeon 16‑core CPUs, 1×data‑center GPU (PCIe x16), 4×VPU cards (combined PCIe x16). Tests compared single‑task latency, concurrent lane count, and power consumption across CPU, GPU, and VPU. Results show GPU delivers the highest raw performance, VPU offers the best performance‑per‑watt, and CPU provides universal compatibility.

Conclusion

QingCloud’s transcoding solution combines heterogeneous hardware acceleration, a modular FFmpeg‑based software stack, and a scalable cloud‑native architecture. Shard transcoding and pipeline parallelism enable efficient processing of long‑duration videos, while the scheduling and monitoring mechanisms ensure high availability and optimal resource utilization.

Tags: distributed systems, cloud computing, GPU acceleration, FFmpeg, hardware acceleration, video transcoding, VPU
Written by

Qingyun Technology Community

Official account of the Qingyun Technology Community, focusing on tech innovation, supporting developers, and sharing knowledge. Born to Learn and Share!
