How GPU, VPU, and CPU Accelerate Cloud Video Transcoding: Architecture and Best Practices
This article examines the rapid growth of video traffic, explains why transcoding is essential, compares CPU, GPU, and VPU hardware for video processing, details the FFmpeg software stack, and describes the design of a cloud‑native transcoding cluster, covering its scheduling, its shard‑transcoding technique, and performance test results.
Introduction
With video content now accounting for over 80% of internet traffic, transcoding has become a fundamental requirement for audio‑video services. This article shares QingCloud's experience in accelerating video transcoding using heterogeneous hardware such as GPUs and VPUs.
Why Transcoding Is a Must
Production of audio‑video streams often generates a single format, while consumption must adapt to diverse network conditions, device capabilities, and user preferences (e.g., multi‑screen, watermarking, up‑scaling). Rapid evolution of codecs (H.265, VP9, AV1) further drives the need for flexible transcoding pipelines.
Typical Transcoding Scenarios
Enterprise or personal transcoding software (e.g., video editors, format converters).
Cloud SaaS transcoding services embedded in VOD, live‑streaming, or standalone PaaS offerings.
Hardware Options
CPU
The CPU offers the broadest compatibility, decoding and encoding every codec in software, but its limited parallelism yields low throughput and higher latency on video‑intensive workloads.
GPU
GPU provides massive parallel compute units and dedicated hardware codecs, delivering high concurrency for video transcoding, but consumes more power and requires careful cooling and PCIe slot planning.
VPU
VPU is a dedicated video processing chip with built‑in codec acceleration and optional AI engines for quality enhancement. It offers low power consumption and high concurrency, though it typically supports only H.264/H.265 and lacks the flexibility of GPU for AI workloads.
Software Stack
QingCloud builds on the FFmpeg multimedia framework. The stack includes:
Tools: ffmpeg (command‑line transcoder), ffprobe (media metadata), ffplay (player).
Core libraries: libavcodec (codec operations), libavfilter (filters such as crop, scale, watermark), libavformat (container handling).
Hardware‑acceleration SDKs: vendor‑provided plugins for Intel/NVIDIA GPUs and VPU chips, enabling FFmpeg to invoke heterogeneous accelerators uniformly.
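To make the accelerator choice concrete, here is a minimal sketch of how a service might build an ffmpeg command line for either a software or a hardware encoder. The encoder names libx264 and h264_nvenc and the -hwaccel cuda flag are real FFmpeg options; the bitrate and overall argument layout are illustrative, not QingCloud's actual invocation.

```python
def build_ffmpeg_cmd(src: str, dst: str, backend: str = "cpu") -> list[str]:
    """Build an ffmpeg argv for an H.264 transcode on the chosen backend."""
    if backend == "gpu":
        # NVIDIA path: decode via CUDA, encode via the NVENC hardware block.
        # -hwaccel must precede -i; the encoder choice follows the input.
        return ["ffmpeg", "-y", "-hwaccel", "cuda", "-i", src,
                "-c:v", "h264_nvenc", "-b:v", "4M", dst]
    if backend == "cpu":
        # Software path: libx264 works everywhere, at higher CPU cost.
        return ["ffmpeg", "-y", "-i", src, "-c:v", "libx264", "-b:v", "4M", dst]
    raise ValueError(f"unknown backend: {backend}")
```

Because only the codec arguments change, the rest of the pipeline (filters, containers) stays identical across backends, which is what lets FFmpeg treat heterogeneous accelerators uniformly.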
Transcoding Task Classification
Tasks are divided into video‑centric (e.g., video transcoding, watermark, adaptive bitrate) and non‑video (audio transcoding, clipping, frame extraction, split/merge). Video‑centric tasks prefer GPU/VPU for speed, while non‑video tasks run efficiently on CPU.
Cluster Architecture
The transcoding cluster consists of a gateway, a control plane, and multiple workers. Users submit jobs via the cloud‑VOD console, SDK, or API; media files are stored in QingCloud Object Storage. The controller receives jobs, selects appropriate hardware based on task type and worker load, and dispatches work to workers.
Concurrency Scheduling
Workers periodically report hardware load (CPU/GPU/VPU utilization, encoder/decoder slots) and system metrics (CPU, memory, bandwidth). The controller assigns tasks to the least‑loaded workers, using a high‑priority queue for split‑transcoding jobs.
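A least‑loaded scheduler with a priority lane for shard jobs can be sketched as follows; the class and field names are hypothetical, and real load reports would carry per‑accelerator slot counts rather than a single scalar.

```python
import heapq
from itertools import count

class Scheduler:
    """Dispatch jobs to the least-loaded worker; shard jobs jump the queue."""

    def __init__(self):
        self._queue = []     # heap of (priority, seq, job); seq keeps FIFO order
        self._seq = count()
        self.loads = {}      # worker id -> last reported load in [0, 1]

    def report(self, worker: str, load: float) -> None:
        """Record a periodic load report from a worker."""
        self.loads[worker] = load

    def submit(self, job: str, shard: bool = False) -> None:
        # priority 0 = shard-transcoding fragments, 1 = everything else
        heapq.heappush(self._queue, (0 if shard else 1, next(self._seq), job))

    def dispatch(self):
        """Pop the highest-priority job and pair it with the idlest worker."""
        if not self._queue or not self.loads:
            return None
        _, _, job = self._queue and heapq.heappop(self._queue)
        worker = min(self.loads, key=self.loads.get)
        return worker, job
```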
Shard (Fragment) Transcoding
Long‑duration videos are split into multiple fragments based on key‑frames. Each fragment is transcoded in parallel on different workers, then merged into a final output. This horizontal scaling reduces total processing time without requiring more powerful single nodes.
Technical Principles of Shard Transcoding
Split the source file into time‑based fragments.
Transcode each fragment with identical codec parameters, possibly on different workers.
Merge all fragment results into a single file.
Optionally segment the merged file for HLS/DASH delivery.
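The split step above hinges on cutting only at key frames, so each fragment decodes independently. A minimal sketch of the fragment planner, assuming the key‑frame timestamps have already been extracted (e.g. with ffprobe):

```python
def plan_fragments(keyframes: list[float], target: float):
    """Cut the timeline into fragments that start on key frames and run
    roughly `target` seconds each, so every piece decodes on its own."""
    cuts = [keyframes[0]]
    for t in keyframes[1:]:
        # Close the current fragment at the first key frame past `target`.
        if t - cuts[-1] >= target:
            cuts.append(t)
    spans = list(zip(cuts, cuts[1:]))
    spans.append((cuts[-1], None))   # final fragment runs to end of stream
    return spans
```

Because every fragment is encoded with identical codec parameters, the merge step reduces to stream concatenation without re‑encoding.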
Execution Flow
User initiates a transcoding request.
Gateway validates the request and forwards it to the controller.
Controller determines task type, selects hardware, and decides whether to shard.
If sharding is needed, a worker performs the split (CPU‑only).
Split fragments are distributed to workers for parallel transcoding.
Workers report progress; the controller may replace lagging workers.
When all fragments finish, a designated MainWorker merges them (CPU) and uploads the result.
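Step 3's shard decision can be reduced to a small predicate. The 600‑second threshold and the restriction to video transcoding are illustrative assumptions; the trade‑off is that short jobs do not repay the split and merge overhead.

```python
def should_shard(duration_s: float, task: str, threshold_s: float = 600.0) -> bool:
    """Shard only long video-transcoding jobs; for short inputs the
    split/merge overhead outweighs the parallel speedup."""
    return task == "transcode" and duration_s > threshold_s
```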
Distributed Coordination
The controller monitors the progress of each worker. If a worker falls behind, a redundant worker is launched on the same fragment; whichever copy finishes first wins, and the other is cancelled.
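A simple lag detector for this speculative‑execution scheme might compare each worker's progress against its peers; the median comparison and the 0.5 lag factor are assumptions for illustration.

```python
import statistics

def needs_backup(progress: dict[str, float], worker: str,
                 lag_factor: float = 0.5) -> bool:
    """Launch a redundant worker when `worker` has completed less than
    `lag_factor` times the median progress of its peers."""
    peers = [p for w, p in progress.items() if w != worker]
    if not peers:
        return False   # nothing to compare against
    return progress[worker] < lag_factor * statistics.median(peers)
```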
Parallel Pipeline Design
Three main stages—download, transcode, upload—are pipelined. Multiple downloads can run concurrently, feeding a queue of ready‑to‑transcode files; transcoding runs in parallel on available hardware; uploads are handled by a dedicated process, ensuring full utilization of network bandwidth and compute resources.
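The three‑stage pipeline can be sketched with bounded queues joining thread pools, so downloads, transcodes, and uploads overlap instead of running serially. The stage functions are passed in as callables; queue sizes, lane counts, and the poison‑pill shutdown are illustrative choices, not the production design.

```python
import queue
import threading

def pipeline(files, download, transcode, upload, lanes=2):
    """Run download -> transcode -> upload as overlapping pipeline stages."""
    ready = queue.Queue(maxsize=lanes)   # bounded: downloads can't run away
    done = queue.Queue()
    results = []

    def downloader():
        for f in files:
            ready.put(download(f))
        for _ in range(lanes):
            ready.put(None)              # one poison pill per transcode lane

    def transcoder():
        while (item := ready.get()) is not None:
            done.put(transcode(item))
        done.put(None)                   # signal this lane is drained

    def uploader():
        finished = 0
        while finished < lanes:          # run until every lane has drained
            item = done.get()
            if item is None:
                finished += 1
            else:
                results.append(upload(item))

    threads = ([threading.Thread(target=downloader)]
               + [threading.Thread(target=transcoder) for _ in range(lanes)]
               + [threading.Thread(target=uploader)])
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The bounded `ready` queue is the key design point: it applies backpressure so network‑bound downloads never outpace the available transcode hardware.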
Logging and Alerting
Extensive use of public‑cloud components (message queues, databases, object storage) is complemented by ELK logging and alerting. JSON log fields are indexed for fast queries (e.g., by transcoding request ID). Kibana dashboards display worker load, request counts, and lag metrics.
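A structured log emitter along these lines keeps field names stable so ELK can index them; the exact field names here (`request_id`, `stage`, and so on) are hypothetical, not QingCloud's schema.

```python
import json
import time

def log_event(request_id: str, worker: str, stage: str, **extra) -> str:
    """Emit one JSON log line with fixed field names, so log indexes
    stay queryable by transcoding request ID."""
    record = {"ts": time.time(), "request_id": request_id,
              "worker": worker, "stage": stage, **extra}
    return json.dumps(record, sort_keys=True)
```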
Performance Tests
Test environment: Ubuntu 20.04, FFmpeg 4.4, 2×Intel Xeon 16‑core CPUs, 1×data‑center GPU (PCIe x16), 4×VPU cards (combined PCIe x16). Tests compared single‑task latency, concurrent lane count, and power consumption across CPU, GPU, and VPU. Results show GPU delivers the highest raw performance, VPU offers the best performance‑per‑watt, and CPU provides universal compatibility.
Conclusion
QingCloud’s transcoding solution combines heterogeneous hardware acceleration, a modular FFmpeg‑based software stack, and a scalable cloud‑native architecture. Shard transcoding and pipeline parallelism enable efficient processing of long‑duration videos, while the scheduling and monitoring mechanisms ensure high availability and optimal resource utilization.
Qingyun Technology Community
Official account of the Qingyun Technology Community, focusing on tech innovation, supporting developers, and sharing knowledge. Born to Learn and Share!
