How to Build a Low‑Cost, High‑Concurrency Distributed Video Transcoding System on AWS
This article explains the design of a distributed video transcoding platform that leverages AWS Lambda and EC2‑SLAVE to achieve high‑concurrency, low‑latency streaming, detailing architecture, load balancing, health checks, task monitoring, and cost‑saving strategies for scalable cloud‑based video processing.
Overview
Video transcoding is the core technology used in the online video industry. It compresses high‑resolution videos into multiple resolutions for adaptive streaming on various devices. Videos are sliced, each fragment is transcoded into several resolutions, and then merged into a streaming format.
Transcoding is necessary but costly. For our workload, third‑party services such as Amazon Elastic Transcoder were expensive and limited in performance and flexibility. Beyond transcoding itself, CDN, storage, and bandwidth costs can also be high, especially for large videos.
To address these challenges we built a distributed transcoding system that reduces costs, enables flexible scaling, and improves speed. It supports high‑concurrency multi‑video transcoding, adaptive adjustment, and per‑video bitrate optimization, saving bandwidth and delivering smooth streaming.
Video Download and Access
High Concurrency
For distributed transcoding, many AWS Lambda or EC2 compute units request segments of the same video concurrently. Lambda can scale to thousands of workers almost instantly, but in our setup S3 did not serve that many simultaneous segment requests without added latency. We therefore deploy our own file servers to provide stable, high‑concurrency fragment access.
Minimum Disk Usage
We use HTTP Range requests to fetch only the fragments a worker needs, and FFmpeg can read its input over HTTP with range requests. Several file servers may each hold a copy of a video, but each server stores at most one copy, which avoids redundant downloads and storage. We run a few high‑performance servers instead of many low‑performance nodes, which reduces network and disk load.
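As a minimal sketch of the range-based fetch described above, the following Python builds a request for a single byte range without sending it. The URL and byte offsets are hypothetical. A real worker would derive offsets from the container index or let FFmpeg issue the range requests itself.

```python
import urllib.request


def build_range_request(url: str, start: int, length: int) -> urllib.request.Request:
    """Return a GET request for `length` bytes starting at byte offset `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical file-server URL, for illustration only.
req = build_range_request("http://fileserver.example/videos/source.mp4", 1_048_576, 4_194_304)
range_value = req.get_header("Range")
```

A server that honors the header answers with status 206 (Partial Content) and a Content-Range header, so the worker never downloads the rest of the file.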
Intelligent Load Balancing
File servers monitor their own transcoding queues and predict future load. A server expected to become heavily loaded voluntarily yields new download requests to less‑loaded servers, achieving a polite, self‑balancing distribution without a central dispatcher.
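The yielding rule can be sketched as follows. This is an illustration under assumptions, not the system's actual predictor: the `ServerLoad` fields, the forecast value `expected_new_jobs`, and the 0.8 high-water mark are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServerLoad:
    name: str
    queued_jobs: int        # transcoding jobs already waiting on this server
    expected_new_jobs: int  # hypothetical forecast of jobs arriving soon
    capacity: int           # jobs the server can hold comfortably

    def predicted_utilization(self) -> float:
        return (self.queued_jobs + self.expected_new_jobs) / self.capacity


def should_yield(me: ServerLoad, peers: list[ServerLoad], high_water: float = 0.8) -> bool:
    """Yield a new download if we expect to be busy and some peer expects to be less busy."""
    mine = me.predicted_utilization()
    if mine < high_water:
        return False
    return any(p.predicted_utilization() < mine for p in peers)
```

Each server evaluates the rule locally against the loads its peers report, which is why no central dispatcher is needed.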
Automatic Health Checking
Each file server continuously monitors CPU, memory, network, and disk usage. If load exceeds a threshold, the server delays new downloads, ensuring high‑priority transcoding tasks are not affected and protecting the server from overload.
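One way to express the threshold check is a function that converts resource readings into a delay for new downloads. The metric names, the thresholds, and the delay formula below are assumptions chosen for illustration. Readings are expected as utilization fractions between 0 and 1.

```python
def download_delay_seconds(metrics: dict[str, float],
                           thresholds: dict[str, float],
                           base_delay: float = 5.0) -> float:
    """Return 0 when healthy, otherwise a delay that grows with the worst overshoot."""
    worst_overshoot = 0.0
    for name, limit in thresholds.items():
        overshoot = metrics.get(name, 0.0) - limit
        worst_overshoot = max(worst_overshoot, overshoot)
    if worst_overshoot <= 0:
        return 0.0
    # Scale the delay with how far the worst metric exceeds its threshold.
    return base_delay * (1.0 + 10.0 * worst_overshoot)


# Hypothetical thresholds for CPU, memory, network, and disk utilization.
THRESHOLDS = {"cpu": 0.85, "memory": 0.90, "network": 0.80, "disk": 0.90}
```

Delaying downloads rather than rejecting them keeps the request in the pipeline while running transcoding tasks keep their share of the machine.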
Timely Disk Recycling
Large source videos (up to 200 GB) are stored on high‑speed SSDs. Disk space is limited, so the system monitors video usage and automatically removes videos no longer needed, freeing space for upcoming tasks while prioritizing active transcoding.
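A minimal sketch of the recycling decision, assuming a least-recently-used policy restricted to videos with no active tasks. The source does not state the eviction order, so the policy and the `StoredVideo` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StoredVideo:
    video_id: str
    size_gb: float
    last_used: float   # timestamp of the most recent fragment request
    active_tasks: int  # transcoding tasks that still need this source


def pick_evictions(videos: list[StoredVideo], free_gb: float, needed_gb: float) -> list[str]:
    """Choose idle videos to delete, least recently used first, until enough space is free."""
    evict: list[str] = []
    idle = sorted((v for v in videos if v.active_tasks == 0), key=lambda v: v.last_used)
    for video in idle:
        if free_gb >= needed_gb:
            break
        evict.append(video.video_id)
        free_gb += video.size_gb
    return evict
```

Videos that still have active tasks are never candidates, which matches the goal of prioritizing ongoing transcoding over freeing space.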
Split Main Task and Trigger Sub Task
Split Main Task
Video transcoding is CPU‑intensive. We split each main transcoding job (a specific resolution/bitrate) into many small sub‑tasks, each handling a video segment, allowing concurrent processing across multiple compute units.
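The split can be sketched as below, assuming fixed-length segments. The 10-second default and the `SubTask` fields are hypothetical, and a production splitter would usually cut on keyframe boundaries rather than at exact time offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubTask:
    main_task_id: str
    index: int
    start_s: float
    duration_s: float


def split_main_task(main_task_id: str, video_duration_s: float,
                    segment_s: float = 10.0) -> list[SubTask]:
    """Cut one main task into consecutive sub-tasks of at most `segment_s` seconds."""
    if video_duration_s <= 0 or segment_s <= 0:
        raise ValueError("durations must be positive")
    subtasks: list[SubTask] = []
    index = 0
    while index * segment_s < video_duration_s:
        start = index * segment_s  # computed from the index to avoid float drift
        duration = min(segment_s, video_duration_s - start)
        subtasks.append(SubTask(main_task_id, index, start, duration))
        index += 1
    return subtasks
```

Because each sub-task carries its own start offset and duration, any compute unit can process it without knowing about the others.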
Trigger Sub Task
Sub‑tasks are placed in an internal queue and then dispatched to EC2‑SLAVE or AWS Lambda based on current resource availability, which avoids keeping a large pool of idle EC2 instances.
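A simplified dispatcher might look like this. Preferring idle EC2‑SLAVE slots before falling back to Lambda is an assumption made for the sketch, as are the parameter names; the source only says that the choice depends on resource availability.

```python
from collections import deque


def dispatch(queue: deque, free_ec2_slots: int, lambda_budget: int) -> list[tuple[str, str]]:
    """Assign queued sub-tasks to idle EC2-SLAVE slots first, then to Lambda."""
    assignments: list[tuple[str, str]] = []
    while queue and (free_ec2_slots > 0 or lambda_budget > 0):
        subtask = queue.popleft()
        if free_ec2_slots > 0:
            free_ec2_slots -= 1
            assignments.append((subtask, "ec2-slave"))
        else:
            lambda_budget -= 1
            assignments.append((subtask, "lambda"))
    return assignments
```

Sub-tasks that cannot be placed stay in the queue and are picked up on the next dispatch round.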
Monitor Everything
Machine Level Resource Monitoring
AWS CloudWatch tracks real‑time resource usage of transcoding machines and the master node, primarily for historical analysis.
Business Level Machine Load Monitoring
File servers report load in real time; the upstream module uses this to decide when to trigger additional transcoding units. EC2 workers also report load to adjust their task intake.
Queue Level Congestion Monitoring
Message queues control module interactions. If a queue becomes congested, upstream distribution rates are throttled, and load‑balancing strategies (preemption or politeness) are applied.
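As an illustration of throttling on queue depth, the function below scales the upstream rate linearly between two watermarks. The watermark scheme and all parameter names are assumptions; the source does not describe the exact throttling curve.

```python
def throttled_rate(max_rate: float, queue_depth: int, low_mark: int, high_mark: int) -> float:
    """Full rate at or below low_mark, zero at or above high_mark, linear in between."""
    if low_mark >= high_mark:
        raise ValueError("low_mark must be smaller than high_mark")
    if queue_depth <= low_mark:
        return max_rate
    if queue_depth >= high_mark:
        return 0.0
    return max_rate * (high_mark - queue_depth) / (high_mark - low_mark)
```

With a maximum of 100 messages per second and watermarks at 100 and 500, a queue depth of 300 halves the upstream rate.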
Task Level Status Monitoring
Each transcoding task records state transitions, enabling real‑time observation of main and sub‑tasks, cost calculation, automatic retries for failed sub‑tasks, and overall system health checks.
Main Task Monitoring
State transitions for a main task: Ready → Downloading → Downloaded → Accepted (split into sub‑tasks) → Running → Succeeded or Failed.
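The main-task lifecycle can be encoded as a small transition table. This sketch takes the listed sequence literally, so a task can only fail from Running; whether the real system also allows failure during download is not stated in the source.

```python
from enum import Enum


class MainState(Enum):
    READY = "Ready"
    DOWNLOADING = "Downloading"
    DOWNLOADED = "Downloaded"
    ACCEPTED = "Accepted"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


# Allowed next states, copied from the sequence listed in the text.
ALLOWED = {
    MainState.READY: {MainState.DOWNLOADING},
    MainState.DOWNLOADING: {MainState.DOWNLOADED},
    MainState.DOWNLOADED: {MainState.ACCEPTED},
    MainState.ACCEPTED: {MainState.RUNNING},
    MainState.RUNNING: {MainState.SUCCEEDED, MainState.FAILED},
}


def advance(current: MainState, new: MainState) -> MainState:
    """Return the new state, or raise if the transition is not in the table."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

Rejecting illegal transitions makes the recorded history trustworthy for monitoring and cost calculation.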
Sub Task Monitoring
Sub‑task states: Ready → CRFRunning → CRFSucceeded, CRFFailed, or CRFTimeout → CRFReady → Running → Succeeded, Failed, or Timeout. A sub‑task that ends in Failed or Timeout may cause the main task to fail.
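How sub-task outcomes roll up into the main task can be sketched as follows. The retry limit of three attempts and the `SubTaskRecord` structure are hypothetical; the source only says that failed sub-tasks are retried automatically and that a failure or timeout may fail the main task.

```python
from dataclasses import dataclass

TERMINAL_FAILURES = {"Failed", "Timeout"}


@dataclass
class SubTaskRecord:
    state: str
    attempts: int = 1


def next_action(sub: SubTaskRecord, max_attempts: int = 3) -> str:
    """Decide what to do with one sub-task: nothing, retry it, or give up."""
    if sub.state not in TERMINAL_FAILURES:
        return "none"
    return "retry" if sub.attempts < max_attempts else "give_up"


def main_task_state(subs: list[SubTaskRecord], max_attempts: int = 3) -> str:
    """Derive the main task state from the states of its sub-tasks."""
    if any(next_action(s, max_attempts) == "give_up" for s in subs):
        return "Failed"
    if all(s.state == "Succeeded" for s in subs):
        return "Succeeded"
    return "Running"
```

A sub-task that fails but still has attempts left keeps the main task in Running, so a single transient error does not fail the whole job.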
Low Cost Transcoding
AWS Lambda
Sub‑tasks are CPU‑intensive but short‑lived; AWS Lambda provides abundant low‑cost CPU cycles by utilizing idle resources, charging only for execution time.
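Because Lambda bills for execution time, the cost of a batch of sub-tasks can be estimated from duration and memory size. Lambda charges per GB-second of compute plus a per-request fee. The prices are left as parameters here, and the figures in the example call are placeholders, not current AWS rates.

```python
def lambda_cost_usd(invocations: int, avg_duration_s: float, memory_mb: int,
                    price_per_gb_second: float, price_per_million_requests: float) -> float:
    """Estimate Lambda cost as GB-seconds of compute plus a per-request charge."""
    gb_seconds = invocations * avg_duration_s * (memory_mb / 1024)
    compute_cost = gb_seconds * price_per_gb_second
    request_cost = invocations / 1_000_000 * price_per_million_requests
    return compute_cost + request_cost


# Placeholder prices for illustration only; check the current AWS price list.
estimate = lambda_cost_usd(1000, 30.0, 2048,
                           price_per_gb_second=0.00002,
                           price_per_million_requests=0.20)
```

Recording the duration of every sub-task, as the task-level monitoring does, provides the inputs for this calculation.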
EC2‑SLAVE
Our custom EC2‑SLAVE module runs on regular EC2 instances, monitors host resources, and opportunistically executes sub‑tasks when CPU is idle, preserving host performance while maximizing free compute capacity.
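The opportunistic intake can be approximated by comparing the host's load average with its CPU count. This is a sketch under assumptions: the reserve fraction and the one-sub-task-per-CPU model are hypothetical, and the real EC2‑SLAVE module may use different signals.

```python
def free_slots(load_1min: float, cpu_count: int, reserve_fraction: float = 0.3) -> int:
    """Number of sub-tasks to accept while leaving `reserve_fraction` of CPUs to the host."""
    usable = cpu_count * (1.0 - reserve_fraction)
    return max(0, int(usable - load_1min))


# On a Unix host the inputs could come from the standard library, for example:
#   import os
#   slots = free_slots(os.getloadavg()[0], os.cpu_count() or 1)
```

Keeping a reserve means the host's own workload always has headroom, so borrowed CPU time does not degrade it.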
Dispatching Control
Excessive Lambda or EC2‑SLAVE usage can overload the file server, increasing latency and cost. Dispatch rates are throttled based on file‑server load reports to maintain high concurrency and low latency.
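One common way to implement this kind of throttle is additive increase with multiplicative decrease, sketched below. The source does not name its control algorithm, so this scheme and every constant in it are assumptions.

```python
def next_concurrency(current: int, file_server_load: float, target_load: float = 0.75,
                     step: int = 10, backoff: float = 0.5,
                     floor: int = 1, ceiling: int = 1000) -> int:
    """Raise the worker limit slowly while the file server is healthy, cut it fast when it is not."""
    if file_server_load > target_load:
        proposed = int(current * backoff)  # multiplicative decrease
    else:
        proposed = current + step          # additive increase
    return max(floor, min(ceiling, proposed))
```

The dispatcher would call this each time a file server reports its load and cap the number of in-flight Lambda and EC2‑SLAVE sub-tasks at the returned value.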
High Scalability
The number of file servers and the number of EC2 instances can both auto‑scale according to queue length, so the system handles large and small workloads alike and meets diverse business needs.
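Scaling on queue length reduces to a sizing rule like the one below. The tasks-per-instance figure and the instance bounds are hypothetical parameters.

```python
import math


def desired_instances(queue_length: int, tasks_per_instance: int,
                      min_instances: int = 1, max_instances: int = 50) -> int:
    """Size the fleet so each instance handles at most `tasks_per_instance` queued tasks."""
    if tasks_per_instance <= 0:
        raise ValueError("tasks_per_instance must be positive")
    wanted = math.ceil(queue_length / tasks_per_instance)
    return max(min_instances, min(max_instances, wanted))
```

The lower bound keeps a minimum fleet ready for new work, and the upper bound caps spending during spikes.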
Summary
This article focused on how our distributed transcoding system stabilizes high‑concurrency video services while reducing computing costs. The next article will cover bitrate optimization methods that further lower bandwidth expenses and improve user experience.
MXPlayer Technical Team
Technical articles and experience sharing. MXPLAYER is the top-ranked online video content platform in India, and also the world's largest player app, with 100M+ DAU and hundreds of millions of MAU.
