Design and Architecture of a New Video Review System with Streamlined Frame Extraction and Parallel Processing
This article presents the design goals, architecture, technology selection, and component details of a unified video review system that leverages FFmpeg for frame extraction, stream‑based parallel processing, and flexible synchronous/asynchronous workflows to achieve low latency and high scalability.
As advertising and short‑video content become increasingly important channels for user acquisition, video review systems face stricter performance requirements. This article analyzes the shortcomings of the existing systems and proposes a new unified video review architecture that reduces operational costs while improving latency and broadening protocol support.
Background and Goals – The legacy system consists of multiple solutions with inconsistent performance and high maintenance overhead. The new design aims to consolidate services, provide a standard API, support complete protocol suites, and significantly accelerate review turnaround.
Design Objectives – Optimize review latency by streaming download, frame extraction, inference, and notification; enable parallel processing within each stage; and support diverse interface protocols for short‑video, long‑video, and live‑stream scenarios.
Frame Extraction Technology Selection – FFmpeg is chosen for its cross‑platform support for more than 200 codecs and 180 container formats. Both its library API and its command‑line interface are evaluated; the CLI is preferred for the video review use case because of its simplicity, stability, and ability to perform streaming download, segment‑wise parallel processing, and custom audio/video parameter configuration.
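A minimal sketch of how the CLI approach might be driven from application code. The wrapper builds an FFmpeg command that reads directly from a URL, so decoding begins while the download is still in progress; the function names, fps default, and output naming are illustrative, not the system's actual interface.

```python
import subprocess
from pathlib import Path

def build_extract_cmd(video_url: str, out_dir: str, fps: int = 1) -> list:
    """Build an FFmpeg command line for streaming frame extraction.

    FFmpeg downloads and decodes incrementally, so frame files appear
    while the download is still running -- no full pre-download step.
    """
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", video_url,           # input may be a local path or an HTTP(S) URL
        "-vf", f"fps={fps}",       # sample `fps` frames per second of video
        "-q:v", "2",               # JPEG quality scale (2 = near-lossless)
        str(Path(out_dir) / "frame_%05d.jpg"),
    ]

def extract_frames(video_url: str, out_dir: str, fps: int = 1) -> None:
    """Run the extraction, raising if FFmpeg exits non-zero."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_extract_cmd(video_url, out_dir, fps), check=True)
```

Separating command construction from execution keeps the FFmpeg invocation testable without a video file on hand.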
Streaming Processing Framework – The core task processor comprises a frame‑extraction engine, a task driver, and a review business object. The driver schedules parallel download, frame extraction, inference, and response stages, allowing multiple processor instances to handle concurrent video tasks based on CPU resources.
Frame Extraction Engine – Handles both image and audio frame extraction. Image frames are produced by launching multiple FFmpeg processes that seek and split the video into segments for parallel processing, while audio frames are extracted either as a whole file for ASR or dynamically segmented for live streams using segmenting, VAD, encoding, and collection steps.
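The segment‑wise parallelism described above can be sketched as follows, assuming the video duration is already known (e.g. probed beforehand). Each worker launches its own FFmpeg process with a fast input seek (`-ss` before `-i`) and a segment length (`-t`); the splitting policy and worker count here are illustrative.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def plan_segments(duration: float, workers: int) -> list:
    """Split [0, duration) into `workers` near-equal (start, length) chunks."""
    step = duration / workers
    return [(round(i * step, 3), round(step, 3)) for i in range(workers)]

def extract_segment(video_url, start, length, out_pattern, fps=1):
    """Extract frames from one segment of the video."""
    subprocess.run([
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-ss", str(start),        # seek BEFORE -i: fast, keyframe-based seek
        "-i", video_url,
        "-t", str(length),        # stop after this many seconds of input
        "-vf", f"fps={fps}",
        out_pattern,
    ], check=True)

def parallel_extract(video_url, duration, workers=4):
    """Run one FFmpeg process per segment concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i, (start, length) in enumerate(plan_segments(duration, workers)):
            pool.submit(extract_segment, video_url, start, length,
                        f"seg{i}_%05d.jpg")
```

Threads suffice here because the heavy lifting happens in the child FFmpeg processes, not in Python.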
Review Business Module – Interacts with the task processor and inference services, driving a single‑threaded loop that polls frame queues, performs asynchronous inference, and uploads results. It can emit real‑time responses when the strategy enables the real‑time switch.
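A sketch of the single‑threaded polling loop described above, under the assumption that inference calls are offloaded to a thread pool while the loop itself never blocks on them. The `None` sentinel, the `emit` hook, and the class shape are hypothetical, not the system's real API.

```python
import queue
import time
from concurrent.futures import ThreadPoolExecutor

class ReviewLoop:
    """Single-threaded driver: poll frames, run inference off-thread,
    collect results, and optionally emit them in real time."""

    def __init__(self, infer_fn, realtime=False, workers=4):
        self.frames = queue.Queue()     # fed by the frame-extraction engine
        self.infer_fn = infer_fn        # blocking call to an inference service
        self.realtime = realtime        # strategy's real-time switch
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.results = []

    def run(self):
        pending, done_feeding = [], False
        while not done_feeding or pending:
            if not done_feeding:
                frame = self.frames.get()
                if frame is None:        # sentinel: extraction finished
                    done_feeding = True
                else:
                    pending.append(self.pool.submit(self.infer_fn, frame))
            # Harvest finished inferences without blocking the loop.
            still = []
            for fut in pending:
                if fut.done():
                    result = fut.result()
                    self.results.append(result)
                    if self.realtime:
                        self.emit(result)  # push partial verdicts immediately
                else:
                    still.append(fut)
            pending = still
            if done_feeding and pending:
                time.sleep(0.01)          # avoid a busy spin while draining
        return self.results

    def emit(self, result):
        pass  # hypothetical hook: send a real-time notification upstream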
Task Scheduler – Provides non‑blocking status and command interfaces for the frame engine and business module. A single‑threaded scheduler orchestrates streaming calls between the producer (frame engine) and consumer (business object) to complete the video review workflow.
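One way to picture the scheduler: every component exposes a non‑blocking status call, and a single thread polls them in a loop, streaming frames from producer to consumer. The stage names and method signatures below are assumptions for illustration.

```python
import time
from enum import Enum

class Stage(Enum):
    EXTRACTING = 1   # frames still being produced
    REVIEWING = 2    # extraction done, draining inference
    DONE = 3

class TaskScheduler:
    """Single-threaded orchestrator. It only ever makes non-blocking
    calls, so one scheduler thread can drive the whole workflow."""

    def __init__(self, engine, business):
        self.engine = engine        # producer: frame-extraction engine
        self.business = business    # consumer: review business object
        self.stage = Stage.EXTRACTING

    def tick(self):
        """One non-blocking scheduling pass; returns the current stage."""
        if self.stage is Stage.EXTRACTING:
            self.business.consume(self.engine.poll_frames())  # stream frames
            if self.engine.finished():
                self.business.close()          # signal: no more frames coming
                self.stage = Stage.REVIEWING
        elif self.stage is Stage.REVIEWING:
            if self.business.finished():
                self.stage = Stage.DONE
        return self.stage

    def run(self, interval=0.01):
        while self.tick() is not Stage.DONE:
            time.sleep(interval)
```

Because `tick` never blocks, many scheduler instances can share a small thread pool, matching the article's point about scaling processor instances to available CPU.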
Result Service – Consumes events from the main service via MQ, offering audit logs, active query APIs, and global retry mechanisms to ensure no task loss during host or container failures.
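The per‑event retry behind such a guarantee might look like the sketch below: bounded attempts with exponential backoff, after which the event is handed back to the caller for dead‑lettering and the global retry sweep. The function and parameter names are illustrative.

```python
import time

def process_with_retry(handler, event, max_attempts=3, base_delay=0.5):
    """Invoke an MQ event handler, retrying with exponential backoff.

    Returns True on success. After max_attempts failures the caller is
    expected to dead-letter the event so the global retry mechanism can
    pick it up later (dead-lettering itself is out of scope here).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return True
        except Exception:
            if attempt == max_attempts:
                return False   # caller dead-letters and writes the audit log
            time.sleep(base_delay * (2 ** (attempt - 1)))
```

Combined with MQ acknowledgment semantics, this ensures a host or container crash only delays a task rather than losing it.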
Strategy Configuration – Allows per‑video customization of processing pipelines and business rules, supporting both synchronous and asynchronous modes, and can be deployed in various cluster configurations.
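To make the idea concrete, a per‑video strategy could be modeled as a small validated config object. All field names here are invented for illustration; the real schema is not described in the article.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewStrategy:
    """Per-video review strategy (hypothetical fields, not the real schema)."""
    mode: str = "async"              # "sync" returns inline; "async" notifies via MQ
    frame_fps: int = 1               # image sampling rate for frame extraction
    enable_asr: bool = True          # whether to run the audio review pipeline
    realtime_results: bool = False   # stream partial verdicts as they arrive
    models: list = field(default_factory=lambda: ["image", "audio", "text"])

    def validate(self):
        """Reject inconsistent combinations before a task is scheduled."""
        if self.mode not in ("sync", "async"):
            raise ValueError(f"unknown mode: {self.mode}")
        if self.mode == "sync" and self.realtime_results:
            raise ValueError("real-time result streaming requires async mode")
        return self
```

Validating at submission time keeps bad sync/async combinations from ever reaching the task scheduler.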
Testing and Validation – In a 16‑core container environment, the system meets design targets: 1‑minute, 100 MB videos are reviewed within 2 seconds; long‑video asynchronous processing achieves a 5× speedup over the previous version; graceful shutdown and global retry guarantee task reliability; and modular interfaces lay the groundwork for future multi‑scenario distributed systems.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.