How Baidu Built a Scalable AIGC Video Production Engine with State‑Machine Orchestration
This article details Baidu's end‑to‑end AIGC video production pipeline, explaining the business drivers, the challenges of automating script‑to‑video conversion, the service‑orchestration architecture based on state‑machine scheduling, module‑component decomposition, configuration files, and the practical workflow that now supports tens of thousands of videos per day.
Background
Short‑form video consumption continues to grow in China, which had over 1.05 billion internet users as of June 2022. Baidu’s Baijiahao platform hosts a massive volume of text‑based content, prompting the need for an AI‑driven solution that can automatically transform articles into videos, reducing creator effort and cost.
Challenges
After launching the AIGC project, Baidu experimented with various video‑generation approaches and settled on a "collect‑edit" (采编式) workflow that treats text as a script and automatically assembles video and image assets. Key technical challenges include:
Coordinating dozens of microservices without manual intervention.
Supporting rapid business‑logic changes while keeping the system stable.
Ensuring extensibility for future feature additions.
Orchestration Approaches
Typical service‑orchestration solutions rely on scheduled jobs, message queues, and persistent storage to chain microservices. Each microservice reports its execution state to a database (e.g., MySQL), and a scheduler drives the overall flow based on these states.
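As a rough illustration of this conventional pattern, the sketch below keeps one status row per step and lets a scheduler pick the next step from those rows. It uses Python's built-in sqlite3 as a stand-in for MySQL; the table name, step names, and status values are all hypothetical and do not come from Baidu's system.

```python
import sqlite3

# Hypothetical, strictly sequential pipeline steps (illustrative names only).
PIPELINE = ["step_a", "step_b", "step_c"]


def init_db(conn):
    """Create the status table each microservice reports into."""
    conn.execute(
        "CREATE TABLE step_status ("
        "task_id TEXT, step TEXT, status TEXT, PRIMARY KEY (task_id, step))"
    )


def create_task(conn, task_id):
    """Persist a new task with every step marked as pending."""
    conn.executemany(
        "INSERT INTO step_status VALUES (?, ?, 'pending')",
        [(task_id, step) for step in PIPELINE],
    )


def report(conn, task_id, step, status):
    """Called by a microservice to record 'done' or 'failed'."""
    conn.execute(
        "UPDATE step_status SET status = ? WHERE task_id = ? AND step = ?",
        (status, task_id, step),
    )


def next_step(conn, task_id):
    """Scheduler logic: return the first unfinished step, or None."""
    rows = dict(
        conn.execute(
            "SELECT step, status FROM step_status WHERE task_id = ?", (task_id,)
        )
    )
    for step in PIPELINE:
        if rows[step] == "failed":
            return None  # stop the flow on failure
        if rows[step] != "done":
            return step
    return None  # every step finished
```

A scheduled job would call next_step periodically for each open task and dispatch the returned step. The cost of this design is one row and one query pattern per step, which is part of what a more compact state encoding tries to avoid.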
State‑Machine Scheduling
The chosen approach models the workflow as a finite‑state machine. Each component occupies a two‑bit slot within a 64‑bit integer (UINT64): the low bit of the slot marks success and the high bit marks failure. A component with slot index n therefore owns bits 2n and 2n+1. Because every status change is a single bitwise update, the scheduler can atomically track the status of up to 32 components in one integer.
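The slot arithmetic can be sketched in a few lines of Python. This is a minimal illustration of the encoding described above, not Baidu's implementation; the function names are made up for the example, and the "pending" label for a slot with neither bit set is an assumption.

```python
UINT64_MASK = (1 << 64) - 1
BITS_PER_SLOT = 2
MAX_SLOTS = 64 // BITS_PER_SLOT  # 32 components per UINT64


def _check(slot_index):
    if not 0 <= slot_index < MAX_SLOTS:
        raise ValueError(f"slot_index must be in [0, {MAX_SLOTS})")


def success_mask(slot_index):
    """Low bit of the slot: set when the component succeeded."""
    _check(slot_index)
    return 1 << (slot_index * BITS_PER_SLOT)


def fail_mask(slot_index):
    """High bit of the slot: set when the component failed."""
    _check(slot_index)
    return 1 << (slot_index * BITS_PER_SLOT + 1)


def mark_success(state, slot_index):
    return (state | success_mask(slot_index)) & UINT64_MASK


def mark_failure(state, slot_index):
    return (state | fail_mask(slot_index)) & UINT64_MASK


def slot_status(state, slot_index):
    """Decode one component's status from the packed state."""
    if state & fail_mask(slot_index):
        return "failed"
    if state & success_mask(slot_index):
        return "succeeded"
    return "pending"  # assumed label for "neither bit set"
```

For slot index 2 this yields a success mask of 16 and a failure mask of 32, matching bits 4 and 5 of the integer. If the state lives in a relational database, a single `UPDATE ... SET state = state | mask` statement is one plausible way to make the update atomic, though the source does not say how Baidu persists it.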
Module and Component Design
The workflow is divided into four logical modules:
Text Processing – extracts the main topic, determines tone, generates subtitles and voice‑over scripts.
Material Handling – searches, trims, and cleans video/image assets.
Audio Generation – synthesizes speech from the script.
Video Synthesis – merges all elements, adds watermarks, effects, and background music.
Each module contains reusable components. In the sample configuration below, for example, the TextProcessor component listed under ScriptAssign depends on TextUnderstanding and WidgetInit; its slot index is 2, so it owns bits 4‑5 of the UINT64 (success mask 16, failure mask 32).
{
  "module_name": "ScriptAssign",
  "status": "init",
  "next_status": "generating",
  "components": [
    {
      "component_name": "TextProcessor",
      "slot_index": 2,
      "slot_num_success": 16,
      "slot_num_fail": 32,
      "depends": ["TextUnderstanding", "WidgetInit"]
    },
    {
      "component_name": "FootageGenerator",
      "slot_index": 3,
      "slot_num_success": 64,
      "slot_num_fail": 128,
      "depends": ["TextUnderstanding", "WidgetInit", "TextProcessor"]
    }
    // ... other components
  ]
}
Configuration File Format
The workflow description file lists components in execution order, each specifying its slot index, success/failure bit masks, and dependency list. Modules are executed sequentially, while components within a module may run in parallel if their dependencies are satisfied.
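To make the format concrete, here is a hedged sketch that checks each entry's masks against its slot index and lists the components whose dependencies are satisfied for a given state. The field names follow the sample configuration; the entries for TextUnderstanding and WidgetInit (slots 0 and 1, no dependencies) are assumptions added so the example is self-contained, since the source elides them.

```python
# Component entries in the article's format. The first two entries are
# ASSUMED for illustration; only the last two appear in the source.
COMPONENTS = [
    {"component_name": "TextUnderstanding", "slot_index": 0,
     "slot_num_success": 1, "slot_num_fail": 2, "depends": []},
    {"component_name": "WidgetInit", "slot_index": 1,
     "slot_num_success": 4, "slot_num_fail": 8, "depends": []},
    {"component_name": "TextProcessor", "slot_index": 2,
     "slot_num_success": 16, "slot_num_fail": 32,
     "depends": ["TextUnderstanding", "WidgetInit"]},
    {"component_name": "FootageGenerator", "slot_index": 3,
     "slot_num_success": 64, "slot_num_fail": 128,
     "depends": ["TextUnderstanding", "WidgetInit", "TextProcessor"]},
]


def validate_masks(components):
    """Check that each mask matches the two-bit slot implied by slot_index."""
    for c in components:
        low_bit = 2 * c["slot_index"]
        if c["slot_num_success"] != 1 << low_bit:
            raise ValueError(f"bad success mask for {c['component_name']}")
        if c["slot_num_fail"] != 1 << (low_bit + 1):
            raise ValueError(f"bad failure mask for {c['component_name']}")


def ready_components(components, state):
    """Names of unfinished components whose dependencies all succeeded."""
    success_of = {c["component_name"]: c["slot_num_success"] for c in components}
    ready = []
    for c in components:
        finished = state & (c["slot_num_success"] | c["slot_num_fail"])
        deps_ok = all(state & success_of[d] for d in c["depends"])
        if not finished and deps_ok:
            ready.append(c["component_name"])
    return ready
```

With an empty state, the two dependency-free components are both ready and could run in parallel; once both success bits are set (state 5), TextProcessor becomes ready. Note that this sketch does not distinguish a component that is already running from one that has not started, so a real scheduler would need some additional in-flight marker.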
Process Scheduling Steps
Task Creation: After input validation, the task is persisted and a message is sent to the scheduling queue.
Executable Component Discovery: Workers pull messages, read the configuration file, compute the current UINT64 state, and enqueue any component whose dependencies are satisfied.
Asynchronous Callback: Most components are asynchronous microservices; callbacks update the task’s slot values and re‑enqueue the task for further processing.
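The three steps above can be simulated in a single process to show how the state evolves. This is only a sketch: an in-memory deque stands in for the message queue, a local variable stands in for the persisted task, each callback is simulated by a queue message, and the slots assigned to TextUnderstanding and WidgetInit are assumptions. The in_flight set is also an addition for the example, used to avoid dispatching a component twice.

```python
from collections import deque

# (name, slot_index, depends). Slots 0 and 1 are ASSUMED for illustration.
COMPONENTS = [
    ("TextUnderstanding", 0, []),
    ("WidgetInit", 1, []),
    ("TextProcessor", 2, ["TextUnderstanding", "WidgetInit"]),
    ("FootageGenerator", 3, ["TextUnderstanding", "WidgetInit", "TextProcessor"]),
]


def run_task(components, fail=frozenset()):
    """Simulate one task; components named in `fail` report failure."""
    slot_of = {name: slot for name, slot, _ in components}
    state = 0          # packed UINT64 state of the task
    in_flight = set()  # dispatched components awaiting a callback
    finished = []      # order in which callbacks arrived

    # Step 1: task creation puts a scheduling message on the queue.
    queue = deque([("schedule", None)])

    while queue:
        kind, name = queue.popleft()
        if kind == "schedule":
            # Step 2: discover components whose dependencies succeeded.
            for comp, slot, deps in components:
                done = state & (0b11 << (2 * slot))
                deps_ok = all(state & (1 << (2 * slot_of[d])) for d in deps)
                if not done and deps_ok and comp not in in_flight:
                    in_flight.add(comp)
                    queue.append(("callback", comp))  # simulated async call
        else:
            # Step 3: the callback sets the success (low) or failure (high)
            # bit of the component's slot, then re-enqueues the task.
            bit = 2 * slot_of[name] + (1 if name in fail else 0)
            state |= 1 << bit
            in_flight.discard(name)
            finished.append(name)
            queue.append(("schedule", None))
    return state, finished
```

When every component succeeds, the run ends with state 85 (binary 01010101, one success bit per slot). If WidgetInit is made to fail, the state stops at 9 and neither downstream component is ever dispatched, which is how a failure bit halts the flow in this sketch.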
Conclusion
Since its launch in May 2022, the collect‑edit AIGC video pipeline has powered five distinct production flows, handling tens of thousands of videos daily. The modular, slot‑based state management enables rapid feature expansion without touching the core scheduler. Ongoing work focuses on further stabilizing the system and evaluating mature workflow engines such as Cadence, Temporal, and Conductor for future upgrades.
