How Alibaba Built a Web‑Based Short Video Editor: Front‑End Insights
This article describes how a front‑end engineer at Alibaba approached building a web‑based short video editor. It covers the motivation, design principles, three‑layer architecture, script protocol, immutable data handling, audio‑video processing with WebCodecs and FFmpeg, the rendering pipeline, and the challenges of a browser implementation.
Attachment: D2 Front‑End Forum Video
Alibaba’s Xiaomi (阿里小蜜, known in English as AliMe) team needed a way to answer user questions quickly with video content, especially for older users who find text cumbersome. They envisioned a tool that could generate video drafts from structured knowledge, allowing users to make minor adjustments before publishing.
Background
Text‑based answers are often long and hard to scan, while video delivers higher information density and better engagement.
Why Build a Web‑Based Short Video Editor?
The team wanted a solution that leverages existing PPT skills, offers customizable templates, and provides intelligent script assistance, all implemented purely on the front end and rendered in the browser.
Design Goals
Leverage PPT experience for rapid onboarding.
Provide rich, customizable templates for various scenarios.
Offer smart script assistance.
Key editor capabilities include timeline editing, subtitle arrangement, animation stickers, transitions, filters, effects, TTS dubbing, and ASR speech recognition.
Editor Architecture
The editor is organized into three layers from top to bottom: Application layer, Engine layer, and Dependency layer.
Application layer: material library, template library, intelligent scripts, script models, etc.
Engine layer: resource manager, director engine, stage, renderer, and services.
Dependency layer: view, state management, animation/effects, audio‑video processing.
Editor Design Details
Within the editor, users upload assets, arrange them by drag and drop, configure and edit them, preview the result, and then render the final video. Under the hood, uploaded assets are described, loaded, parsed, and cached. Editing updates the script model; exporting parses the script protocol and renders the animation.
Script Protocol Design
The script protocol is a complex nested structure that defines how assets are sequenced and rendered.
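The article does not publish the protocol itself, so the following is only a hypothetical sketch of what a nested script structure of this kind could look like. Every type and field name here (`Script`, `Track`, `Clip`, `startMs`, `durationMs`, and so on) is an assumption made for illustration, not the team's actual schema.

```typescript
// Hypothetical script model: a script holds tracks, a track holds clips.
// None of these names come from the original editor.
type ClipKind = "video" | "audio" | "subtitle" | "sticker";

interface Clip {
  id: string;
  kind: ClipKind;
  assetUrl?: string;  // media clips point at an asset
  text?: string;      // subtitle clips carry text instead
  startMs: number;    // position on the timeline
  durationMs: number;
}

interface Track {
  id: string;
  clips: Clip[];
}

interface Script {
  width: number;
  height: number;
  fps: number;
  tracks: Track[];
}

// Total length of the video: the latest end time across all clips.
function scriptDurationMs(script: Script): number {
  let end = 0;
  for (const track of script.tracks) {
    for (const clip of track.clips) {
      end = Math.max(end, clip.startMs + clip.durationMs);
    }
  }
  return end;
}

const demoScript: Script = {
  width: 720,
  height: 1280,
  fps: 30,
  tracks: [
    { id: "main", clips: [{ id: "c1", kind: "video", assetUrl: "intro.mp4", startMs: 0, durationMs: 4000 }] },
    { id: "subs", clips: [{ id: "s1", kind: "subtitle", text: "Hello", startMs: 500, durationMs: 2000 }] },
  ],
};
```

Keeping timing on each clip rather than on the track lets a renderer work out what is on screen at any moment by scanning the clips, which is one plausible reason to nest the data this way.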
Immutable vs. Mutable Data Structures
For the editor, the team chose immutable snapshots managed with Immer: each edit produces a new version of the script model, and nodes that did not change are shared between versions, which keeps updates efficient and saves memory.
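Node reuse can be illustrated without any library. The sketch below hand-writes a copy-on-write update on a small tree; it is not Immer and not the editor's code, only a dependency-free illustration of the structural sharing that Immer automates. The `SceneNode` type and `renameNode` function are hypothetical.

```typescript
// Hand-written copy-on-write update illustrating structural sharing.
// A library such as Immer automates this; names here are hypothetical.
interface SceneNode {
  readonly id: string;
  readonly label: string;
  readonly children: readonly SceneNode[];
}

// Returns a tree in which the node `id` has a new label.
// Subtrees that do not contain `id` are returned as the same object.
function renameNode(node: SceneNode, id: string, label: string): SceneNode {
  if (node.id === id) {
    return node.label === label ? node : { ...node, label };
  }
  let changed = false;
  const children = node.children.map((child) => {
    const next = renameNode(child, id, label);
    if (next !== child) changed = true;
    return next;
  });
  // Only allocate a new parent when one of its children actually changed.
  return changed ? { ...node, children } : node;
}

const before: SceneNode = {
  id: "root",
  label: "Script",
  children: [
    { id: "t1", label: "Video track", children: [{ id: "c1", label: "Intro", children: [] }] },
    { id: "t2", label: "Audio track", children: [] },
  ],
};

const after = renameNode(before, "c1", "Opening");
// `after.children[1]` is the very same object as `before.children[1]`:
// the untouched audio track is shared between both snapshots.
```

Only the nodes on the path from the root to the edited clip are copied, so a snapshot costs memory proportional to the depth of the change rather than the size of the whole script.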
Implementation Workflow
The renderer receives AudioStream and VideoStream from the editor. A resource manager loads and caches assets, and the stage, built on Konva and Canvas, enables drag‑and‑drop editing. The director engine sends the script model to the resource manager for pre‑loading and drives the stage during playback.
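The load-and-cache role of the resource manager can be sketched as a small promise cache. This is a minimal sketch under assumptions: the `ResourceCache` class, its `load` and `preload` methods, and the injected `loader` function are all hypothetical, and the real resource manager also describes and parses assets, which is omitted here.

```typescript
// Hypothetical resource cache: each URL is loaded at most once, and
// concurrent requests for the same URL share one in-flight promise.
type Loader<T> = (url: string) => Promise<T>;

class ResourceCache<T> {
  private readonly entries = new Map<string, Promise<T>>();
  private readonly loader: Loader<T>;

  constructor(loader: Loader<T>) {
    this.loader = loader;
  }

  load(url: string): Promise<T> {
    let entry = this.entries.get(url);
    if (!entry) {
      entry = this.loader(url).catch((err) => {
        // Drop failed loads so a later call can retry.
        this.entries.delete(url);
        throw err;
      });
      this.entries.set(url, entry);
    }
    return entry;
  }

  // Pre-loading is just loading everything up front and waiting for all of it.
  preload(urls: string[]): Promise<T[]> {
    return Promise.all(urls.map((url) => this.load(url)));
  }
}
```

In a browser the loader might wrap `fetch` and return an `ArrayBuffer`; passing it in as a parameter keeps the cache itself independent of how assets are fetched.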
Director Engine
The director engine reads the script, sorts elements, queues them for pre‑loading, and triggers playback based on a high‑precision internal clock, using the Web Audio API for accurate audio timing.
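The sort, queue, and trigger loop described above can be sketched as follows. This is an illustrative sketch, not the team's engine: the `Cue` type, the `Director` class, and the `tick` method are hypothetical, and pre-loading is left out. The clock is injected as a function so that the same code can run against any time source.

```typescript
// Hypothetical director: sorts script elements by start time and, on each
// tick, starts every element whose start time the clock has reached.
interface Cue {
  id: string;
  startSec: number;
}

class Director {
  private readonly queue: Cue[];
  private next = 0;
  private readonly now: () => number;
  private readonly onStart: (cue: Cue) => void;

  constructor(cues: Cue[], now: () => number, onStart: (cue: Cue) => void) {
    // Copy before sorting so the caller's array is left untouched.
    this.queue = [...cues].sort((a, b) => a.startSec - b.startSec);
    this.now = now;
    this.onStart = onStart;
  }

  // Call once per animation frame (or timer tick).
  tick(): void {
    const t = this.now();
    while (this.next < this.queue.length && this.queue[this.next].startSec <= t) {
      this.onStart(this.queue[this.next]);
      this.next++;
    }
  }
}
```

In a browser, `now` could be `() => audioContext.currentTime`, where `audioContext` is a Web Audio `AudioContext`; that property reports time in seconds on the audio hardware clock, which is what makes it suitable as the master clock.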
Audio‑Video Processing and Rendering
Audio/video processing involves three steps: demuxing, decoding, and rendering with synchronization.
1. Demuxing
Extract the compressed streams (AAC/MP3/AC‑3 for audio, H.264/H.265/MPEG‑2 for video) from their containers.
2. Decoding
Decode video to pixel data (YUV/RGB) and audio to PCM.
3. Rendering & Sync
Render video frames to WebGL textures on Canvas and play PCM via AudioWorklet, synchronizing playback using audio timestamps.
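Synchronizing to audio timestamps usually means treating the audio clock as the master and choosing video frames to match it. The function below is a minimal sketch of that idea; the `DecodedFrame` type and `takeFrameForAudioTime` are hypothetical names, and the queue is assumed to be sorted by presentation time.

```typescript
// Audio-master sync: the audio clock decides which video frame is shown.
interface DecodedFrame {
  timestampSec: number; // presentation time of the frame
}

// Removes and returns the frame to display at `audioTimeSec`, discarding
// any older frames that are already late. Returns undefined when the next
// frame is still in the future (keep showing the current picture).
function takeFrameForAudioTime<F extends DecodedFrame>(
  queue: F[],
  audioTimeSec: number,
): F | undefined {
  let due: F | undefined;
  while (queue.length > 0 && queue[0].timestampSec <= audioTimeSec) {
    due = queue.shift();
  }
  return due;
}
```

If video decoding falls behind, this drops late frames instead of delaying the sound, which tends to be less noticeable than audio glitches. With WebCodecs `VideoFrame` objects, any frame skipped this way would also need its `close()` method called to release memory; that step is omitted here.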
The traditional implementation uses FFmpeg compiled to WebAssembly, but this incurs high memory and load costs.
The WebCodecs API provides native decoding and encoding in the browser but does not include demuxing, which can be handled by MP4Box.js, mux.js, or FFmpeg.
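Because WebCodecs is not available in every browser, an editor would typically check for it at startup and fall back to the FFmpeg build otherwise. The sketch below shows one way to make that choice; `DecodePath` and `chooseDecodePath` are hypothetical names, and only the presence of the standard `VideoDecoder` and `AudioDecoder` globals is checked, not support for a particular codec.

```typescript
type DecodePath = "webcodecs" | "ffmpeg-wasm";

// `scope` is normally globalThis; it is a parameter so the function can be
// exercised outside a browser.
function chooseDecodePath(scope: object): DecodePath {
  const hasWebCodecs = "VideoDecoder" in scope && "AudioDecoder" in scope;
  return hasWebCodecs ? "webcodecs" : "ffmpeg-wasm";
}
```

A call such as `chooseDecodePath(globalThis)` would then decide whether the large WebAssembly binary needs to be downloaded at all.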
Rendering Output
Standard rendering captures video frames from Canvas and combines them with audio streams via MediaRecorder to produce WebM, then converts to MP4. Using WebCodecs, the MediaRecorder step can be bypassed, allowing direct encoding to the desired format.
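One consequence of encoding directly is that frames can be produced one at a time rather than recorded live. The sketch below only plans such an export: it lists the timestamp and key-frame flag for each output frame. `FramePlan` and `planFrames` are hypothetical names, the key-frame interval is an arbitrary parameter, and the drawing, encoding, and muxing steps are not shown.

```typescript
// Plan for a frame-by-frame export: one entry per output frame.
interface FramePlan {
  index: number;
  timestampUs: number; // WebCodecs timestamps are expressed in microseconds
  keyFrame: boolean;
}

function planFrames(durationSec: number, fps: number, keyFrameEvery: number): FramePlan[] {
  const count = Math.round(durationSec * fps);
  const plan: FramePlan[] = [];
  for (let i = 0; i < count; i++) {
    plan.push({
      index: i,
      timestampUs: Math.round((i * 1e6) / fps),
      keyFrame: i % keyFrameEvery === 0,
    });
  }
  return plan;
}
```

For each entry, an exporter would draw the stage at that time, wrap the canvas in `new VideoFrame(canvas, { timestamp })`, and pass it to `VideoEncoder.encode(frame, { keyFrame })`. The encoded chunks still have to be written into a container by a muxing library.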
Browser Implementation Challenges
While browsers already provide WebGL for animation, WebCodecs for audio‑video processing, and WebAssembly for running libraries such as FFmpeg, several challenges remain for a professional editor:
1. File Handling
Large video files can exceed IndexedDB storage limits; the File System Access API is a possible future solution.
2. Long‑Video Rendering
MediaRecorder records in real time, so exporting takes at least as long as the video itself; splitting the video into segments or rendering asynchronously on a server can reduce the wait.
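The splitting step can be sketched as a function that cuts a timeline into fixed-length segments. `Segment` and `splitIntoSegments` are hypothetical names, and the segment length is an arbitrary parameter; how the segments are then rendered and joined is beyond what the article describes.

```typescript
interface Segment {
  index: number;
  startSec: number;
  endSec: number;
}

// Cuts [0, totalSec) into consecutive segments of at most maxSegmentSec.
function splitIntoSegments(totalSec: number, maxSegmentSec: number): Segment[] {
  if (totalSec <= 0 || maxSegmentSec <= 0) return [];
  const count = Math.ceil(totalSec / maxSegmentSec);
  const segments: Segment[] = [];
  for (let i = 0; i < count; i++) {
    segments.push({
      index: i,
      startSec: i * maxSegmentSec,
      endSec: Math.min((i + 1) * maxSegmentSec, totalSec),
    });
  }
  return segments;
}
```

For example, a 95-second video with a 30-second limit yields four segments, the last one covering seconds 90 to 95.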
3. Format Support
Additional demuxer/muxer libraries are needed; in the absence of native APIs, FFmpeg compiled to WebAssembly remains a viable approach.
Overall, a web‑based short video editor is feasible and offers ample opportunities for future optimization.
