How Alibaba Built a Web‑Based Short Video Editor: Front‑End Insights

This article describes how front‑end engineers at Alibaba built a web‑based short video editor. It covers the motivation, design goals, three‑layer architecture, script protocol, immutable data handling, audio‑video processing with WebCodecs and FFmpeg, the rendering pipeline, and the remaining challenges of a browser implementation.

Alibaba Terminal Technology

Attachment: D2 Front‑End Forum Video

Alibaba’s Xiaomi team needed a way to answer user questions quickly with video content, especially for older users who find text cumbersome. They envisioned a tool that could generate video drafts from structured knowledge, allowing users to make minor adjustments before publishing.

Background

Text‑based answers are often long and hard to scan, while video delivers higher information density and better engagement.

Why Build a Web‑Based Short Video Editor?

The team wanted a solution that leverages existing PPT skills, offers customizable templates, and provides intelligent script assistance, all implemented purely on the front end and rendered in the browser.

Design Goals

Leverage PPT experience for rapid onboarding.

Provide rich, customizable templates for various scenarios.

Offer smart script assistance.

Key editor capabilities include timeline editing, subtitle arrangement, animation stickers, transitions, filters, effects, TTS dubbing, and ASR speech recognition.

Editor Architecture

The editor is organized into three layers from top to bottom: Application layer, Engine layer, and Dependency layer.

Application layer: material library, template library, intelligent scripts, script models, etc.

Engine layer: resource manager, director engine, stage, renderer, and services.

Dependency layer: view, state management, animation/effects, audio‑video processing.

Editor Design Details

Within the editor, users upload assets, arrange them by drag‑and‑drop, configure and edit them, preview the result, and then render the final video. Under the hood, uploaded assets are described, loaded, parsed, and cached. Each edit updates the script model; on export, the script protocol is parsed and the animation is rendered.

Script Protocol Design

The script protocol is a complex nested structure that defines how assets are sequenced and rendered.
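The article does not publish the protocol itself, so the following is only a minimal sketch of what such a nested structure could look like. Every name here (`VideoScript`, `ScriptTrack`, `ScriptClip`, `scriptDuration`) is hypothetical and chosen for illustration.

```typescript
// Hypothetical script model: a script holds tracks, a track holds clips.
// None of these names come from the original protocol.
type ClipKind = "video" | "audio" | "image" | "subtitle";

interface ScriptClip {
  id: string;
  kind: ClipKind;
  src?: string;      // asset URL for media clips
  text?: string;     // content for subtitle clips
  start: number;     // position on the timeline, in seconds
  duration: number;  // length on the timeline, in seconds
}

interface ScriptTrack {
  id: string;
  clips: ScriptClip[];
}

interface VideoScript {
  width: number;
  height: number;
  fps: number;
  tracks: ScriptTrack[];
}

// The total length is the latest end time across every clip on every track.
function scriptDuration(script: VideoScript): number {
  let end = 0;
  for (const track of script.tracks) {
    for (const clip of track.clips) {
      end = Math.max(end, clip.start + clip.duration);
    }
  }
  return end;
}

const demoScript: VideoScript = {
  width: 720,
  height: 1280,
  fps: 30,
  tracks: [
    {
      id: "main",
      clips: [
        { id: "v1", kind: "video", src: "intro.mp4", start: 0, duration: 5 },
        { id: "v2", kind: "image", src: "slide.png", start: 5, duration: 3 },
      ],
    },
    {
      id: "subtitles",
      clips: [{ id: "s1", kind: "subtitle", text: "Hello", start: 1, duration: 4 }],
    },
  ],
};
```

Keeping timing on each clip rather than on the track lets the same structure drive both the timeline view and the export step.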

Immutable vs. Mutable Data Structures

The editor uses immutable snapshots managed with Immer. Each update produces a new state tree while unchanged nodes are reused, which keeps updates efficient and saves memory.
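To illustrate the node reuse mentioned above, here is a hand-written sketch of structural sharing. It deliberately does not use Immer, so that the mechanism is visible; Immer automates the same idea. The state shape and the `moveClip` helper are assumptions, not the editor's real model.

```typescript
// Hypothetical editor state; readonly fields discourage in-place mutation.
interface EditorClip {
  readonly id: string;
  readonly start: number;
  readonly duration: number;
}

interface EditorTrack {
  readonly id: string;
  readonly clips: readonly EditorClip[];
}

interface EditorState {
  readonly tracks: readonly EditorTrack[];
}

// Returns a new snapshot in which one clip has moved.
// Tracks and clips that did not change are reused by reference.
function moveClip(state: EditorState, clipId: string, newStart: number): EditorState {
  let changed = false;
  const tracks = state.tracks.map((track) => {
    if (!track.clips.some((clip) => clip.id === clipId)) {
      return track; // untouched track: same object as before
    }
    changed = true;
    return {
      ...track,
      clips: track.clips.map((clip) =>
        clip.id === clipId ? { ...clip, start: newStart } : clip
      ),
    };
  });
  // If nothing matched, hand back the original snapshot unchanged.
  return changed ? { ...state, tracks } : state;
}
```

Because old snapshots stay valid, an undo history can be kept as a plain array of states, and a view can skip re-rendering any track whose reference has not changed.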

Implementation Workflow

The renderer receives AudioStream and VideoStream from the editor. A resource manager loads and caches assets; the stage built on Konva and Canvas enables drag‑and‑drop editing. The director engine sends the script model to the resource manager for pre‑loading and drives the stage during playback.

Director Engine

The director engine reads the script, sorts elements, queues them for pre‑loading, and triggers playback based on a high‑precision internal clock, using the Web Audio API for accurate audio timing.
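The sort, pre-load, and trigger loop described above might look roughly like the sketch below. `DirectorSketch`, `CueItem`, and the one-second pre-load window are all assumptions; the clock is injected as a function so that any time source can be used.

```typescript
// Hypothetical timeline entry; times are in seconds.
interface CueItem {
  id: string;
  start: number;
  duration: number;
}

class DirectorSketch {
  private readonly queue: CueItem[];
  private readonly now: () => number;
  private readonly preloadLead: number;
  private nextIndex = 0;

  constructor(items: CueItem[], now: () => number, preloadLead = 1) {
    // Sort a copy by start time so the caller's array is left untouched.
    this.queue = [...items].sort((a, b) => a.start - b.start);
    this.now = now;
    this.preloadLead = preloadLead;
  }

  // Items that have not started yet but begin within the pre-load window.
  toPreload(): CueItem[] {
    const t = this.now();
    return this.queue.filter(
      (item) => item.start > t && item.start <= t + this.preloadLead
    );
  }

  // Call once per animation frame: returns items whose start time has been
  // reached since the previous call, each reported exactly once.
  tick(): CueItem[] {
    const t = this.now();
    const started: CueItem[] = [];
    let item = this.queue[this.nextIndex];
    while (item !== undefined && item.start <= t) {
      started.push(item);
      this.nextIndex += 1;
      item = this.queue[this.nextIndex];
    }
    return started;
  }
}
```

In a browser, the clock argument could be `() => audioContext.currentTime`, where `audioContext` is a Web Audio `AudioContext`; that matches the article's point about using the audio clock for timing, though the original engine's actual wiring is not shown.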

Audio Processing and Rendering

Audio/video processing involves three steps: demuxing, decoding, and rendering with synchronization.

1. Demuxing

Extract compressed streams (AAC/MP3/AC‑3 for audio, H.264/H.265/MPEG2 for video) from containers.

2. Decoding

Decode video to pixel data (YUV/RGB) and audio to PCM.

3. Rendering & Sync

Render video frames to WebGL textures on Canvas and play PCM via AudioWorklet, synchronizing playback using audio timestamps.
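The synchronization step can be reduced to one question: given the current audio time, which decoded frame should be on screen? A minimal sketch, assuming frame timestamps are sorted in ascending order and expressed in seconds (the function name is hypothetical):

```typescript
// Returns the index of the latest frame whose timestamp is at or before
// the audio clock, or -1 if playback has not reached the first frame yet.
// Assumes `timestamps` is sorted in ascending order.
function frameIndexForAudioTime(timestamps: number[], audioTime: number): number {
  let lo = 0;
  let hi = timestamps.length - 1;
  let best = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const ts = timestamps[mid];
    if (ts !== undefined && ts <= audioTime) {
      best = mid;   // candidate found; look for a later one
      lo = mid + 1;
    } else {
      hi = mid - 1; // this frame is still in the future
    }
  }
  return best;
}
```

Treating audio as the master clock means video frames are dropped or held to match the sound, which is usually less noticeable than the audio glitches that result from adjusting audio to match video.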

The traditional implementation uses FFmpeg compiled to WebAssembly, but this incurs high memory usage and a long load time.

The WebCodecs API provides native decoding and encoding in the browser, but it does not include demuxing, which can be handled by MP4Box.js, mux.js, or FFmpeg.
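Because demuxing and decoding come from different libraries, some glue is needed between them. The sketch below converts a demuxed sample into the fields a WebCodecs `EncodedVideoChunk` expects (`type`, `timestamp`, `duration`, `data`, with times in microseconds). The `DemuxedSample` shape is an assumption modeled loosely on what container demuxers emit; real field names vary by library.

```typescript
// Hypothetical demuxer output; real libraries use their own field names.
interface DemuxedSample {
  isSync: boolean;    // true for keyframes
  cts: number;        // composition time, in timescale units
  duration: number;   // in timescale units
  timescale: number;  // units per second
  data: Uint8Array;   // compressed frame bytes
}

// Plain object carrying the fields used to construct an EncodedVideoChunk.
interface ChunkInitSketch {
  type: "key" | "delta";
  timestamp: number;  // microseconds
  duration: number;   // microseconds
  data: Uint8Array;
}

function toChunkInit(sample: DemuxedSample): ChunkInitSketch {
  const toMicros = (units: number): number =>
    Math.round((units * 1e6) / sample.timescale);
  return {
    type: sample.isSync ? "key" : "delta",
    timestamp: toMicros(sample.cts),
    duration: toMicros(sample.duration),
    data: sample.data,
  };
}
```

In a browser that supports WebCodecs, the result could be passed to `new EncodedVideoChunk(...)` and then to a configured `VideoDecoder` via `decode()`. The decoder also needs codec configuration from the container, which this sketch leaves out.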

Rendering Output

Standard rendering captures video frames from Canvas and combines them with audio streams via MediaRecorder to produce WebM, then converts to MP4. Using WebCodecs, the MediaRecorder step can be bypassed, allowing direct encoding to the desired format.
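One practical consequence of bypassing MediaRecorder is that export no longer has to follow the wall clock: frames can be rendered and encoded one by one, as fast as the machine allows. A minimal sketch of such a loop follows; `FrameSink`, `exportFrames`, and the keyframe interval are hypothetical, and the encoder is abstracted behind an interface so the loop itself stays independent of any browser API.

```typescript
// Hypothetical abstraction over an encoder: receives one frame at a time.
interface FrameSink {
  encode(timestampUs: number, keyFrame: boolean): void;
}

// Steps through the timeline at a fixed frame rate, asks the caller to draw
// each frame, then hands it to the sink. Returns the number of frames emitted.
function exportFrames(
  durationSec: number,
  fps: number,
  keyFrameEvery: number,
  renderAt: (timeSec: number) => void,
  sink: FrameSink
): number {
  const frameCount = Math.ceil(durationSec * fps);
  for (let i = 0; i < frameCount; i++) {
    const timeSec = i / fps;
    renderAt(timeSec); // draw the stage as it looks at this instant
    sink.encode(Math.round(timeSec * 1e6), i % keyFrameEvery === 0);
  }
  return frameCount;
}
```

In a browser, a sink could wrap a WebCodecs `VideoEncoder`: build a `VideoFrame` from the canvas with the given timestamp, call `encode(frame, { keyFrame })`, and close the frame afterwards. The encoded chunks would still need to be written into a container by a muxer, which is assumed here and not shown.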

Browser Implementation Challenges

Browsers already support animation (WebGL), audio‑video processing (WebCodecs natively, FFmpeg via WebAssembly), and WebAssembly itself, but several challenges remain for a professional editor:

1. File Handling

Large video files can exceed the storage limits of IndexedDB; the File System Access API may offer a solution in the future.

2. Long‑Video Rendering

MediaRecorder records in real time, so exporting a video takes at least as long as the video itself. Splitting the video into segments or rendering asynchronously on a server can reduce the wait.

3. Format Support

Additional demuxer/muxer libraries are needed; in the absence of native APIs, FFmpeg compiled to WebAssembly remains a viable approach.
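For the long‑video case, the splitting idea can be sketched as a helper that cuts a timeline into fixed-length segments, each of which could be rendered separately and joined afterwards. The names and the segment length are assumptions; the article does not describe how the team splits videos.

```typescript
// Hypothetical description of one independently rendered piece of the video.
interface RenderSegment {
  index: number;
  start: number; // seconds, inclusive
  end: number;   // seconds, exclusive
}

// Cuts [0, totalSec) into consecutive segments of at most segmentSec each.
function splitIntoSegments(totalSec: number, segmentSec: number): RenderSegment[] {
  if (totalSec <= 0 || segmentSec <= 0) {
    return [];
  }
  const segments: RenderSegment[] = [];
  let index = 0;
  for (let start = 0; start < totalSec; start += segmentSec) {
    segments.push({
      index,
      start,
      end: Math.min(start + segmentSec, totalSec), // last segment may be shorter
    });
    index += 1;
  }
  return segments;
}
```

In practice, segment boundaries would likely need to line up with keyframes and scene changes so that the joined output has no visible seams.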

Overall, a web‑based short video editor is feasible and offers ample opportunities for future optimization.
