Text-Based Audio Editing in Cloud Editing: Architecture, Features, and Performance Optimizations
This article walks through the architecture of a cloud-based audio editing tool, focusing on text-based editing powered by ASR, a hierarchical DOM of Word, Sentence, and Paragraph components, the performance challenges of rendering tens of thousands of character nodes, and optimizations such as viewport-based rendering and efficient drag-select that deliver large speed gains on long recordings.
Creator's Maslow Hierarchy of Needs
Anecdotally, creators' needs for audio editing tools can be roughly divided into the following stages.
1. Basic Editing Needs
Recording, local upload, waveform drawing, multi‑track, audio cutting, drag‑and‑drop, playback preview, synthesis and export. These are essential capabilities for any audio editing tool, enabling anything from a single podcast episode to an entire audiobook series.
2. Efficiency‑Boosting Needs
Audio tagging, multi‑selection, reverb, audio tuning, cloud collaboration, etc. These features help heavy‑duty editors quickly locate highlights, batch‑select tracks, and benefit from cloud‑based synchronization.
3. Large‑Model Era Needs
TTS, AI‑generated music, AI packaging, text‑based editing, one‑click production. Model‑driven capabilities such as TTS, NLP‑generated titles and covers, and ASR‑based visual editing expand what an editor can do.
4. Ultimate Need
"I have no audio material, but I want to generate a podcast episode." This is the deadline-driven, last-minute scenario taken to its extreme.
Cloud Editing (云剪辑)
As the flagship audio editing tool of the Xima creator ecosystem, Cloud Editing aims to provide creators with a one-stop solution. With the large-model era underway, this article showcases its intelligent capabilities and shares insights into how they are designed and implemented.
Text Editing Use Cases
Audio editing often suffers from the invisibility of sound compared to text or video. Users repeatedly cut, listen, and recut, especially with long or low‑quality recordings.
Potential visual solutions include:
Click to locate a specific word or character in the audio.
Search audio fragments via text.
One‑click removal of filler words and breaths.
Drag‑select to delete mis‑spoken segments.
The answer is text‑based editing.
Feature Overview
Text editing includes per-character cutting, search, quick selection, filler-word detection and removal, breath detection and removal, and tagging. Usage is straightforward, so a walkthrough is omitted here.
Implementation
Model Capability
The prerequisite for text editing is ASR (Automatic Speech Recognition), which converts spoken content into text annotated with timestamps down to the character level.
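As a hedged sketch of what such an ASR result might look like, the interfaces below model a transcript with character-level timestamps. The `AsrWord`/`AsrSentence`/`AsrResult` names and fields are assumptions for illustration, not the actual Xima payload:

```typescript
// Hypothetical shape of an ASR result with character-level timestamps.
// Field names are assumptions for illustration, not the actual Xima payload.
interface AsrWord {
  text: string;      // a single character (or word, for Latin scripts)
  startTime: number; // ms from the start of the audio
  endTime: number;   // ms
}

interface AsrSentence {
  words: AsrWord[];
  startTime: number; // ms
  endTime: number;   // ms
}

interface AsrResult {
  sentences: AsrSentence[];
  duration: number; // total audio duration, in ms
}

// With timestamps attached, the text layer can resolve a playback time to
// the character under it (e.g. to highlight the word being spoken):
function wordAtTime(result: AsrResult, timeMs: number): AsrWord | undefined {
  for (const sentence of result.sentences) {
    if (timeMs < sentence.startTime || timeMs >= sentence.endTime) continue;
    return sentence.words.find(
      (w) => timeMs >= w.startTime && timeMs < w.endTime
    );
  }
  return undefined;
}
```

The reverse direction (clicking a character to seek the player) falls out of the same structure: each character already carries its own `startTime`.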
Engineering
With ASR in place, the next step is engineering a parallel text editing module that synchronizes with the audio module. This requires a text layer responsible for:
Rendering the transcript for full visibility during editing.
Clickable and highlightable text for rapid cursor positioning.
Searchable and highlightable text.
Drag‑selection with a delete dialog for one‑click removal of corresponding audio.
Sentence highlighting to indicate current playback.
Paragraph‑level quick selection for batch deletions.
Filler‑word detection based on a user‑provided list.
Breath detection with markers for later removal.
Tag recognition and insertion.
Implementation details involve wrapping each character in a Word component that stores start/end timestamps, using the Selection API to map indices to ASR timestamps, and handling edge cases such as cross‑segment selections.
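To illustrate the index-to-timestamp mapping, here is a minimal sketch. In the browser the start and end indices would come from the Selection API (`window.getSelection()`); they are passed in directly here so the mapping logic stays testable, and the `selectionToTimeRange` name is a hypothetical, not the production helper:

```typescript
// Minimal sketch of mapping a text selection back to an audio time range.
// In the browser, startIndex/endIndex would come from the Selection API;
// they are passed in directly here to keep the logic testable.
interface TimedChar {
  text: string;
  startTime: number; // ms
  endTime: number;   // ms
}

// `chars` is the flattened transcript: the index in this array equals the
// character's index in the rendered text, even across sentence boundaries,
// which is what makes cross-segment selections uniform to handle.
function selectionToTimeRange(
  chars: TimedChar[],
  startIndex: number,
  endIndex: number // exclusive
): { start: number; end: number } | null {
  if (startIndex >= endIndex || startIndex < 0 || endIndex > chars.length) {
    return null; // collapsed or out-of-range selection
  }
  return {
    start: chars[startIndex].startTime,
    end: chars[endIndex - 1].endTime,
  };
}
```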
Architecture Design
The minimal DOM unit is the character. The hierarchy is:
Word layer: each character is a Word component carrying timestamps and state.
Sentence layer: multiple Word components form a Sentence, which can be highlighted and hosts filler-word and search-term detection.
Paragraph layer: Sentences compose a Paragraph, which is also where performance optimizations are applied.
Paragraph components use IntersectionObserver to detect viewport visibility. When out of view, they downgrade inner sentences and words to plain text, preserving styling while reducing DOM overhead.
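The downgrade mechanism described above can be sketched as follows. This is a framework-agnostic illustration under assumed names (`ParagraphView`, `nextRenderMode`), not the real component code:

```typescript
// Sketch of the viewport downgrade: a paragraph renders full Word components
// only while visible; off-screen it collapses to plain text to cut DOM count.
type RenderMode = "full" | "plain";

// Pure decision helper (hysteresis/margins omitted for brevity).
function nextRenderMode(isIntersecting: boolean): RenderMode {
  return isIntersecting ? "full" : "plain";
}

class ParagraphView {
  mode: RenderMode = "plain";
  private observer?: { disconnect(): void };

  // `el` is the paragraph's DOM element; typed loosely so the sketch
  // also type-checks outside the browser.
  constructor(private el: unknown, private render: (mode: RenderMode) => void) {}

  // Browser-only: swap Word components <-> plain text as the paragraph
  // scrolls in and out of the viewport.
  mount(): void {
    const Observer = (globalThis as any).IntersectionObserver;
    const observer = new Observer(
      (entries: Array<{ isIntersecting: boolean }>) => {
        for (const entry of entries) {
          const next = nextRenderMode(entry.isIntersecting);
          if (next !== this.mode) {
            this.mode = next;
            this.render(next); // re-render in the new mode
          }
        }
      }
    );
    observer.observe(this.el);
    this.observer = observer;
  }

  unmount(): void {
    this.observer?.disconnect();
  }
}
```

Starting in `plain` mode means a long transcript mounts cheaply, and only the paragraphs the observer reports as visible pay the full component cost.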
Channel (Data Sync)
The audio and text panels communicate via a channel that shares methods (delete, locate) and data (cursor time, selection range). Actions on one side automatically reflect on the other, ensuring synchronized deletion, positioning, and highlighting.
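A minimal sketch of such a channel is a typed event bus that both panels hold a reference to. The event names and payload shapes below are assumptions for illustration, not the production protocol:

```typescript
// Minimal sketch of the audio<->text channel: a typed event bus both panels
// subscribe to. Event names and payload shapes are illustrative assumptions.
type ChannelEvents = {
  delete: { start: number; end: number }; // audio time range to remove, in ms
  locate: { time: number };               // cursor position, in ms
};

class Channel {
  private handlers = new Map<string, Array<(payload: unknown) => void>>();

  on<K extends keyof ChannelEvents>(
    event: K,
    fn: (payload: ChannelEvents[K]) => void
  ): void {
    const list = this.handlers.get(event) ?? [];
    list.push(fn as (payload: unknown) => void);
    this.handlers.set(event, list);
  }

  emit<K extends keyof ChannelEvents>(
    event: K,
    payload: ChannelEvents[K]
  ): void {
    this.handlers.get(event)?.forEach((fn) => fn(payload));
  }
}
```

Both panels share one `Channel` instance, so a delete issued from the text side is observed by the audio side, and a seek from the audio side moves the text cursor, without either panel knowing about the other's internals.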
Performance Issues
Using characters as the smallest unit creates a massive number of DOM nodes, leading to noticeable lag during click, drag‑select, or delete operations, especially on long recordings.
Optimization Solutions
1. Reduce DOM Count : Paragraphs outside the viewport render as plain text, keeping only essential highlighting logic.
2. Optimize Drag‑Select Detection : Instead of per‑character event binding, the system now captures the start and end characters on mouse‑up, computes the time range, and updates the store once, handling cross‑segment selections efficiently.
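The mouse-up computation can be sketched as a single pure step. The `CharNode` and `resolveDragSelection` names are illustrative assumptions; the point is that only two characters are inspected, regardless of how many thousands lie between them:

```typescript
// Sketch of the optimized drag-select: no per-character event listeners.
// On mouseup we read only the first and last selected characters, derive the
// time range once, and commit a single store update.
interface CharNode {
  index: number;     // position in the flattened transcript
  startTime: number; // ms
  endTime: number;   // ms
}

interface SelectionRange {
  startIndex: number;
  endIndex: number;
  startTime: number; // ms
  endTime: number;   // ms
}

function resolveDragSelection(anchor: CharNode, focus: CharNode): SelectionRange {
  // Users may drag backwards (focus before anchor), possibly across sentence
  // or paragraph boundaries; normalize the order before computing the range.
  const [first, last] =
    anchor.index <= focus.index ? [anchor, focus] : [focus, anchor];
  return {
    startIndex: first.index,
    endIndex: last.index,
    startTime: first.startTime,
    endTime: last.endTime,
  };
}
```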
Optimization Results
For a 3‑hour audio (~45,000 characters), the performance improvements are significant:
| Operation    | Old Version | New Version |
| ------------ | ----------- | ----------- |
| Render       | 2000 ms     | 600 ms      |
| Drag-Select  | 3000 ms     | 400 ms      |
| Delete       | 2000 ms     | 500 ms      |
| Text Locate  | 2000 ms     | 200 ms      |
| Audio Locate | 1000 ms     | 100 ms      |
| Undo/Redo    | 1500 ms     | 100 ms      |
Conclusion
Text editing is a key experiment in improving audio editing efficiency. With continued algorithmic and engineering efforts, we believe creators will eventually reach the highest level of Maslow’s hierarchy for audio production.
Ximalaya Technology Team
Official account of Ximalaya's technology team, sharing distilled technical experience and insights to grow together.