
Text-Based Audio Editing in Cloud Editing: Architecture, Features, and Performance Optimizations

The article discusses cloud-based audio editing tool architecture, focusing on text‑based editing enabled by ASR, hierarchical DOM (Word, Sentence, Paragraph), performance challenges with massive character nodes, and optimizations like viewport‑based rendering and efficient drag‑select, achieving large speed gains for long recordings.

Ximalaya Technology Team

Creator's Maslow Hierarchy of Needs

By our (admittedly incomplete) count, creators' needs for audio editing tools fall roughly into the following stages.

1. Basic Editing Needs

Recording, local upload, waveform drawing, multi‑track, audio cutting, drag‑and‑drop, playback preview, synthesis and export. These are essential capabilities for any audio editing tool, enabling anything from a single podcast episode to an entire audiobook series.

2. Efficiency‑Boosting Needs

Audio tagging, multi‑selection, reverb, audio tuning, cloud collaboration, etc. These features help heavy‑duty editors quickly locate highlights, batch‑select tracks, and benefit from cloud‑based synchronization.

3. Large‑Model Era Needs

TTS, AI‑generated music, AI packaging, text‑based editing, one‑click production. Model‑driven capabilities such as TTS, NLP‑generated titles and covers, and ASR‑based visual editing expand what an editor can do.

4. Ultimate Need

"I have no audio material, but I want to generate a podcast episode." This reflects a deadline‑driven, last‑minute scenario.

Cloud Editing (云剪辑)

As the flagship audio editing tool of the Xima creator ecosystem, Cloud Editing aims to give creators a one-stop solution. With the large-model era in full swing, this article showcases one of its intelligent capabilities, text-based editing, and shares the feature design and implementation details behind it.

Text Editing Use Cases

Compared with text or video, audio editing suffers from the invisibility of sound: users repeatedly cut, listen, and recut, especially with long or low-quality recordings.

What if the audio were visible as text? Users could then:

Click to locate a specific word or character in the audio.

Search audio fragments via text.

One‑click removal of filler words and breaths.

Drag‑select to delete mis‑spoken segments.

The answer is text‑based editing.

Feature Overview

Text editing includes per‑character cutting, search, quick selection, filler‑word detection/removal, breath detection/removal, and tagging. The usage is straightforward and is omitted here.

Implementation

Model Capability

The prerequisite for text editing is ASR (Automatic Speech Recognition), which converts spoken content into time-stamped text: each recognized character carries start and end timestamps and is grouped into sentences and paragraphs.
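While the exact ASR payload isn't reproduced here, a transcript of this kind can be modeled roughly as below. The field and type names are illustrative assumptions, not Cloud Editing's actual schema:

```typescript
// Hypothetical ASR result shape: characters with timestamps,
// grouped into sentences and paragraphs.
interface AsrWord {
  text: string;   // a single character (or token)
  start: number;  // start time in milliseconds
  end: number;    // end time in milliseconds
}

interface AsrSentence {
  words: AsrWord[];
  start: number;
  end: number;
}

interface AsrParagraph {
  sentences: AsrSentence[];
}

// Flatten a transcript into one character list for index-based lookup.
function flattenWords(paragraphs: AsrParagraph[]): AsrWord[] {
  return paragraphs.flatMap(p => p.sentences.flatMap(s => s.words));
}

const transcript: AsrParagraph[] = [{
  sentences: [{
    start: 0,
    end: 900,
    words: [
      { text: "大", start: 0, end: 300 },
      { text: "家", start: 300, end: 600 },
      { text: "好", start: 600, end: 900 },
    ],
  }],
}];
```

The flattened list is what the text layer indexes into when translating a click or selection back to an audio position.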

Engineering

With ASR in place, the next step is engineering a parallel text editing module that synchronizes with the audio module. This requires a text layer responsible for:

Rendering the transcript for full visibility during editing.

Clickable and highlightable text for rapid cursor positioning.

Searchable and highlightable text.

Drag‑selection with a delete dialog for one‑click removal of corresponding audio.

Sentence highlighting to indicate current playback.

Paragraph‑level quick selection for batch deletions.

Filler‑word detection based on a user‑provided list.

Breath detection with markers for later removal.

Tag recognition and insertion.

Implementation details involve wrapping each character in a Word component that stores start/end timestamps, using the Selection API to map indices to ASR timestamps, and handling edge cases such as cross‑segment selections.
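The core of that mapping, from a selection's character indices to an audio time range, can be sketched as a pure function. `TimedChar` and `selectionToTimeRange` are hypothetical names for illustration, not the actual implementation:

```typescript
// One entry per rendered character, carrying its ASR timestamps.
interface TimedChar {
  char: string;
  start: number; // ms
  end: number;   // ms
}

// Map a character-index selection (as resolved from the Selection API)
// to the audio time range to delete. Indices are [from, to) over the
// flattened character list, so cross-sentence and cross-paragraph
// selections work with no special casing.
function selectionToTimeRange(
  chars: TimedChar[],
  from: number,
  to: number,
): { start: number; end: number } | null {
  if (from >= to || from < 0 || to > chars.length) return null; // empty or out of bounds
  return { start: chars[from].start, end: chars[to - 1].end };
}
```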

Architecture Design

The minimal DOM unit is the character. The hierarchy is:

Word layer: Each character is a Word component holding its timestamps and state.

Sentence layer: Multiple Word components form a Sentence, which can be highlighted and can detect filler words or search terms.

Paragraph layer: Sentences compose a Paragraph, which also handles performance optimizations.
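The Sentence layer's filler-word detection can be sketched as a plain substring scan over a sentence's characters against a user-provided list; the names below are illustrative assumptions:

```typescript
interface Word {
  text: string;
  start: number; // ms
  end: number;   // ms
}

// Find filler-word occurrences (e.g. ["嗯", "那个"]) in a sentence.
// Returns [from, to) character-index spans, so each hit maps directly
// to an audio time range that can be removed in one click.
function findFillers(
  words: Word[],
  fillers: string[],
): Array<{ from: number; to: number }> {
  const hits: Array<{ from: number; to: number }> = [];
  const text = words.map(w => w.text).join("");
  for (const filler of fillers) {
    let i = text.indexOf(filler);
    while (i !== -1) {
      hits.push({ from: i, to: i + filler.length });
      i = text.indexOf(filler, i + filler.length);
    }
  }
  return hits;
}
```

Because each Word keeps its timestamps, a hit `{ from, to }` deletes the audio between `words[from].start` and `words[to - 1].end`.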

Paragraph components use IntersectionObserver to detect viewport visibility. When out of view, they downgrade inner sentences and words to plain text, preserving styling while reducing DOM overhead.
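A minimal sketch of this visibility downgrade, assuming a framework-agnostic render step that returns markup as a string (the real tool toggles component trees instead):

```typescript
// A Paragraph that renders rich per-character spans only while visible.
class ParagraphView {
  visible = false;

  constructor(private text: string) {}

  render(): string {
    // Out of view: plain text, a single text node.
    // In view: one span per character, clickable and highlightable.
    return this.visible
      ? this.text.split("").map(c => `<span class="word">${c}</span>`).join("")
      : this.text;
  }

  // Browser-only wiring: downgrade when scrolled out of the viewport.
  observe(el: Element): void {
    new IntersectionObserver(entries => {
      this.visible = entries[0].isIntersecting;
    }).observe(el);
  }
}
```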

Channel (Data Sync)

The audio and text panels communicate via a channel that shares methods (delete, locate) and data (cursor time, selection range). Actions on one side automatically reflect on the other, ensuring synchronized deletion, positioning, and highlighting.
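One way to sketch such a channel is a small typed pub/sub object; the event names and payload shapes here are assumptions for illustration, not the actual protocol:

```typescript
// Events shared between the audio and text panels.
type ChannelEvents = {
  delete: { start: number; end: number }; // audio time range, ms
  locate: { time: number };               // cursor position, ms
};

// Minimal pub/sub channel: either panel subscribes, either panel emits.
class Channel {
  private handlers = new Map<string, Array<(p: unknown) => void>>();

  on<K extends keyof ChannelEvents>(event: K, fn: (p: ChannelEvents[K]) => void): void {
    const list = this.handlers.get(event) ?? [];
    list.push(fn as (p: unknown) => void);
    this.handlers.set(event, list);
  }

  emit<K extends keyof ChannelEvents>(event: K, payload: ChannelEvents[K]): void {
    this.handlers.get(event)?.forEach(fn => fn(payload));
  }
}

// Usage: the text panel emits a delete; the audio panel trims its track.
const channel = new Channel();
channel.on("delete", ({ start, end }) => {
  // audio side: remove [start, end] from the timeline
});
channel.emit("delete", { start: 1200, end: 3400 });
```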

Performance Issues

Using characters as the smallest unit creates a massive number of DOM nodes, leading to noticeable lag during click, drag‑select, or delete operations, especially on long recordings.

Optimization Solutions

1. Reduce DOM Count : Paragraphs outside the viewport render as plain text, keeping only essential highlighting logic.

2. Optimize Drag‑Select Detection : Instead of per‑character event binding, the system now captures the start and end characters on mouse‑up, computes the time range, and updates the store once, handling cross‑segment selections efficiently.
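The mouse-up strategy can be sketched as follows, assuming the anchor and focus character indices have already been resolved from the DOM (e.g. via a data-index attribute); `setSelection` stands in for the single store update:

```typescript
interface WordInfo {
  start: number; // ms
  end: number;   // ms
}

// One mouse-up handler replaces per-character event listeners.
function onMouseUp(
  words: WordInfo[],
  anchorIndex: number, // where the drag started
  focusIndex: number,  // where the drag ended
  setSelection: (range: { start: number; end: number }) => void,
): void {
  // Drags can go right-to-left; normalize the endpoints.
  const from = Math.min(anchorIndex, focusIndex);
  const to = Math.max(anchorIndex, focusIndex);
  // A single store update, even when the drag crosses sentence or
  // paragraph boundaries: the time range covers everything in between.
  setSelection({ start: words[from].start, end: words[to].end });
}
```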

Optimization Results

For a 3‑hour audio (~45,000 characters), the performance improvements are significant:

| Operation | Old Version | New Version |
| --- | --- | --- |
| Render | 2000 ms | 600 ms |
| Drag-Select | 3000 ms | 400 ms |
| Delete | 2000 ms | 500 ms |
| Text Locate | 2000 ms | 200 ms |
| Audio Locate | 1000 ms | 100 ms |
| Undo/Redo | 1500 ms | 100 ms |

Conclusion

Text editing is a key experiment in improving audio editing efficiency. With continued algorithmic and engineering efforts, we believe creators will eventually reach the highest level of Maslow’s hierarchy for audio production.

Tags: performance optimization, frontend development, text editing, ASR, audio editing, cloud editing