
One-Click Deployment of Cutting-Edge Text-to-Video and Voice Interaction Models

This article introduces the state‑of‑the‑art Step‑Video‑T2V text‑to‑video model and the Step‑Audio‑Chat voice interaction model, outlines their technical specifications and benchmark results, and provides a detailed step‑by‑step guide for deploying both models with a single click using Alibaba Cloud's PAI Model Gallery.

Alibaba Cloud Big Data AI Platform

Feb 18, 2025


PAI Model Gallery Overview

PAI Model Gallery is a component of Alibaba Cloud's AI platform (PAI) that aggregates high-quality open-source pretrained models from the global AI community, including model families such as Qwen and DeepSeek, across the LLM, AIGC, CV, and NLP domains. It supports a zero-code, end-to-end workflow from training through deployment and inference, which simplifies AI development for developers and enterprises.

Access the gallery at https://x.sm.cn/GUZSiSc

Step-Video-T2V Model Introduction

Step-Video-T2V is a state-of-the-art (SoTA) text-to-video pretrained model released by StepFun. It has 30 billion parameters and can generate videos of up to 204 frames. It uses a deep-compression VAE that achieves 16×16 spatial and 8× temporal compression, and it applies Direct Preference Optimization (DPO) to improve visual quality. Performance is evaluated on the newly released Step-Video-T2V-Eval benchmark, which contains 128 Chinese prompts from real users covering 11 content categories (motion, scenery, animals, compositional concepts, surreal, people, 3D animation, cinematography, etc.). The reported results show better instruction following, motion smoothness, physical plausibility, and aesthetic quality than existing open-source video models.
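To make the compression figures concrete, the following sketch computes the approximate latent grid implied by a 16×16 spatial and 8× temporal ratio. The 544×992 resolution is an assumed example input, not a figure from this article, and the ceiling division is a simplification: the actual VAE may treat boundary frames differently.

```python
import math

SPATIAL_RATIO = 16   # per-axis spatial compression (16x16), as stated in the article
TEMPORAL_RATIO = 8   # temporal compression (8x), as stated in the article


def latent_shape(frames: int, height: int, width: int) -> tuple[int, int, int]:
    """Approximate latent grid (frames, height, width) after VAE compression."""
    return (
        math.ceil(frames / TEMPORAL_RATIO),
        math.ceil(height / SPATIAL_RATIO),
        math.ceil(width / SPATIAL_RATIO),
    )


def position_reduction() -> int:
    """Number of input spatio-temporal positions mapped to one latent position."""
    return SPATIAL_RATIO * SPATIAL_RATIO * TEMPORAL_RATIO


if __name__ == "__main__":
    # 204 frames is the stated maximum; 544x992 is a hypothetical resolution.
    print(latent_shape(204, 544, 992))
    print(position_reduction())
```

Under these assumptions, the diffusion backbone operates on roughly 2048 times fewer positions than the raw video, which is what makes long clips tractable for a model of this size.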

Deploy Step-Video-T2V with One Click

1. Open the PAI Model Gallery page and select the appropriate region.

2. In the left navigation pane, choose Workspace List, select a workspace, then go to Quick Start > Model Gallery.

3. Locate the “Step-Video‑T2V” model card and click to open its detail page.

4. Click the Deploy button in the top‑right corner, choose deployment resources, and confirm to create a PAI‑EAS service.

5. After deployment, retrieve the endpoint and token from the service page to invoke the model.
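As a rough illustration of step 5, the sketch below builds an HTTP request from the endpoint and token. It assumes the service accepts the token in an Authorization header and a JSON body; the URL, the token value, and the `prompt` field are hypothetical placeholders, so check the model's detail page for the actual request schema before using it.

```python
import json
import urllib.request


def build_eas_request(endpoint: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for a deployed service."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=endpoint,
        data=body,
        method="POST",
        headers={
            "Authorization": token,  # token copied from the service page
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder values: replace with the endpoint and token of your own service.
    req = build_eas_request(
        "https://example.invalid/api/predict/step_video_t2v",
        "YOUR_EAS_TOKEN",
        {"prompt": "a cat surfing at sunset"},  # hypothetical field name
    )
    # Sending is commented out because it needs a live service:
    # with urllib.request.urlopen(req, timeout=600) as resp:
    #     print(resp.status)
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the headers and body first. Video generation can take a long time, so a generous timeout is advisable when the call is actually sent.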

Step-Audio-Chat Model Introduction

Step-Audio is described by its developers as the industry's first production-ready open-source voice interaction model; Step-Audio-Chat is its 130B-parameter multimodal chat model. It can generate expressive speech with varied emotions, dialects, languages, singing styles, and personalized voice clones, supporting scenarios such as entertainment, social media, and gaming. Key innovations include:

A 130B-parameter multimodal model that unifies speech recognition, semantic understanding, dialogue, voice cloning, and speech synthesis.

A data generation engine that produces high-quality audio used to train the 3B-parameter Step-Audio-TTS model.

Fine-grained voice control over emotion, dialect, and singing style.

A ToolCall mechanism and role-playing enhancements that improve agent performance on complex tasks.

On five major public test sets (including Llama Questions and Web Questions), Step-Audio ranks first among open-source models, with particularly strong results on the HSK-6 Chinese proficiency test.

Deploy Step-Audio-Chat with One Click

1. Open the PAI Model Gallery page and select the appropriate region.

2. In the model list, find and click the “Step‑Audio‑Chat” model card.

3. Click Deploy, provide a service name, and select resources. Because the model has 130B parameters, at least 300 GB of GPU memory is required (for example, 4 × 80 GB cards).

4. After deployment, use the provided endpoint and token to call the service; detailed invocation instructions are available on the model’s introduction page.
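The GPU memory requirement in step 3 can be sanity-checked with simple arithmetic. The sketch below assumes the weights are stored at 2 bytes per parameter (for example, bf16), which the article does not state; the gap between the weight footprint and the 300 GB figure would then be headroom for activations and caches.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits(required_gb: float, num_gpus: int, gb_per_gpu: float) -> bool:
    """True if the combined GPU memory meets the stated requirement."""
    return num_gpus * gb_per_gpu >= required_gb


if __name__ == "__main__":
    # 130B parameters at an assumed 2 bytes each.
    print(weight_memory_gb(130e9))
    # Four 80 GB cards against the 300 GB requirement from step 3.
    print(fits(300, 4, 80))
    # Three 80 GB cards would fall short.
    print(fits(300, 3, 80))
```

Under this assumption the weights alone take about 260 GB, so four 80 GB cards (320 GB in total) clear the 300 GB requirement while three do not.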

Technical Support

For updates, suggestions, or assistance, join the PAI Model Gallery community through the DingTalk groups (IDs 79680024618 and 77450028832). The platform continuously adds new SoTA models.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.


