How to Integrate a Multi‑Engine TTS Fusion Service for Stable High‑Quality Speech

This guide explains the challenges of using disparate TTS providers, introduces a unified multi‑engine speech synthesis service, covers its technical highlights and typical use cases, and provides the API specification with request/response examples and authentication steps.

360 Smart Cloud

Background

Text‑to‑speech (TTS) is a core capability in intelligent customer service, content creation, education, media broadcasting, and government information services. Existing vendor services differ in voice naturalness, latency, long‑text stability, and speaking style. Teams that combine several vendors therefore face unstable synthesis, inconsistent voice quality, and high integration cost.

Product Overview

The TTS Fusion Service integrates three engines – Alibaba Cloud TTS, ByteDance premium long‑text + large‑model asynchronous TTS, and a self‑developed low‑latency engine – and provides intelligent scheduling, stable output, and a unified API.

Technical Highlights

Multi‑engine fusion algorithm: the same text is submitted to multiple TTS engines in parallel and the best result is automatically selected.

High‑concurrency, low‑latency architecture: microservice design with distributed queues.

Long‑text optimization: comma‑based sentence splitting and large‑model support for emotional expression and context understanding.
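The first and third highlights can be illustrated with a minimal Python sketch. It is not the service's actual implementation, which the article does not describe. The engine callables, the score function, and the names split_on_commas and pick_best are all hypothetical, introduced only for illustration.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical engine type: takes text, returns synthesized audio bytes.
Engine = Callable[[str], bytes]

# A clause is a run of non-comma characters plus an optional trailing
# ASCII or full-width comma.
_CLAUSE = re.compile(r"[^,，]+[,，]?")


def split_on_commas(text: str) -> List[str]:
    """Split text into clauses at commas, keeping each comma with its clause."""
    return [part.strip() for part in _CLAUSE.findall(text) if part.strip()]


def pick_best(
    text: str,
    engines: Dict[str, Engine],
    score: Callable[[bytes], float],
) -> Tuple[str, bytes]:
    """Submit text to every engine in parallel and return the best result.

    `score` is an assumed quality metric (higher is better). Engines that
    raise an exception are skipped.
    """
    if not engines:
        raise ValueError("at least one engine is required")
    # Leaving the `with` block waits for every submitted call to finish.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in engines.items()}
    results: Dict[str, bytes] = {}
    for name, future in futures.items():
        try:
            results[name] = future.result()
        except Exception:
            continue  # a failed engine simply drops out of the comparison
    if not results:
        raise RuntimeError("all engines failed")
    best = max(results, key=lambda name: score(results[name]))
    return best, results[best]
```

With stub engines, pick_best("hi", {"a": lambda t: b"xx", "b": lambda t: b"xxxx"}, score=len) returns ("b", b"xxxx"). A real scorer would need an audio quality measure, which the source does not specify.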

Typical Scenarios

Intelligent customer‑service voice broadcast

Audiobook / knowledge‑paid content production

Digital human or virtual anchor dubbing

Government information and public‑service announcements

API Design

Create Asynchronous Task (/tts/async)

Method: POST

Headers: Content-Type: application/json, Authorization: Bearer <token>

JSON body parameters:

text (string, required): text to synthesize.

voice (string, optional, default: system voice): speaker.

format (string, required): output audio format (e.g., wav, mp3).

sample_rate (int, required): audio sample rate (e.g., 16000).

volume (int, optional, default 100): volume, 0 to 100.

speech_rate (int, optional, default 0): speech speed, -100 to 100.

pitch_rate (int, optional, default 0): pitch, -100 to 100.

enable_subtitle (bool, optional, default false): return subtitles per sentence.

enable_notify (bool, optional, default false): enable asynchronous callback.

notify_url (string, required if enable_notify is true): callback URL.

comma_flag (bool, optional, default false): enable comma‑based sentence splitting.

model_flag (bool, optional, default false): select ByteDance large‑model or premium long‑text.

Example request:

curl -X POST 'http://localhost:8080/tts/async' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer xxxx-1" \
  --data '{
    "text": "今天天气好晴朗",
    "voice": "微软-磁性男声",
    "format": "wav",
    "sample_rate": 16000,
    "enable_subtitle": true,
    "enable_notify": false,
    "speech_rate": 0
}'

Example response:

{
  "data": {"task_id":"b686a398866742498d4ea835143f5174"},
  "error_code":20000000,
  "error_message":"SUCCESS",
  "request_id":"ce55760d-43c7-4133-9478-ca6d744fd517",
  "status":200
}
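The same request can be built from Python. The sketch below only constructs the request object, so the headers and body can be inspected before anything is sent. The function name, the CHANNELS mapping keys, and the base URL are illustrative assumptions. Appending the channel suffix directly to the key follows the "Bearer xxxx-1" pattern in the example above and the suffixes listed under Calling Notes.

```python
import json
import urllib.request

# Channel suffixes as listed under "Calling Notes"; the dictionary keys
# are hypothetical labels chosen for this sketch.
CHANNELS = {"self_built": "-1", "alibaba": "-2", "bytedance": "-5"}


def build_async_request(base_url, api_key, channel, text,
                        audio_format="wav", sample_rate=16000, **options):
    """Build (but do not send) a POST request for /tts/async.

    `options` carries the optional body parameters such as voice,
    enable_subtitle, enable_notify, and notify_url.
    """
    # The parameter list marks notify_url as required when callbacks are on.
    if options.get("enable_notify") and not options.get("notify_url"):
        raise ValueError("notify_url is required when enable_notify is true")
    payload = {"text": text, "format": audio_format,
               "sample_rate": sample_rate, **options}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tts/async",
        # ensure_ascii=False keeps non-ASCII text readable in the body.
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the suffix is appended directly to the key.
            "Authorization": f"Bearer {api_key}{CHANNELS[channel]}",
        },
        method="POST",
    )
```

To send it, pass the result to urllib.request.urlopen and read data.task_id and request_id from the JSON response.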

Query Task (/tts/query)

Method: GET

Query parameters: request_id (string, required), task_id (string, required).

Example request:

curl -X GET 'http://localhost:8080/tts/query?request_id=ce55760d-...&task_id=b686a398...' \
  --header "Authorization: Bearer xxx-1"

Example response (includes audio download address and sentence‑level timing):

{
  "data":{
    "audio_address":"http://.../audio.wav",
    "sentences":[{"id":0,"text":"今天天气好晴朗","begin_time":170,"end_time":1795}]
  },
  "error_code":20000000,
  "error_message":"SUCCESS",
  "pod_ip":"11.70.176.21",
  "request_id":"tmp",
  "status":200
}
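Because the task is asynchronous, a client typically calls the query endpoint repeatedly. The article does not document what an in-progress or failed response looks like, so the following Python sketch assumes a task is complete once data.audio_address is present. The function names, polling interval, and attempt limit are illustrative choices, not part of the API.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable


def fetch_query(base_url: str, token: str, request_id: str, task_id: str) -> dict:
    """Call GET /tts/query once and return the decoded JSON body."""
    query = urllib.parse.urlencode({"request_id": request_id, "task_id": task_id})
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/tts/query?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def wait_for_audio(fetch: Callable[[], dict], interval: float = 1.0,
                   max_attempts: int = 30, sleep=time.sleep) -> dict:
    """Poll `fetch` until the response carries an audio_address.

    Assumption: a missing or empty data.audio_address means "not ready yet".
    Returns the `data` object of the first completed response.
    """
    for attempt in range(max_attempts):
        data = fetch().get("data") or {}
        if data.get("audio_address"):
            return data
        if attempt + 1 < max_attempts:
            sleep(interval)  # no need to wait after the final attempt
    raise TimeoutError(f"task not finished after {max_attempts} attempts")
```

A caller might combine the two as wait_for_audio(lambda: fetch_query(base_url, token, request_id, task_id)). Passing fetch as a callable keeps the polling loop easy to exercise without a network connection.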

Calling Notes

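The suffix at the end of the Authorization token selects the engine channel, as in Bearer xxxx-1 in the example requests: -1 for the self‑built engine, -2 for Alibaba Cloud, and -5 for ByteDance. On the ByteDance channel, model_flag chooses between the large‑model and premium long‑text variants.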

Use the returned task_id to poll the query endpoint until the task completes, then download the audio from audio_address.

When enable_subtitle is true, the response contains a sentences array with per‑sentence text and timestamps, useful for caption display or video alignment.

API Key Acquisition

An API Key/Token is required for authentication. Steps:

Open the API Marketplace at https://zyun.360.cn/product/apimarket.

Locate the audio service and create an application under the speech synthesis section.

Obtain the API Key from the application details.

Include the header Authorization: Bearer <Your-API-Key> in all API calls.

Collaboration

The core functionality of the TTS Fusion Service is complete, and the service is open for pilot integration across industries. Partners can use the standard API to integrate and extend speech synthesis capabilities.
