How to Build Real‑Time Streaming Speech Recognition with a Large‑Model API (Go & Python)
This guide explains the background of speech‑to‑text technology, introduces the large‑model streaming speech recognition API, walks through obtaining an API key, and provides detailed Go and Python code for establishing a WebSocket connection, sending full‑client and audio‑only requests, and parsing server responses.
Background
By some estimates, roughly 70% of the information people exchange is conveyed by voice, making speech recognition a critical component of intelligent assistants, voice input, and smart speakers. To meet this demand, a large‑model streaming speech recognition API has been added to the API marketplace.
Overview
The API converts spoken audio into text using advanced speech‑recognition and natural‑language‑understanding techniques. It supports scenarios such as intelligent customer service, novel reading, online education, meeting transcription, and video subtitles. The service uses a bidirectional streaming mode: the server returns a data packet only when the recognition result changes, which improves the real‑time factor (RTF) and reduces first‑word and last‑word latency.
API Documentation
Technical reference: https://zyun.360.cn/product/apimarketitem/asr
Usage Instructions
1. Obtain an API Key
Log in to the cloud platform, navigate to the API Marketplace, locate the Speech Recognition service, create an application and generate an API Key. Save the key for later use.
2. Call the API
Establish WebSocket connection
func (c *AsrWsClient) createConnection() error {
	tokenHeader := http.Header{"Authorization": []string{fmt.Sprintf("Bearer %s", "your token")}}
	fmt.Println("Connecting to ws://audio-asr.api.zyuncs.com/sauc ...")
	conn, resp, err := websocket.DefaultDialer.Dial("ws://audio-asr.api.zyuncs.com/sauc", tokenHeader)
	if err != nil {
		fmt.Println(err)
		return err
	}
	log.Printf("logid: %s\n", resp.Header.Get("X-Tt-Logid"))
	c.connect = conn
	return nil
}
Send full client request
The first message after the connection is a full client request containing user metadata, audio metadata, and request configuration.
func NewFullClientRequest() []byte {
	var request bytes.Buffer
	// Header flags mark this packet as carrying a positive sequence number.
	request.Write(DefaultHeader().WithMessageTypeSpecificFlags(POS_SEQUENCE).toBytes())
	payload := AsrRequestPayload{
		User:  UserMeta{Uid: "demo_uid"},
		Audio: AudioMeta{Format: "wav", Codec: "raw", Rate: 16000, Bits: 16, Channel: 1},
		Request: RequestMeta{
			ModelName:       "bigmodel",
			EnableITN:       true,
			EnablePUNC:      true,
			EnableDDC:       true,
			ShowUtterances:  true,
			EnableNonstream: false,
		},
	}
	// Serialize the payload to JSON and GZIP-compress it.
	payloadArr, _ := sonic.Marshal(payload)
	payloadArr = GzipCompress(payloadArr)
	payloadSizeArr := make([]byte, 4)
	binary.BigEndian.PutUint32(payloadSizeArr, uint32(len(payloadArr)))
	// Body layout: int32 sequence number (1 for the first packet),
	// uint32 payload size, then the compressed payload bytes.
	binary.Write(&request, binary.BigEndian, int32(1))
	request.Write(payloadSizeArr)
	request.Write(payloadArr)
	return request.Bytes()
}
Send audio‑only requests
After the full client request, audio data is sent as a series of audio‑only client requests. Each request carries a sequence number that continues the session sequence begun by the full client request (which used sequence 1); the sequence number of the final packet is negated to signal the end of the stream. The payload of each request is a GZIP‑compressed audio segment.
func (c *AsrWsClient) sendMessages(segmentSize int, content []byte, stopChan <-chan struct{}) error {
	messageChan := make(chan []byte)
	go func() {
		for message := range messageChan {
			// The protocol frames are binary (header + compressed payload),
			// so they must be sent as binary WebSocket messages.
			if err := c.connect.WriteMessage(websocket.BinaryMessage, message); err != nil {
				log.Printf("write message err: %s", err)
				return
			}
		}
	}()
	audioSegments := splitAudio(content, segmentSize)
	// Pace the segments with a ticker to simulate real-time capture.
	ticker := time.NewTicker(time.Duration(c.segmentDuration) * time.Millisecond)
	defer ticker.Stop()
	defer close(messageChan)
	for _, segment := range audioSegments {
		select {
		case <-ticker.C:
			// Negate the sequence number on the final segment to mark end of stream.
			if c.seq == len(audioSegments)+1 {
				c.seq = -c.seq
			}
			message := NewAudioOnlyRequest(c.seq, segment)
			messageChan <- message
			log.Printf("send message: seq: %d", c.seq)
			c.seq++
		case <-stopChan:
			return nil
		}
	}
	return nil
}
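sendMessages relies on a splitAudio helper that is not shown above. The sketch below is a minimal assumed implementation, inferred from the call site rather than taken from the service's SDK: it simply chops the raw audio buffer into fixed-size segments, with the final segment possibly shorter.

```go
package main

import "fmt"

// splitAudio chops raw audio bytes into fixed-size segments; the final
// segment may be shorter than segmentSize. This is an illustrative sketch,
// not the service's own helper.
func splitAudio(content []byte, segmentSize int) [][]byte {
	var segments [][]byte
	for start := 0; start < len(content); start += segmentSize {
		end := start + segmentSize
		if end > len(content) {
			end = len(content)
		}
		segments = append(segments, content[start:end])
	}
	return segments
}

func main() {
	// One second of 16 kHz, 16-bit mono audio is 32000 bytes; with the
	// sample's segmentSize of 6400 (200 ms), half a second splits into
	// two full segments and one partial one.
	segs := splitAudio(make([]byte, 16000), 6400)
	fmt.Println(len(segs), len(segs[0]), len(segs[2])) // prints: 3 6400 3200
}
```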
func NewAudioOnlyRequest(seq int, segment []byte) []byte {
	var request bytes.Buffer
	header := DefaultHeader()
	// A negative sequence number flags the last audio packet in the stream.
	if seq < 0 {
		header.WithMessageTypeSpecificFlags(NEG_WITH_SEQUENCE)
	} else {
		header.WithMessageTypeSpecificFlags(POS_SEQUENCE)
	}
	header.WithMessageType(CLIENT_AUDIO_ONLY_REQUEST)
	request.Write(header.toBytes())
	binary.Write(&request, binary.BigEndian, int32(seq))
	// GZIP-compress the raw audio segment before framing it.
	payload := GzipCompress(segment)
	binary.Write(&request, binary.BigEndian, int32(len(payload)))
	request.Write(payload)
	return request.Bytes()
}
Parse full server response
The server replies with a full server response whose payload is a JSON object containing the recognition result. Parse the JSON according to the schema described in the documentation to extract the transcribed text.
Sample output
segmentSize is 6400
Connecting to ws://audio-asr.api.zyuncs.com/sauc ...
logid: 1234567890
send message: seq: 2
... (intermediate JSON fragments) ...
final result: "华为致力于把数字世界带入每个人、每个家庭、每个组织,构建万物互联的智能世界" (Huawei is committed to bringing the digital world to every person, home, and organization for a fully connected, intelligent world.)
Conclusion
The large‑model streaming speech recognition API provides an optimized bidirectional streaming solution that delivers higher accuracy and contextual awareness for real‑time voice applications. By obtaining an API key, establishing a WebSocket connection, sending a full client request followed by audio‑only packets, and parsing the server’s JSON response, developers can integrate high‑quality speech‑to‑text capabilities into their services.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.