Inside the Backend Architecture of 58.com’s Intelligent Voice Robot
This article details the design and implementation of 58.com’s intelligent voice robot backend, covering its four‑layer architecture, SIP/SDP/RTP protocols, connection setup, multi‑turn voice interaction flow, DTMF handling, common troubleshooting issues, algorithm service modularization, SIP scheduling optimizations, and Java thread‑pool system tuning.
Intelligent voice robots use speech recognition, semantic understanding, and speech synthesis to enable multi‑turn human‑machine dialogs, improving efficiency and revenue in enterprise scenarios. The article examines the backend architecture of 58.com’s self‑developed voice robot, focusing on telephone‑based interactions built with the JAIN‑SIP library and streaming ASR APIs.
1. Introduction
The robot leverages voice recognition, semantic parsing, and speech synthesis to simulate human conversation, handling routine tasks in sales, customer service, and product operations.
2. Overall Architecture
The backend is abstracted into four layers:
Access Layer : Web entry point and UI for configuring dialogs and viewing metrics.
Business Logic Layer : Access services for integration, core services for voice I/O and dialog management, and algorithm sub‑services for independent feature iteration.
Data Storage Layer : Persists dialog logs, recordings, and related metadata.
Third‑Party Services Layer : External dependencies required by the robot.
3. Implementing Human‑Machine Voice Interaction
3.1 Protocol Overview
SIP (Session Initiation Protocol) : Controls multimedia session setup, modification, and termination. Status codes such as 100 (Trying), 180 (Ringing), 183 (Session Progress), and 200 (OK, the call was answered) guide call flow and trigger voice packet handling.
SDP (Session Description Protocol) : Negotiates media parameters. Example fields: a=rtpmap:0 PCMU/8000 (payload type 0 is PCMU, i.e. G.711 μ‑law audio sampled at 8 kHz), c=IN IP4 … (connection address), and m=audio … (media type, port, transport, and payload formats).
RTP (Real‑time Transport Protocol) : Carries the actual audio stream. Important header fields include PT (payload type), Sequence Number, Timestamp, and SSRC.
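To make the RTP header fields above concrete, here is a minimal sketch of parsing the 12-byte fixed header defined in RFC 3550. The class and method names are illustrative and not taken from the 58.com codebase; header extensions and CSRC lists are ignored.

```java
import java.nio.ByteBuffer;

/** Illustrative view of the 12-byte fixed RTP header (RFC 3550). */
final class RtpHeader {
    final int version;        // 2 bits, value 2 for RTP
    final boolean marker;     // M bit
    final int payloadType;    // PT, 7 bits (0 = PCMU)
    final int sequenceNumber; // 16 bits, +1 per packet, wraps after 65535
    final long timestamp;     // 32 bits, in sampling-clock units
    final long ssrc;          // 32 bits, identifies the stream source

    private RtpHeader(int version, boolean marker, int payloadType,
                      int sequenceNumber, long timestamp, long ssrc) {
        this.version = version;
        this.marker = marker;
        this.payloadType = payloadType;
        this.sequenceNumber = sequenceNumber;
        this.timestamp = timestamp;
        this.ssrc = ssrc;
    }

    /** Parses the fixed header from the start of a raw RTP packet. */
    static RtpHeader parse(byte[] packet) {
        if (packet.length < 12) {
            throw new IllegalArgumentException("RTP packet shorter than 12 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(packet); // big-endian by default (network order)
        int b0 = buf.get() & 0xFF;                // V(2) P(1) X(1) CC(4)
        int b1 = buf.get() & 0xFF;                // M(1) PT(7)
        int seq = buf.getShort() & 0xFFFF;
        long ts = buf.getInt() & 0xFFFFFFFFL;
        long ssrc = buf.getInt() & 0xFFFFFFFFL;
        return new RtpHeader(b0 >>> 6, (b1 & 0x80) != 0, b1 & 0x7F, seq, ts, ssrc);
    }

    /** True if cur directly follows prev, allowing for 16-bit wrap-around. */
    static boolean isNextInSequence(int prev, int cur) {
        return ((prev + 1) & 0xFFFF) == cur;
    }
}
```

A receiver could call `RtpHeader.parse` on each UDP datagram and use `isNextInSequence` to flag gaps or reordering before handing the payload to the decoder.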
3.2 Establishing the Connection
Before voice exchange, the robot initiates a SIP call. It sends SIP signaling with an embedded SDP description to the carrier network. After the carrier’s response, the robot receives a SIP reply containing the agreed media parameters, allowing both sides to start exchanging RTP audio packets.
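The SDP body carried in that signaling can be illustrated with plain string handling. The sketch below builds an offer for PCMU audio and reads the audio port back out of an SDP answer. The user name `robot`, the zero session identifiers, and dynamic payload type 101 for telephone events are assumptions chosen for illustration; the article does not specify them, and the JAIN-SIP calls that would wrap this body in an INVITE are omitted.

```java
/** Illustrative helpers for the SDP body exchanged during call setup. */
final class SdpOffer {

    /** Builds a minimal audio offer: PCMU at 8 kHz plus telephone events. */
    static String build(String localIp, int rtpPort) {
        return String.join("\r\n",
                "v=0",
                "o=robot 0 0 IN IP4 " + localIp,          // origin (values assumed)
                "s=-",
                "c=IN IP4 " + localIp,                    // where we receive RTP
                "t=0 0",
                "m=audio " + rtpPort + " RTP/AVP 0 101",  // port and offered payload types
                "a=rtpmap:0 PCMU/8000",
                "a=rtpmap:101 telephone-event/8000",      // 101 is an assumed dynamic PT
                "a=fmtp:101 0-15",
                "a=sendrecv") + "\r\n";
    }

    /** Extracts the port from the first m=audio line of an SDP body. */
    static int audioPort(String sdp) {
        for (String line : sdp.split("\r?\n")) {
            if (line.startsWith("m=audio ")) {
                return Integer.parseInt(line.split(" ")[1]);
            }
        }
        throw new IllegalArgumentException("no m=audio line in SDP");
    }
}
```

Once the answer arrives, the port from its `m=audio` line and the address from its `c=` line tell the robot where to send its own RTP packets.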
3.3 Voice Interaction Flow
Incoming audio passes through a sliding‑window voice activity detector (VAD). When the first 13 frames contain speech, the robot treats this as the start of an utterance; otherwise it marks the end. Speech frames are streamed to a real‑time ASR service, which produces text once the user stops speaking. The text is passed to an intent‑recognition service, which selects a predefined action (e.g., DTMF handling or a scripted response), and the reply is synthesized via TTS.
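The article does not give the exact VAD rule, so the following is only one plausible reading, sketched under stated assumptions: each frame is classified as speech by a simple mean-energy threshold (a stand-in for whatever per-frame detector the robot really uses), an utterance starts when every frame in the window is speech, and it ends when none is. The class name and threshold are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative sliding-window VAD; the start/end rule is an assumption. */
final class SlidingWindowVad {
    enum Event { NONE, SPEECH_START, SPEECH_END }

    private final int windowSize;           // e.g. 13 frames
    private final double energyThreshold;   // assumed per-frame speech threshold
    private final Deque<Boolean> window = new ArrayDeque<>();
    private int voicedCount = 0;
    private boolean inSpeech = false;

    SlidingWindowVad(int windowSize, double energyThreshold) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
        this.energyThreshold = energyThreshold;
    }

    /** Feeds one frame of 16-bit PCM samples and reports any state change. */
    Event accept(short[] frame) {
        boolean voiced = meanEnergy(frame) >= energyThreshold;
        window.addLast(voiced);
        if (voiced) {
            voicedCount++;
        }
        // Slide the window: drop the oldest decision once it is over capacity.
        if (window.size() > windowSize && window.removeFirst()) {
            voicedCount--;
        }
        if (window.size() < windowSize) {
            return Event.NONE; // not enough history yet
        }
        if (!inSpeech && voicedCount == windowSize) {
            inSpeech = true;
            return Event.SPEECH_START;
        }
        if (inSpeech && voicedCount == 0) {
            inSpeech = false;
            return Event.SPEECH_END;
        }
        return Event.NONE;
    }

    private static double meanEnergy(short[] frame) {
        if (frame.length == 0) {
            return 0;
        }
        double sum = 0;
        for (short s : frame) {
            sum += (double) s * s;
        }
        return sum / frame.length;
    }
}
```

Frames received between `SPEECH_START` and `SPEECH_END` would be the ones streamed to the ASR service; requiring a full window of agreement trades a little latency for robustness against isolated noisy frames.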
3.4 DTMF Key Recognition
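Telephone key presses are signaled as DTMF (dual‑tone multi‑frequency) tones. The robot follows RFC 2833, under which key presses are carried as telephone‑event payloads inside the RTP stream rather than as in‑band audio. Each event can therefore be mapped to its digit or symbol with minimal overhead, without analyzing the audio signal.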
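As an illustration, an RFC 2833 telephone-event payload is four bytes: an event code, an end bit with a 6-bit volume, and a 16-bit duration. The sketch below decodes the DTMF subset (event codes 0 to 15). The class name is illustrative, and the payload is assumed to have already been separated from the RTP header.

```java
/** Illustrative decoder for an RFC 2833 telephone-event payload (DTMF subset). */
final class DtmfEvent {
    // Event codes 0-15 map to these keys in order.
    private static final String SYMBOLS = "0123456789*#ABCD";

    final char symbol;   // key that was pressed
    final boolean end;   // E bit: set on the packets that close the event
    final int volume;    // power level in -dBm0, 0..63
    final int duration;  // elapsed duration in timestamp units

    private DtmfEvent(char symbol, boolean end, int volume, int duration) {
        this.symbol = symbol;
        this.end = end;
        this.volume = volume;
        this.duration = duration;
    }

    /** Parses the 4-byte telephone-event payload that follows the RTP header. */
    static DtmfEvent parse(byte[] payload) {
        if (payload.length < 4) {
            throw new IllegalArgumentException("telephone-event payload needs 4 bytes");
        }
        int event = payload[0] & 0xFF;
        if (event >= SYMBOLS.length()) {
            throw new IllegalArgumentException("not a DTMF event code: " + event);
        }
        boolean end = (payload[1] & 0x80) != 0;   // E bit; the next bit is reserved
        int volume = payload[1] & 0x3F;
        int duration = ((payload[2] & 0xFF) << 8) | (payload[3] & 0xFF);
        return new DtmfEvent(SYMBOLS.charAt(event), end, volume, duration);
    }
}
```

One key press spans several packets that share an RTP timestamp, and the final packet is typically retransmitted, so a consumer would normally act once per timestamp (for example on the first packet with `end` set) rather than once per packet.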
3.5 Common Issues
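Noise : Often caused by out‑of‑order RTP Sequence Numbers; ensure the sender increments the sequence number by one for every packet.
Audio clipping : Caused by concatenating pre‑synthesized audio clips that each carry a 44‑byte file header (the size of a canonical WAV header), so the header bytes are played back as sound; stripping the header before concatenation resolves the artifact.
Speed mismatch : Playback sounds too fast or too slow when sender and receiver use different sampling rates.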
No audio : Diagnose by checking network connectivity, capturing RTP with tcpdump, verifying RTP header fields (PT, Sequence Number, Timestamp), and confirming matching codec configurations.
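The header-stripping fix for the clipping issue can be sketched as follows. This assumes the pre-synthesized clips are canonical PCM WAV files whose header is exactly 44 bytes; files with extra chunks have longer headers and would need real chunk parsing. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative concatenation of audio clips with their WAV headers removed. */
final class PcmConcat {
    static final int WAV_HEADER_BYTES = 44; // canonical PCM WAV header size (assumed)

    /** Joins clips into one raw audio buffer, skipping each clip's header if present. */
    static byte[] concatWithoutHeaders(byte[]... clips) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] clip : clips) {
            int offset = hasRiffHeader(clip) ? WAV_HEADER_BYTES : 0;
            out.write(clip, offset, clip.length - offset);
        }
        return out.toByteArray();
    }

    /** Checks for the "RIFF" and "WAVE" tags at their standard offsets. */
    static boolean hasRiffHeader(byte[] clip) {
        return clip.length >= WAV_HEADER_BYTES
                && clip[0] == 'R' && clip[1] == 'I' && clip[2] == 'F' && clip[3] == 'F'
                && clip[8] == 'W' && clip[9] == 'A' && clip[10] == 'V' && clip[11] == 'E';
    }
}
```

Checking for the `RIFF`/`WAVE` tags instead of blindly dropping 44 bytes keeps the helper safe to use on clips that are already raw audio.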
4. Algorithm Sub‑services
4.1 Service‑Oriented Architecture
Initially, all logic lived in a monolithic service, leading to high coupling and deployment risk. The system was refactored into independent algorithm services, enabling separate iteration, A/B testing on a distributed experimentation platform, and clearer responsibility boundaries for data storage, model training, and monitoring.
4.2 SIP Scheduling Strategy Optimization
Two empirical findings guided improvements:
Local‑area numbers achieve higher answer rates.
Numbers with higher usage frequencies attract more user complaints, reducing answer rates.
Optimizations applied:
Group numbers by business line and caller city to shrink the candidate pool.
Compute a weight from geographic distance and usage count; prioritize high‑weight numbers.
Pre‑compute city‑to‑city distances and parallelize selection during scheduling to boost performance.
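The optimizations above can be sketched as a weighted selection over an already filtered candidate list. The weight formula, the `|`-separated distance key, the fallback distance, and all class and field names are assumptions made for illustration; the article states only that the weight depends on geographic distance and usage count.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Illustrative outbound-number selection by distance and usage. */
final class NumberScheduler {

    static final class OutboundNumber {
        final String number;
        final String city;
        final int usageCount;

        OutboundNumber(String number, String city, int usageCount) {
            this.number = number;
            this.city = city;
            this.usageCount = usageCount;
        }
    }

    private static final double UNKNOWN_DISTANCE_KM = 10_000.0; // assumed fallback

    /** Hypothetical weight: closer and less frequently used numbers score higher. */
    static double weight(double distanceKm, int usageCount) {
        return 1.0 / (1.0 + distanceKm / 100.0) / (1.0 + usageCount);
    }

    /**
     * Picks the highest-weight candidate for a user in userCity.
     * distanceKm is a precomputed table keyed as "numberCity|userCity".
     */
    static Optional<OutboundNumber> pick(List<OutboundNumber> candidates,
                                         String userCity,
                                         Map<String, Double> distanceKm) {
        return candidates.parallelStream() // weights are independent, so scoring can run in parallel
                .max(Comparator.comparingDouble((OutboundNumber n) -> weight(
                        distanceKm.getOrDefault(n.city + "|" + userCity, UNKNOWN_DISTANCE_KM),
                        n.usageCount)));
    }
}
```

Grouping by business line and city would happen before `pick` is called, so the stream only scores the reduced candidate pool; the precomputed map replaces any per-call distance calculation.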
5. System Optimization
The Java thread pool was re‑engineered so that its parameters can be changed at runtime through a distributed config center (WConfig) and observed through monitoring (WMonitor). I/O‑bound workloads receive larger pools for parallelism, while pools for CPU‑bound tasks are kept smaller to limit context‑switch overhead. The changes yielded a noticeable latency reduction and streamlined developer workflows by automating thread‑pool tuning.
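Runtime resizing of this kind can be built on the standard `ThreadPoolExecutor` setters. The sketch below shows only that part: the WConfig listener that would call `resize` and the WMonitor reporting that would consume `snapshot` are internal to 58.com and are not shown, and the class name, queue capacity, and keep-alive time are assumptions.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Illustrative thread pool whose size can be changed while it is running. */
final class ResizablePool {
    private final ThreadPoolExecutor executor;

    ResizablePool(int core, int max) {
        // Queue capacity and keep-alive are assumed values.
        this.executor = new ThreadPoolExecutor(
                core, max, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>(1000));
    }

    /** Would be invoked by a config-center change listener (wiring not shown). */
    synchronized void resize(int newCore, int newMax) {
        if (newCore < 0 || newMax <= 0 || newCore > newMax) {
            throw new IllegalArgumentException("need 0 <= core <= max and max > 0");
        }
        // Order matters: the setters reject states where core exceeds max.
        if (newMax >= executor.getMaximumPoolSize()) {
            executor.setMaximumPoolSize(newMax); // growing: raise the ceiling first
            executor.setCorePoolSize(newCore);
        } else {
            executor.setCorePoolSize(newCore);   // shrinking: lower the core first
            executor.setMaximumPoolSize(newMax);
        }
    }

    /** Values a monitoring agent might scrape periodically. */
    String snapshot() {
        return "core=" + executor.getCorePoolSize()
                + " max=" + executor.getMaximumPoolSize()
                + " active=" + executor.getActiveCount()
                + " queued=" + executor.getQueue().size();
    }

    ThreadPoolExecutor executor() {
        return executor;
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

Because the setters take effect on a live executor, a configuration change does not require a restart; excess idle threads are retired over time after a shrink.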
Speaker : Li Hongxun, senior backend engineer at 58.com, responsible for the voice‑robot backend architecture since 2017.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
