Analysis of 58.com Intelligent Voice Robot Backend Architecture
The article reviews the design and implementation of 58.com’s intelligent voice robot backend, detailing its four‑layer architecture, SIP/SDP/RTP protocols, multi‑round voice interaction flow, algorithm service modularization, SIP scheduling optimizations, and Java thread‑pool system tuning for high‑concurrency scenarios.
Built on speech and semantic-understanding technology, intelligent voice robots can replace or assist humans in routine tasks, improving efficiency and revenue across enterprise scenarios. This talk analyzes the self-developed backend architecture of 58.com's voice robot, focusing on telephone communication built on the JAIN-SIP library and streaming speech-recognition APIs, and shares technical details on concurrency control, scheduling strategies, and performance optimization.
1. Introduction
The robot leverages speech recognition, semantic understanding, and speech synthesis to enable multi‑turn dialogues, simulating human conversation and handling routine work in sales, customer service, and product operations.
2. Overall Architecture
The backend is divided into four layers: the access layer (web portal and UI), the business-logic layer (access service, core service for voice I/O and dialogue management, algorithm sub-services), the data-storage layer (dialogue logs, recordings), and third-party services.
3. Enabling Human‑Machine Voice Interaction
3.1 Protocol Overview
The system uses three key protocols:
SIP (Session Initiation Protocol): establishes, modifies, and terminates multimedia sessions; response status codes drive the call flow (e.g., 100 Trying, 180 Ringing / 183 Session Progress, 200 OK).
SDP (Session Description Protocol): negotiates media parameters such as the codec (PCMU/8000) and transport addresses.
RTP (Real-time Transport Protocol): carries the audio stream; each packet header includes fields such as payload type, sequence number, timestamp, and SSRC.
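To make the RTP fields listed above concrete, here is a minimal sketch of parsing the fixed 12-byte RTP header defined in RFC 3550. The class name `RtpHeader` is hypothetical and not taken from the 58.com codebase; CSRC entries and header extensions are ignored for brevity.

```java
import java.nio.ByteBuffer;

/** Illustrative (hypothetical) parser for the fixed 12-byte RTP header, RFC 3550. */
final class RtpHeader {
    final int version;        // 2 bits, normally 2
    final boolean marker;     // 1 bit
    final int payloadType;    // 7 bits, e.g. 0 = PCMU
    final int sequenceNumber; // 16 bits, used to detect loss and reordering
    final long timestamp;     // 32 bits, in units of the codec clock rate
    final long ssrc;          // 32 bits, identifies the stream source

    private RtpHeader(int version, boolean marker, int payloadType,
                      int sequenceNumber, long timestamp, long ssrc) {
        this.version = version;
        this.marker = marker;
        this.payloadType = payloadType;
        this.sequenceNumber = sequenceNumber;
        this.timestamp = timestamp;
        this.ssrc = ssrc;
    }

    static RtpHeader parse(byte[] packet) {
        if (packet.length < 12) {
            throw new IllegalArgumentException("RTP packet shorter than 12 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(packet); // ByteBuffer is big-endian by default
        int b0 = buf.get() & 0xFF;
        int b1 = buf.get() & 0xFF;
        int seq = buf.getShort() & 0xFFFF;        // mask to read as unsigned
        long ts = buf.getInt() & 0xFFFFFFFFL;
        long ssrc = buf.getInt() & 0xFFFFFFFFL;
        return new RtpHeader(b0 >>> 6, (b1 & 0x80) != 0, b1 & 0x7F, seq, ts, ssrc);
    }
}
```

Checking these fields against a packet capture is one way to confirm that sequence numbers increase by one per packet and that the payload type matches the codec negotiated in SDP.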
3.2 Connection Establishment
Using SIP, the robot initiates outbound calls, sends SIP signaling with SDP media description, receives carrier responses, and establishes a bidirectional audio channel for voice packet exchange.
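As an illustration of the SDP media description carried in the SIP signaling, the sketch below assembles a minimal audio offer for PCMU/8000. The class name, the `robot` user name, and the use of dynamic payload type 101 for telephone events are assumptions made for this example, not details from the talk.

```java
/** Hypothetical helper that builds a minimal SDP audio offer. */
final class SdpOffer {
    static String build(String localIp, int rtpPort, long sessionId) {
        return String.join("\r\n",
            "v=0",
            // o=<username> <sess-id> <sess-version> <nettype> <addrtype> <address>
            "o=robot " + sessionId + " " + sessionId + " IN IP4 " + localIp,
            "s=-",
            "c=IN IP4 " + localIp,              // address where RTP should be sent
            "t=0 0",
            "m=audio " + rtpPort + " RTP/AVP 0 101",
            "a=rtpmap:0 PCMU/8000",             // static payload type 0 = G.711 mu-law
            "a=rtpmap:101 telephone-event/8000", // assumed dynamic type for key presses
            "a=fmtp:101 0-15",
            "a=sendrecv") + "\r\n";
    }
}
```

For example, `SdpOffer.build("192.0.2.10", 30000, 1L)` yields an offer whose media line is `m=audio 30000 RTP/AVP 0 101`; the answer from the carrier side then fixes the codec and the remote RTP address.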
3.3 Voice Interaction Flow
Incoming audio is buffered in a sliding window; a VAD algorithm detects speech start/end. Speech is streamed to a real‑time ASR service, producing text that is fed to an intent‑recognition module. Based on the intent, the robot selects a strategy (e.g., DTMF handling) and generates a response via TTS.
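The talk does not say which VAD algorithm is used. As a rough sketch of the sliding-window idea only, the following assumes a simple energy threshold per frame: speech starts when every frame in the window is voiced and ends when none is. `EnergyVad`, its parameters, and the threshold rule are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical energy-based VAD over a sliding window of PCM frames. */
final class EnergyVad {
    enum Event { NONE, SPEECH_START, SPEECH_END }

    private final int windowFrames;
    private final double energyThreshold;
    private final Deque<Boolean> window = new ArrayDeque<>();
    private int voicedFrames = 0;
    private boolean speaking = false;

    EnergyVad(int windowFrames, double energyThreshold) {
        this.windowFrames = windowFrames;
        this.energyThreshold = energyThreshold;
    }

    /** Feeds one frame of 16-bit PCM samples and reports a state change, if any. */
    Event accept(short[] frame) {
        boolean voiced = meanSquare(frame) > energyThreshold;
        window.addLast(voiced);
        if (voiced) voicedFrames++;
        if (window.size() > windowFrames) {
            boolean dropped = window.removeFirst(); // slide the window forward
            if (dropped) voicedFrames--;
        }
        if (window.size() < windowFrames) return Event.NONE; // window not full yet
        if (!speaking && voicedFrames == windowFrames) {
            speaking = true;
            return Event.SPEECH_START;
        }
        if (speaking && voicedFrames == 0) {
            speaking = false;
            return Event.SPEECH_END;
        }
        return Event.NONE;
    }

    private static double meanSquare(short[] frame) {
        if (frame.length == 0) return 0.0;
        double sum = 0;
        for (short s : frame) sum += (double) s * s;
        return sum / frame.length;
    }
}
```

In a pipeline like the one described, frames between `SPEECH_START` and `SPEECH_END` would be streamed to the ASR service; requiring a full window of agreement trades a little latency for fewer false triggers on short noise bursts.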
3.4 DTMF (Key Press) Recognition
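Telephone key presses are signaled as DTMF, in which each key corresponds to a pair of tones. With RFC 2833, the key presses are carried as telephone-event payloads in the RTP stream rather than as in-band audio; the robot decodes these events to capture user selections.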
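Assuming the key presses arrive as RFC 2833 telephone-event payloads, each payload is four bytes: an event code, an end flag plus volume, and a 16-bit duration. The sketch below decodes one payload; the class name `DtmfEvent` is hypothetical.

```java
/** Hypothetical decoder for a 4-byte RFC 2833 telephone-event payload. */
final class DtmfEvent {
    // Event codes 0-9 are digits, 10 is '*', 11 is '#', 12-15 are A-D.
    private static final String KEYS = "0123456789*#ABCD";

    final char key;
    final boolean end;   // E bit: set on the packets that close the event
    final int volume;    // 6 bits, power level expressed as -dBm0
    final int duration;  // 16 bits, in RTP timestamp units

    private DtmfEvent(char key, boolean end, int volume, int duration) {
        this.key = key;
        this.end = end;
        this.volume = volume;
        this.duration = duration;
    }

    static DtmfEvent parse(byte[] payload) {
        if (payload.length < 4) {
            throw new IllegalArgumentException("telephone-event payload needs 4 bytes");
        }
        int event = payload[0] & 0xFF;
        if (event >= KEYS.length()) {
            throw new IllegalArgumentException("not a DTMF event code: " + event);
        }
        boolean end = (payload[1] & 0x80) != 0;
        int volume = payload[1] & 0x3F;
        int duration = ((payload[2] & 0xFF) << 8) | (payload[3] & 0xFF);
        return new DtmfEvent(KEYS.charAt(event), end, volume, duration);
    }
}
```

Because one key press spans several RTP packets with the same timestamp, a receiver would typically act only once per event, for instance on the first packet with the end flag set.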
3.5 Common Issues
Typical problems include noise (out-of-order RTP sequence numbers), audio clipping (an extra 44-byte header, likely the WAV file header, left on synthesized speech and played as audio), wrong playback speed (sampling-rate mismatch), and no audio (network faults, codec mismatch, or packet loss). Troubleshooting involves checking the network, capturing packets with tcpdump or Wireshark, and verifying the RTP header fields.
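For the 44-byte header issue, one possible fix is to strip the header before packetizing the audio. The sketch below assumes the synthesized speech is a canonical PCM WAV file, whose header is 44 bytes and starts with `RIFF` ... `WAVE`; files with extra chunks have longer headers and would need real chunk parsing. The class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: drops a canonical 44-byte WAV header, if present. */
final class WavHeaderStripper {
    private static final int CANONICAL_HEADER_BYTES = 44;

    static byte[] toRawPcm(byte[] audio) {
        boolean looksLikeWav = audio.length >= CANONICAL_HEADER_BYTES
            && "RIFF".equals(new String(audio, 0, 4, StandardCharsets.US_ASCII))
            && "WAVE".equals(new String(audio, 8, 4, StandardCharsets.US_ASCII));
        // Audio without the RIFF/WAVE signature is assumed to be raw samples already.
        return looksLikeWav
            ? Arrays.copyOfRange(audio, CANONICAL_HEADER_BYTES, audio.length)
            : audio;
    }
}
```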
4. Algorithm Sub‑Services
Initially, all logic lived in a monolithic service, causing high coupling and maintenance cost. The team refactored it into microservices: a core service orchestrates the sub-services, each of which handles a specific algorithm, enabling independent iteration and A/B testing via the 日晷 (Rigui) platform.
4.1 SIP Scheduling Optimization
The SIP scheduler selects usable SIP numbers for calls. Empirical findings:
Calls between numbers in the same locality have higher answer rates.
Numbers with higher usage receive more complaints, reducing answer rates.
Optimizations applied:
Group numbers by business line and caller city to narrow the candidate set.
Compute a weight based on distance and usage count; prioritize higher‑weight numbers.
Pre‑compute city‑to‑city distances and parallelize selection to improve performance.
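The talk does not give the weight formula. The sketch below is one way the selection steps above could fit together, under stated assumptions: distances are looked up from a pre-computed table, each prior use of a number is penalized as if it added a fixed number of kilometers, and a parallel stream picks the highest weight. All class names, the formula, and the constants are hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical candidate number, already filtered by business line. */
final class SipNumber {
    final String number;
    final String city;
    final int usageCount;

    SipNumber(String number, String city, int usageCount) {
        this.number = number;
        this.city = city;
        this.usageCount = usageCount;
    }
}

/** Hypothetical selector: closer and less-used numbers get higher weight. */
final class SipNumberSelector {
    private static final double UNKNOWN_DISTANCE_KM = 5000.0; // assumed fallback
    private final Map<String, Double> distanceKm;  // pre-computed, keyed by pairKey
    private final double usagePenaltyKm;           // assumed cost per prior use

    SipNumberSelector(Map<String, Double> distanceKm, double usagePenaltyKm) {
        this.distanceKm = distanceKm;
        this.usagePenaltyKm = usagePenaltyKm;
    }

    /** Order-independent key so each city pair is stored once. */
    static String pairKey(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    double weight(SipNumber n, String calleeCity) {
        double distance = n.city.equals(calleeCity)
            ? 0.0
            : distanceKm.getOrDefault(pairKey(n.city, calleeCity), UNKNOWN_DISTANCE_KM);
        return 1.0 / (1.0 + distance + usagePenaltyKm * n.usageCount);
    }

    Optional<SipNumber> select(List<SipNumber> candidates, String calleeCity) {
        return candidates.parallelStream()
            .max(Comparator.comparingDouble((SipNumber n) -> weight(n, calleeCity)));
    }
}
```

With a penalty of 10 km per use, a local number used 10 times still beats an unused number 1,000 km away, but loses once its usage reaches 200. Tuning that trade-off is exactly where the empirical answer-rate findings would come in.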
5. System Optimization
The team tuned the JDK's built-in thread pool for mixed I/O- and CPU-bound workloads. By combining WConfig and WMonitor with insights from the JDK source code, they made thread-pool parameters adjustable at runtime, which reduced latency and allowed capacity to scale cost-effectively.
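The WConfig and WMonitor APIs are internal and not shown in the talk, so the sketch below covers only the JDK side: resizing a running `ThreadPoolExecutor` through its public setters. The helper name `DynamicPool.resize` is hypothetical; in practice it would be invoked from a configuration-change callback.

```java
import java.util.concurrent.ThreadPoolExecutor;

/** Hypothetical helper that applies new sizes to a live thread pool. */
final class DynamicPool {
    static void resize(ThreadPoolExecutor pool, int core, int max) {
        if (core < 0 || max <= 0 || max < core) {
            throw new IllegalArgumentException("need 0 <= core <= max and max > 0");
        }
        // The setters reject states where core > max, so the order of the
        // two calls depends on whether the pool is growing or shrinking.
        if (max >= pool.getMaximumPoolSize()) {
            pool.setMaximumPoolSize(max); // grow: raise the ceiling first
            pool.setCorePoolSize(core);
        } else {
            pool.setCorePoolSize(core);   // shrink: lower the floor first
            pool.setMaximumPoolSize(max);
        }
    }
}
```

Note that with a bounded or unbounded work queue, threads beyond the core size are only created once the queue is full, so raising the maximum alone may have no visible effect; the queue capacity is fixed at construction time in the standard JDK queues.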
For the full PPT, follow the 58 Technology public account and request the material via the assistant’s WeChat (jishu‑58) with the keyword “58同城智能语音机器人后端架构解析”.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.