
Analysis of 58.com Intelligent Voice Robot Backend Architecture

The article reviews the design and implementation of 58.com’s intelligent voice robot backend, detailing its four‑layer architecture, SIP/SDP/RTP protocols, multi‑round voice interaction flow, algorithm service modularization, SIP scheduling optimizations, and Java thread‑pool system tuning for high‑concurrency scenarios.

58 Tech

Based on voice semantics technology, intelligent voice robots can replace or assist humans in routine tasks, improving efficiency and revenue across enterprise scenarios. This talk analyzes the self‑developed backend architecture of 58.com’s voice robot, focusing on telephone communication using the JAIN‑SIP library and streaming speech‑recognition APIs, and shares technical details on concurrency control, scheduling strategies, and performance optimization.

1. Introduction

The robot leverages speech recognition, semantic understanding, and speech synthesis to enable multi‑turn dialogues, simulating human conversation and handling routine work in sales, customer service, and product operations.

2. Overall Architecture

The backend is abstracted into four layers: the access layer (WEB portal and UI), the business‑logic layer (access service, core service for voice I/O and dialogue management, algorithm sub‑services), the data‑storage layer (dialogue logs, recordings), and third‑party services.

3. Enabling Human‑Machine Voice Interaction

3.1 Protocol Overview

The system uses three key protocols:

SIP (Session Initiation Protocol): establishes, modifies, and terminates multimedia sessions; response status codes drive the call flow (e.g., 100 Trying, 180 Ringing / 183 Session Progress, 200 OK).

SDP (Session Description Protocol) : negotiates media parameters such as codec (PCMU/8000) and transport addresses.

RTP (Real‑time Transport Protocol) : carries the audio stream with fields like payload type, sequence number, timestamp, and SSRC.
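To make the RTP fields listed above concrete, here is a minimal sketch that reads the fixed 12-byte RTP header (RFC 3550 layout). The class name `RtpHeader` is illustrative and is not taken from the 58.com codebase; the CSRC list and header extensions are ignored.

```java
import java.nio.ByteBuffer;

/** Fixed 12-byte RTP header (RFC 3550). Illustrative class, not the article's code. */
final class RtpHeader {
    final int version;        // 2 bits, normally 2
    final boolean marker;     // 1 bit, often flags the start of a talk spurt
    final int payloadType;    // 7 bits, e.g. 0 = PCMU
    final int sequenceNumber; // 16 bits, increments by one per packet
    final long timestamp;     // 32 bits, in sampling-clock units
    final long ssrc;          // 32 bits, identifies the stream source

    private RtpHeader(int version, boolean marker, int payloadType,
                      int sequenceNumber, long timestamp, long ssrc) {
        this.version = version;
        this.marker = marker;
        this.payloadType = payloadType;
        this.sequenceNumber = sequenceNumber;
        this.timestamp = timestamp;
        this.ssrc = ssrc;
    }

    static RtpHeader parse(byte[] packet) {
        if (packet.length < 12) {
            throw new IllegalArgumentException("RTP packet shorter than 12 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(packet); // big-endian by default (network order)
        int b0 = buf.get() & 0xFF;
        int b1 = buf.get() & 0xFF;
        int seq = buf.getShort() & 0xFFFF;        // read as unsigned 16-bit
        long ts = buf.getInt() & 0xFFFFFFFFL;     // read as unsigned 32-bit
        long ssrc = buf.getInt() & 0xFFFFFFFFL;
        return new RtpHeader(b0 >>> 6, (b1 & 0x80) != 0, b1 & 0x7F, seq, ts, ssrc);
    }
}
```

Reading the sequence number and timestamp as unsigned values matters because Java's `short` and `int` are signed, and both fields wrap around during long calls.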

3.2 Connection Establishment

Using SIP, the robot initiates outbound calls, sends SIP signaling with SDP media description, receives carrier responses, and establishes a bidirectional audio channel for voice packet exchange.
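As an illustration of the SDP media description carried in the SIP signaling, the sketch below builds a plain-text offer for PCMU/8000 audio. The class name, the `robot` user name, the session name, and the choice of dynamic payload type 101 for telephone events are assumptions for the example; the article does not show the actual offer 58.com sends.

```java
/** Builds a minimal SDP audio offer. Names and values are illustrative assumptions. */
final class SdpOffer {
    static String build(String localIp, int rtpPort, long sessionId) {
        return String.join("\r\n",
                "v=0",                                       // protocol version
                "o=robot " + sessionId + " " + sessionId + " IN IP4 " + localIp, // origin
                "s=voice-robot-call",                        // session name
                "c=IN IP4 " + localIp,                       // address that will receive RTP
                "t=0 0",                                     // unbounded session time
                "m=audio " + rtpPort + " RTP/AVP 0 101",     // offered payload types
                "a=rtpmap:0 PCMU/8000",                      // G.711 mu-law at 8 kHz
                "a=rtpmap:101 telephone-event/8000",         // DTMF events (assumed PT 101)
                "a=fmtp:101 0-15",                           // events 0-15: digits, *, #, A-D
                "a=sendrecv") + "\r\n";
    }
}
```

The peer's answer picks from the offered payload types and supplies its own `c=` address and `m=` port, which together define the bidirectional audio channel.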

3.3 Voice Interaction Flow

Incoming audio is buffered in a sliding window; a VAD algorithm detects speech start/end. Speech is streamed to a real‑time ASR service, producing text that is fed to an intent‑recognition module. Based on the intent, the robot selects a strategy (e.g., DTMF handling) and generates a response via TTS.
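The article does not say which VAD algorithm is used, so the following is only a sketch of one common approach: an energy threshold over fixed-size PCM frames, with run-length counters standing in for the sliding window. The class name, threshold, and frame counts are all assumptions.

```java
/** Simple energy-based VAD over 16-bit PCM frames. An assumed technique, not 58.com's. */
final class EnergyVad {
    enum Event { NONE, SPEECH_START, SPEECH_END }

    private final double energyThreshold; // mean-square energy above which a frame is voiced
    private final int startFrames;        // consecutive voiced frames needed to open speech
    private final int endFrames;          // consecutive silent frames needed to close speech
    private int voicedRun;
    private int silentRun;
    private boolean inSpeech;

    EnergyVad(double energyThreshold, int startFrames, int endFrames) {
        this.energyThreshold = energyThreshold;
        this.startFrames = startFrames;
        this.endFrames = endFrames;
    }

    /** Feeds one frame and reports whether a speech boundary was crossed. */
    Event accept(short[] frame) {
        double sum = 0;
        for (short s : frame) {
            sum += (double) s * s;
        }
        boolean voiced = frame.length > 0 && sum / frame.length > energyThreshold;
        if (voiced) {
            voicedRun++;
            silentRun = 0;
        } else {
            silentRun++;
            voicedRun = 0;
        }
        if (!inSpeech && voicedRun >= startFrames) {
            inSpeech = true;
            return Event.SPEECH_START;
        }
        if (inSpeech && silentRun >= endFrames) {
            inSpeech = false;
            return Event.SPEECH_END;
        }
        return Event.NONE;
    }
}
```

In a pipeline like the one described, frames between `SPEECH_START` and `SPEECH_END` would be the ones streamed to the ASR service; requiring several frames before switching state avoids reacting to single noisy frames.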

3.4 DTMF (Key Press) Recognition

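Telephone key presses are signaled with DTMF, which represents each key as a pair of tones (one low-frequency, one high-frequency). Under RFC 2833 (since superseded by RFC 4733), key presses travel as telephone-event payloads inside RTP packets rather than as in-band audio; the robot decodes these events to capture user selections.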
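Assuming key presses arrive as RFC 2833 telephone-event payloads, each RTP payload is four bytes: an event code, an end flag plus volume, and a 16-bit duration. The sketch below decodes that layout; the class name is illustrative.

```java
/** Decodes a 4-byte RFC 2833 telephone-event payload. Illustrative class name. */
final class TelephoneEvent {
    // Event codes 0-15 map to these keys in order.
    private static final String KEYS = "0123456789*#ABCD";

    final char key;
    final boolean end;  // E bit: set on the packet(s) that close the event
    final int volume;   // 6 bits, tone power expressed as -dBm0
    final int duration; // 16 bits, in RTP timestamp units

    private TelephoneEvent(char key, boolean end, int volume, int duration) {
        this.key = key;
        this.end = end;
        this.volume = volume;
        this.duration = duration;
    }

    static TelephoneEvent parse(byte[] payload) {
        if (payload.length < 4) {
            throw new IllegalArgumentException("telephone-event payload needs 4 bytes");
        }
        int event = payload[0] & 0xFF;
        if (event >= KEYS.length()) {
            throw new IllegalArgumentException("not a DTMF key event: " + event);
        }
        int flags = payload[1] & 0xFF;
        int duration = ((payload[2] & 0xFF) << 8) | (payload[3] & 0xFF);
        return new TelephoneEvent(KEYS.charAt(event), (flags & 0x80) != 0, flags & 0x3F, duration);
    }
}
```

A single key press spans several packets that share one RTP timestamp, and senders typically repeat the final packet, so a receiver would normally count a key once per timestamp rather than once per packet.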

3.5 Common Issues

Typical problems include noise (out-of-order RTP sequence numbers), audio glitches (an extra 44-byte WAV header left in synthesized speech), wrong playback speed (sampling-rate mismatch), and no audio (network faults, codec mismatch, or packet loss). Troubleshooting involves checking the network, capturing packets with tcpdump or Wireshark, and verifying the RTP header fields.
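For the extra-header problem, one possible guard is to drop the header before the synthesized audio is packetized. The sketch below assumes the TTS output uses the canonical 44-byte RIFF/WAVE header; files with additional chunks have longer headers and would need real chunk parsing. The class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Strips a canonical 44-byte WAV header so only raw samples are sent. Hypothetical helper. */
final class WavHeaderStripper {
    static final int CANONICAL_HEADER_BYTES = 44;

    /** Returns the audio without its header, or the input unchanged if no header is found. */
    static byte[] stripHeader(byte[] audio) {
        if (audio.length >= CANONICAL_HEADER_BYTES
                && hasTag(audio, 0, "RIFF")
                && hasTag(audio, 8, "WAVE")) {
            return Arrays.copyOfRange(audio, CANONICAL_HEADER_BYTES, audio.length);
        }
        return audio; // already raw samples
    }

    private static boolean hasTag(byte[] data, int offset, String tag) {
        byte[] expected = tag.getBytes(StandardCharsets.US_ASCII);
        for (int i = 0; i < expected.length; i++) {
            if (data[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }
}
```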

4. Algorithm Sub‑Services

Initially, all logic lived in a monolithic service, which caused tight coupling and high maintenance cost. The team refactored it into microservices: a core service orchestrates sub-services, each responsible for one algorithm, so that each can be iterated independently and A/B tested via the 日晷 (Sundial) platform.
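The article does not describe how the A/B platform assigns traffic. As a generic sketch of the idea, a core service could route a fixed percentage of calls to an experimental sub-service version by hashing a stable identifier; `AbRouter` and its parameters are hypothetical and do not represent the platform's API.

```java
/** Deterministic percentage-based A/B bucketing. Hypothetical, not the platform's API. */
final class AbRouter {
    /** Returns true when the call with this ID should use the experimental variant. */
    static boolean useExperiment(String callId, int experimentPercent) {
        if (experimentPercent < 0 || experimentPercent > 100) {
            throw new IllegalArgumentException("experimentPercent must be within 0..100");
        }
        // floorMod keeps the bucket in 0..99 even when hashCode() is negative.
        int bucket = Math.floorMod(callId.hashCode(), 100);
        return bucket < experimentPercent;
    }
}
```

Hashing the call ID rather than drawing a random number keeps every turn of one conversation on the same variant.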

4.1 SIP Scheduling Optimization

The SIP scheduler selects usable SIP numbers for calls. Empirical findings:

Calls between numbers in the same locality have higher answer rates.

Numbers with higher usage receive more complaints, reducing answer rates.

Optimizations applied:

Group numbers by business line and caller city to narrow the candidate set.

Compute a weight based on distance and usage count; prioritize higher‑weight numbers.

Pre‑compute city‑to‑city distances and parallelize selection to improve performance.
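The optimizations above can be sketched as a weighted selection over a pre-filtered candidate list. The article does not give the weight formula, so the one below (closer and less-used numbers score higher) and the two tuning factors are assumptions; `distanceKm` stands for a value looked up from the pre-computed city-to-city table.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Picks the highest-weight SIP number from a candidate group. Formula is an assumption. */
final class SipNumberSelector {
    static final class Candidate {
        final String number;
        final double distanceKm; // pre-computed distance between number city and callee city
        final int usageCount;    // how often this number has been used recently

        Candidate(String number, double distanceKm, int usageCount) {
            this.number = number;
            this.distanceKm = distanceKm;
            this.usageCount = usageCount;
        }
    }

    /** Hypothetical weight: decreases as distance or usage grows. */
    static double weight(Candidate c, double distanceFactor, double usageFactor) {
        return 1.0 / (1.0 + distanceFactor * c.distanceKm + usageFactor * c.usageCount);
    }

    /** Scores candidates in parallel and returns the best one, if any. */
    static Optional<Candidate> pick(List<Candidate> candidates,
                                    double distanceFactor, double usageFactor) {
        return candidates.parallelStream()
                .max(Comparator.comparingDouble(
                        (Candidate c) -> weight(c, distanceFactor, usageFactor)));
    }
}
```

Grouping by business line and city would happen before `pick` is called, so the parallel scan only touches the narrowed candidate set.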

5. System Optimization

The team tuned Java's built-in thread pool (ThreadPoolExecutor) for mixed I/O-bound and CPU-bound workloads. By combining WConfig, WMonitor, and insights from the JDK source code, they made thread-pool parameters adjustable at runtime, which reduced latency and allowed capacity to scale cost-effectively.
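Runtime adjustment of this kind can rely on the setters that `ThreadPoolExecutor` already exposes. The sketch below shows only the resizing step; how new values arrive (for example, from a configuration-change callback) is left out, and `DynamicPoolTuner` is a hypothetical name rather than part of WConfig or the article's code.

```java
import java.util.concurrent.ThreadPoolExecutor;

/** Applies new core/max sizes to a live pool in a safe order. Hypothetical helper. */
final class DynamicPoolTuner {
    static void resize(ThreadPoolExecutor pool, int newCore, int newMax) {
        if (newCore < 0 || newMax <= 0 || newCore > newMax) {
            throw new IllegalArgumentException("require 0 <= core <= max and max > 0");
        }
        // The setters reject a state where core > max, so the order depends on direction.
        if (newMax >= pool.getMaximumPoolSize()) {
            pool.setMaximumPoolSize(newMax); // growing: raise the ceiling first
            pool.setCorePoolSize(newCore);
        } else {
            pool.setCorePoolSize(newCore);   // shrinking: lower the core first
            pool.setMaximumPoolSize(newMax);
        }
    }
}
```

Raising the core size can start new threads immediately if tasks are queued, while lowering it lets excess threads exit as they become idle, so a resize takes effect without restarting the service.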

For the full PPT, follow the 58 Technology public account and request the material via the assistant’s WeChat (jishu‑58) with the keyword “58同城智能语音机器人后端架构解析”.
