Analysis of 58.com Intelligent Voice Robot Backend Architecture

The article reviews the design and implementation of 58.com’s intelligent voice robot backend, detailing its four‑layer architecture, SIP/SDP/RTP protocols, multi‑round voice interaction flow, algorithm service modularization, SIP scheduling optimizations, and Java thread‑pool system tuning for high‑concurrency scenarios.

58 Tech

Sep 16, 2020

Based on voice semantics technology, intelligent voice robots can replace or assist humans in routine tasks, improving efficiency and revenue across enterprise scenarios. This talk analyzes the self‑developed backend architecture of 58.com’s voice robot, focusing on telephone communication using the JAIN‑SIP library and streaming speech‑recognition APIs, and shares technical details on concurrency control, scheduling strategies, and performance optimization.

1. Introduction

The robot leverages speech recognition, semantic understanding, and speech synthesis to enable multi‑turn dialogues, simulating human conversation and handling routine work in sales, customer service, and product operations.

2. Overall Architecture

The backend is abstracted into four layers: the access layer (web portal and UI), the business-logic layer (an access service, a core service for voice I/O and dialogue management, and algorithm sub-services), the data-storage layer (dialogue logs and recordings), and third-party services.

3. Enabling Human‑Machine Voice Interaction

3.1 Protocol Overview

The system uses three key protocols:

SIP (Session Initiation Protocol): establishes, modifies, and terminates multimedia sessions; response status codes drive the call flow (e.g., 100 Trying, 180 Ringing or 183 Session Progress, 200 OK).

SDP (Session Description Protocol): describes the media parameters to be negotiated, such as the codec (PCMU/8000) and the transport addresses.

RTP (Real-time Transport Protocol): carries the audio stream; each packet header includes fields such as payload type, sequence number, timestamp, and SSRC.
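The RTP fields listed above sit in a 12-byte fixed header defined by RFC 3550. The following is a minimal sketch of reading them in Java; the class name `RtpHeader` and the helper `isBefore` are illustrative and not taken from the 58.com code base. The sketch ignores CSRC entries and header extensions.

```java
import java.nio.ByteBuffer;

/** Minimal parser for the 12-byte fixed RTP header (RFC 3550). Illustrative only. */
final class RtpHeader {
    final int version;        // 2 bits, 2 for current RTP
    final boolean marker;     // 1 bit, meaning depends on the payload profile
    final int payloadType;    // 7 bits, 0 is PCMU in the static payload table
    final int sequenceNumber; // 16 bits, increments by one per packet sent
    final long timestamp;     // 32 bits, in sampling-clock units
    final long ssrc;          // 32 bits, identifies the synchronization source

    private RtpHeader(int version, boolean marker, int payloadType,
                      int sequenceNumber, long timestamp, long ssrc) {
        this.version = version;
        this.marker = marker;
        this.payloadType = payloadType;
        this.sequenceNumber = sequenceNumber;
        this.timestamp = timestamp;
        this.ssrc = ssrc;
    }

    static RtpHeader parse(byte[] packet) {
        if (packet.length < 12) {
            throw new IllegalArgumentException("RTP packet shorter than 12 bytes");
        }
        // ByteBuffer is big-endian by default, which matches network byte order.
        ByteBuffer buf = ByteBuffer.wrap(packet);
        int b0 = buf.get() & 0xFF;
        int b1 = buf.get() & 0xFF;
        int seq = buf.getShort() & 0xFFFF;
        long ts = buf.getInt() & 0xFFFFFFFFL;
        long ssrc = buf.getInt() & 0xFFFFFFFFL;
        return new RtpHeader(b0 >>> 6, (b1 & 0x80) != 0, b1 & 0x7F, seq, ts, ssrc);
    }

    /** True when sequence number a precedes b, allowing for 16-bit wraparound. */
    static boolean isBefore(int a, int b) {
        return a != b && ((b - a) & 0xFFFF) < 0x8000;
    }
}
```

A receiver could use `isBefore` to reorder packets in a small buffer before playback, since sequence numbers wrap from 65535 back to 0.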

3.2 Connection Establishment

Using SIP, the robot initiates outbound calls, sends SIP signaling with SDP media description, receives carrier responses, and establishes a bidirectional audio channel for voice packet exchange.
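To make the SDP media description concrete, here is a hedged sketch of the body an outbound INVITE might carry for a PCMU/8000 audio stream. The class `SdpOffer`, its parameters, and the choice of dynamic payload type 101 for telephone events are assumptions for illustration; the article does not show the actual offer 58.com sends, and in practice the body would be attached to a request built with the JAIN-SIP library.

```java
/** Builds a minimal SDP offer for one audio stream. Illustrative only. */
final class SdpOffer {
    private SdpOffer() {}

    /**
     * @param localIp   address where the robot expects to receive RTP (assumed IPv4)
     * @param rtpPort   local UDP port for the audio stream
     * @param sessionId any identifier unique to this session
     */
    static String build(String localIp, int rtpPort, long sessionId) {
        String crlf = "\r\n"; // SDP lines are terminated by CRLF
        return String.join(crlf,
                "v=0",
                "o=- " + sessionId + " " + sessionId + " IN IP4 " + localIp,
                "s=-",
                "c=IN IP4 " + localIp,
                "t=0 0",
                // Offer payload type 0 (PCMU) plus 101 for key-press events.
                "m=audio " + rtpPort + " RTP/AVP 0 101",
                "a=rtpmap:0 PCMU/8000",
                "a=rtpmap:101 telephone-event/8000",
                "a=fmtp:101 0-15",
                "a=sendrecv") + crlf;
    }
}
```

The carrier's answer names the codec it accepts and the address and port to which the robot should send its own RTP packets, which completes the bidirectional channel.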

3.3 Voice Interaction Flow

Incoming audio is buffered in a sliding window; a VAD algorithm detects speech start/end. Speech is streamed to a real‑time ASR service, producing text that is fed to an intent‑recognition module. Based on the intent, the robot selects a strategy (e.g., DTMF handling) and generates a response via TTS.
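The article does not say which VAD algorithm is used. As a stand-in, the sketch below uses a simple energy threshold over fixed-size frames of decoded 16-bit PCM, with a run-length counter playing the role of the sliding window. The class `EnergyVad`, the threshold, and the frame counts are all hypothetical.

```java
/** Energy-threshold voice activity detector. A simplified stand-in, not the article's algorithm. */
final class EnergyVad {
    enum Event { NONE, SPEECH_START, SPEECH_END }

    private final double rmsThreshold; // frames at or above this RMS count as loud
    private final int startFrames;     // consecutive loud frames needed to open a segment
    private final int endFrames;       // consecutive quiet frames needed to close it
    private boolean inSpeech;
    private int run;                   // length of the current loud (or quiet) run

    EnergyVad(double rmsThreshold, int startFrames, int endFrames) {
        this.rmsThreshold = rmsThreshold;
        this.startFrames = startFrames;
        this.endFrames = endFrames;
    }

    /** Root-mean-square amplitude of one frame of 16-bit samples. */
    static double rms(short[] frame) {
        if (frame.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (short s : frame) {
            sum += (double) s * s;
        }
        return Math.sqrt(sum / frame.length);
    }

    /** Feeds one frame and reports whether a speech segment started or ended. */
    Event accept(short[] frame) {
        boolean loud = rms(frame) >= rmsThreshold;
        if (!inSpeech) {
            run = loud ? run + 1 : 0;
            if (run >= startFrames) {
                inSpeech = true;
                run = 0;
                return Event.SPEECH_START;
            }
        } else {
            run = loud ? 0 : run + 1;
            if (run >= endFrames) {
                inSpeech = false;
                run = 0;
                return Event.SPEECH_END;
            }
        }
        return Event.NONE;
    }
}
```

In a pipeline like the one described, frames between `SPEECH_START` and `SPEECH_END` would be streamed to the ASR service, and `SPEECH_END` would signal that the utterance is complete.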

3.4 DTMF (Key Press) Recognition

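Telephone key presses are signalled as DTMF, which maps each key to a pair of tones (one low-frequency and one high-frequency). Under RFC 2833, the key is carried as a telephone-event payload in the RTP stream rather than as in-band audio, and the robot decodes these events to capture user selections.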
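An RFC 2833 telephone-event payload is four bytes: the event code, a byte holding the end flag and volume, and a 16-bit duration. The sketch below decodes that layout; the class name `DtmfEvent` is illustrative, and the RTP payload type that marks a packet as a telephone event is whatever was agreed in the SDP exchange.

```java
/** Decodes one RFC 2833 telephone-event payload. Illustrative only. */
final class DtmfEvent {
    // Event codes 0-9 are digits, 10 is '*', 11 is '#', 12-15 are A-D.
    private static final String KEYS = "0123456789*#ABCD";

    final char key;
    final boolean end;   // set on the packets that close the key press
    final int volume;    // tone power level, 6 bits
    final int duration;  // in timestamp units (1/8000 s at an 8000 Hz clock)

    private DtmfEvent(char key, boolean end, int volume, int duration) {
        this.key = key;
        this.end = end;
        this.volume = volume;
        this.duration = duration;
    }

    static DtmfEvent parse(byte[] payload) {
        if (payload.length < 4) {
            throw new IllegalArgumentException("telephone-event payload shorter than 4 bytes");
        }
        int event = payload[0] & 0xFF;
        if (event >= KEYS.length()) {
            throw new IllegalArgumentException("not a DTMF event code: " + event);
        }
        boolean end = (payload[1] & 0x80) != 0;   // E bit
        int volume = payload[1] & 0x3F;           // low 6 bits, after E and R
        int duration = ((payload[2] & 0xFF) << 8) | (payload[3] & 0xFF);
        return new DtmfEvent(KEYS.charAt(event), end, volume, duration);
    }
}
```

A single key press spans several packets that share one RTP timestamp, and senders typically repeat the final packet, so a caller would count a key once per timestamp, for example when the first packet with `end` set arrives.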

3.5 Common Issues

Typical problems include noise (out-of-order RTP sequence numbers), audio clipping (an extra 44-byte header, the size of a canonical WAV header, left in the synthesized speech), speech that plays too fast or too slow (sampling-rate mismatch), and no audio at all (network, codec, or packet-loss problems). Troubleshooting involves network checks, packet capture with tcpdump or Wireshark, and verification of the RTP header fields.
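For the extra-header problem, one plausible fix is to strip the header before the synthesized audio is packetized. The sketch below assumes the TTS output is a canonical WAV file whose header is exactly 44 bytes; files with additional chunks have longer headers and would need a real chunk parser. The class `WavHeaderStripper` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Removes a canonical 44-byte WAV header so only raw samples are sent. Illustrative only. */
final class WavHeaderStripper {
    static final int CANONICAL_HEADER_BYTES = 44;

    private WavHeaderStripper() {}

    /** Returns the audio without its header, or the input unchanged if no header is found. */
    static byte[] stripHeader(byte[] audio) {
        boolean looksLikeWav = audio.length >= CANONICAL_HEADER_BYTES
                && tagAt(audio, 0, "RIFF")
                && tagAt(audio, 8, "WAVE");
        if (!looksLikeWav) {
            return audio;
        }
        return Arrays.copyOfRange(audio, CANONICAL_HEADER_BYTES, audio.length);
    }

    private static boolean tagAt(byte[] data, int offset, String tag) {
        byte[] expected = tag.getBytes(StandardCharsets.US_ASCII);
        for (int i = 0; i < expected.length; i++) {
            if (data[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }
}
```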

4. Algorithm Sub‑Services

Initially, all logic lived in a monolithic service, causing high coupling and maintenance cost. The team refactored into micro‑services: a core service orchestrates sub‑services, each handling a specific algorithm, enabling independent iteration and A/B testing via the 日晷 platform.
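One way a core service might route a call to one of two versions of the same sub-service during an A/B test is sketched below. This is an assumption about the general technique, not a description of how the 日晷 platform works; `AbRouter`, the call-ID hashing, and the percentage split are all hypothetical.

```java
import java.util.function.Function;

/** Routes each call to a control or experiment implementation of one sub-service. */
final class AbRouter<I, O> {
    private final Function<I, O> control;
    private final Function<I, O> experiment;
    private final int experimentPercent; // share of calls sent to the experiment, 0 to 100

    AbRouter(Function<I, O> control, Function<I, O> experiment, int experimentPercent) {
        if (experimentPercent < 0 || experimentPercent > 100) {
            throw new IllegalArgumentException("experimentPercent must be between 0 and 100");
        }
        this.control = control;
        this.experiment = experiment;
        this.experimentPercent = experimentPercent;
    }

    /** Deterministic bucketing: the same call ID always lands in the same group. */
    boolean inExperiment(String callId) {
        return Math.floorMod(callId.hashCode(), 100) < experimentPercent;
    }

    O handle(String callId, I input) {
        return (inExperiment(callId) ? experiment : control).apply(input);
    }
}
```

Because each sub-service sits behind a narrow interface like this, a new algorithm version can be deployed and compared without touching the core service's dialogue logic.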

4.1 SIP Scheduling Optimization

The SIP scheduler selects usable SIP numbers for calls. Empirical findings:

Calls between numbers in the same locality have higher answer rates.

Numbers with higher usage receive more complaints, reducing answer rates.

Optimizations applied:

Group numbers by business line and caller city to narrow the candidate set.

Compute a weight based on distance and usage count; prioritize higher‑weight numbers.

Pre‑compute city‑to‑city distances and parallelize selection to improve performance.
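The weighting step can be sketched as follows. The article states only that the weight depends on distance and usage count, so the formula here is an invented example in which closer and less-used numbers score higher. `SipNumberSelector`, the `"from|to"` key format of the distance table, and the 5000 km fallback are assumptions.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Picks the highest-weight SIP number from a pre-filtered candidate group. Illustrative only. */
final class SipNumberSelector {
    static final class SipNumber {
        final String number;
        final String city;
        final int usageCount;

        SipNumber(String number, String city, int usageCount) {
            this.number = number;
            this.city = city;
            this.usageCount = usageCount;
        }
    }

    private static final double UNKNOWN_DISTANCE_KM = 5000.0; // assumed fallback

    private final Map<String, Double> distanceKm; // precomputed, keyed "fromCity|toCity"

    SipNumberSelector(Map<String, Double> distanceKm) {
        this.distanceKm = distanceKm;
    }

    /** Hypothetical weight: 1.0 for an unused same-city number, falling with distance and usage. */
    double weight(SipNumber n, String targetCity) {
        double km = n.city.equals(targetCity)
                ? 0.0
                : distanceKm.getOrDefault(n.city + "|" + targetCity, UNKNOWN_DISTANCE_KM);
        return 1.0 / ((1.0 + km / 100.0) * (1.0 + n.usageCount));
    }

    /** Scores candidates in parallel and returns the best one, if any. */
    Optional<SipNumber> select(List<SipNumber> candidates, String targetCity) {
        return candidates.parallelStream()
                .max(Comparator.comparingDouble((SipNumber n) -> weight(n, targetCity)));
    }
}
```

The candidate list passed to `select` is assumed to be the already-narrowed group for one business line and city, so the parallel scoring only pays off when that group is still large.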

5. System Optimization

The team tuned Java’s native thread pool for mixed I/O‑ and CPU‑bound workloads. By integrating WConfig, WMonitor, and JDK source insights, they enabled dynamic thread‑pool parameter adjustment, achieving noticeable latency reduction and cost‑effective scaling.
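Dynamic adjustment is possible because `java.util.concurrent.ThreadPoolExecutor` exposes `setCorePoolSize` and `setMaximumPoolSize` at runtime. The sketch below shows a resize helper that a configuration-change callback could invoke; the class `DynamicPool` is hypothetical, and how WConfig delivers new values is not covered in the article.

```java
import java.util.concurrent.ThreadPoolExecutor;

/** Applies new pool sizes to a live ThreadPoolExecutor. Illustrative only. */
final class DynamicPool {
    private DynamicPool() {}

    static void resize(ThreadPoolExecutor pool, int newCore, int newMax) {
        if (newCore < 0 || newMax <= 0 || newCore > newMax) {
            throw new IllegalArgumentException("require 0 <= core <= max and max > 0");
        }
        // The setters reject states where core would exceed max (setCorePoolSize
        // checks this on newer JDKs), so the order of the two calls depends on
        // whether the ceiling is moving up or down.
        if (newMax >= pool.getMaximumPoolSize()) {
            pool.setMaximumPoolSize(newMax); // raise the ceiling first
            pool.setCorePoolSize(newCore);
        } else {
            pool.setCorePoolSize(newCore);   // lower the floor first
            pool.setMaximumPoolSize(newMax);
        }
    }
}
```

Note that the maximum size only matters when the work queue is bounded: with an unbounded queue the executor never grows beyond the core size, so queue capacity has to be chosen together with these two parameters.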

For the full PPT, follow the 58 Technology public account and request the material via the assistant’s WeChat (jishu‑58) with the keyword “58同城智能语音机器人后端架构解析”.
