Implementation of Voice Call Functionality in an Intelligent Voice Robot

This article details the architecture and implementation of the voice call module of an intelligent voice robot. It covers SIP signaling setup, RTP session handling, audio encoding/decoding, sampling, and packetization, which together enable automated outbound calls and multi‑turn voice interactions.

58 Tech

May 28, 2019

The intelligent voice robot, developed by the AI Lab of 58 Group's TEG platform, provides automated dialing, multi‑turn voice interaction, and intent recognition for scenarios such as sales calls, service promotion, and notifications. Its overall architecture includes access, editing/operation, logic, core services, and a web access platform, with this article focusing on the core services layer's voice call module.

The voice call capability enables the robot to automatically dial numbers, detect call states (answered, busy, no answer, etc.), and establish SIP signaling via a SIP proxy server. The signaling side consists of the SIP INVITE sender, event listeners, and microphone data threads, while the SIP voice interaction layer handles the send/receive queues, voice transmission control, sampling, and codec processing.

During a call, the robot selects an opening phrase from a script library, places it into the send queue, resamples it to the required rate, encodes it to the codec required by the SIP provider, and transmits it via the proxy. Incoming audio is decoded, resampled to 16 kHz PCM, and placed into the receive queue for segmentation, speech recognition, intent detection, and response generation.

SIP signaling follows the standard flow: the robot sends an INVITE with SDP to the proxy, receives 100 Trying, then 180 Ringing, and finally 200 OK once the callee answers. The robot then sends ACK, after which an RTP session carries the media in both directions. Either side may terminate the call with a BYE request.
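The call states mentioned earlier (answered, busy, no answer) can be derived from the SIP status codes in this flow. The sketch below is plain Java and does not use the JAIN SIP API; the class name CallStateTracker, the state names, and the exact mapping of codes to states are assumptions made for illustration, not the robot's actual logic.

```java
/** Hypothetical tracker that maps SIP responses to coarse call states. */
final class CallStateTracker {
    enum CallState { INVITE_SENT, TRYING, RINGING, ANSWERED, BUSY, NO_ANSWER, FAILED, ENDED }

    private CallState state = CallState.INVITE_SENT;

    /** Feed the status code of each SIP response received for the INVITE. */
    CallState onResponse(int statusCode) {
        if (isSettled()) {
            return state; // ignore retransmissions once the outcome is known
        }
        if (statusCode == 100) {
            state = CallState.TRYING;            // 100 Trying
        } else if (statusCode == 180) {
            state = CallState.RINGING;           // 180 Ringing
        } else if (statusCode >= 200 && statusCode < 300) {
            state = CallState.ANSWERED;          // 200 OK: the caller must now send ACK
        } else if (statusCode == 486 || statusCode == 600) {
            state = CallState.BUSY;              // 486 Busy Here, 600 Busy Everywhere
        } else if (statusCode == 408 || statusCode == 480) {
            state = CallState.NO_ANSWER;         // 408 Request Timeout, 480 Temporarily Unavailable
        } else if (statusCode >= 300) {
            state = CallState.FAILED;            // any other final non-2xx response
        }
        return state;
    }

    /** Either side sent BYE on an answered call. */
    CallState onBye() {
        if (state == CallState.ANSWERED) {
            state = CallState.ENDED;
        }
        return state;
    }

    private boolean isSettled() {
        return state == CallState.ANSWERED || state == CallState.BUSY
                || state == CallState.NO_ANSWER || state == CallState.FAILED
                || state == CallState.ENDED;
    }
}
```

In a real deployment the status codes would come from the signaling library's response callbacks, and providers may signal busy or unanswered calls with other codes, so the mapping would likely need tuning per provider.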

The robot uses the JAIN SIP library for signaling. An example SDP included in the INVITE describes local IP, ports, supported codecs, and sampling rates.
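The article's own SDP example is not reproduced here. As a rough illustration of what such an offer contains, the sketch below builds a minimal SDP body as a string. The class name SdpOffer, the user name "robot", the session name, the IP address, the port, and the choice of dynamic payload type 96 for Opus are all assumptions; only the static payload types (0 for PCMU, 8 for PCMA, 18 for G729) and the "opus/48000/2" mapping are standard.

```java
/** Hypothetical helper that builds a minimal audio-only SDP offer. */
final class SdpOffer {
    static String build(String localIp, int rtpPort, long sessionId) {
        return String.join("\r\n",
                "v=0",
                // origin: user, session id, session version, network type, address
                "o=robot " + sessionId + " " + sessionId + " IN IP4 " + localIp,
                "s=voice-call",
                "c=IN IP4 " + localIp,          // where the robot expects to receive RTP
                "t=0 0",
                // offered codecs in preference order, identified by RTP payload type
                "m=audio " + rtpPort + " RTP/AVP 0 8 18 96",
                "a=rtpmap:0 PCMU/8000",         // G.711 mu-law
                "a=rtpmap:8 PCMA/8000",         // G.711 A-law
                "a=rtpmap:18 G729/8000",
                "a=rtpmap:96 opus/48000/2",     // dynamic payload type, assumed to be 96 here
                "a=ptime:20",                   // 20 ms of audio per RTP packet
                "a=sendrecv") + "\r\n";
    }
}
```

For example, `SdpOffer.build("192.0.2.10", 40000, 1L)` yields a body that could be attached to an INVITE with content type application/sdp. The provider's answer then selects one of the offered codecs.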

Voice transmission requires packetizing PCM frames. The buffer size is calculated as:

PCM Buffer size = sampling_rate * sampling_time * bit_depth / 8 * channel_count (Bytes)

With a 20 ms frame, 16 kHz sampling, 16‑bit depth, and mono channel, each frame is 640 Bytes, sent every 20 ms.
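The calculation above can be sketched in a few lines of Java, together with a helper that cuts a PCM buffer into fixed-size frames. The class and method names are hypothetical, and padding the last frame with zeros (silence) is an assumption rather than something the article specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for sizing and splitting PCM frames. */
final class PcmFramer {
    /** Bytes per frame = sampling_rate * sampling_time * bit_depth / 8 * channel_count. */
    static int frameBytes(int sampleRateHz, int frameMillis, int bitDepth, int channels) {
        int samplesPerFrame = sampleRateHz * frameMillis / 1000;
        return samplesPerFrame * bitDepth / 8 * channels;
    }

    /** Splits raw PCM into frames of frameBytes each; the last frame is zero-padded. */
    static List<byte[]> split(byte[] pcm, int frameBytes) {
        List<byte[]> frames = new ArrayList<>();
        for (int offset = 0; offset < pcm.length; offset += frameBytes) {
            int end = Math.min(offset + frameBytes, pcm.length);
            byte[] frame = Arrays.copyOfRange(pcm, offset, end);
            // copyOf pads with zeros, which is silence for signed 16-bit PCM
            frames.add(Arrays.copyOf(frame, frameBytes));
        }
        return frames;
    }
}
```

With these helpers, `PcmFramer.frameBytes(16000, 20, 16, 1)` gives the 640 bytes stated above, and the same call with 8000 Hz gives 320 bytes. A sender would then emit one frame from the list every 20 ms.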

The robot supports multiple codecs (Opus, G.711, G.729) to accommodate various SIP providers, performing up‑sampling or down‑sampling as needed. Opus offers a wide bitrate range (6 kbps–510 kbps) and sample rates (8 kHz–48 kHz); G.711 provides good quality at 64 kbps; G.729 achieves high compression (8 kbps) with acceptable quality.
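To make the codec step concrete, here is a minimal sketch of G.711 encoding and decoding in its mu-law (PCMU) variant. The article does not say which G.711 variant the robot uses or how its codecs are implemented, so the class name G711Ulaw and the choice of mu-law are assumptions. Each 16-bit sample becomes one byte, which at 8000 samples per second gives the 64 kbps figure.

```java
/** Hypothetical G.711 mu-law codec: one byte per 16-bit PCM sample. */
final class G711Ulaw {
    private static final int BIAS = 0x84;   // 132, added before finding the segment
    private static final int CLIP = 32635;  // largest magnitude that fits after biasing

    static byte encode(short pcm) {
        int sign = (pcm >> 8) & 0x80;        // 0x80 for negative samples
        int magnitude = pcm < 0 ? -pcm : pcm;
        if (magnitude > CLIP) {
            magnitude = CLIP;
        }
        magnitude += BIAS;
        // segment number = position of the highest set bit, relative to bit 7
        int exponent = 7;
        for (int mask = 0x4000; (magnitude & mask) == 0 && exponent > 0; mask >>= 1) {
            exponent--;
        }
        int mantissa = (magnitude >> (exponent + 3)) & 0x0F;
        return (byte) ~(sign | (exponent << 4) | mantissa); // mu-law bytes are stored inverted
    }

    static short decode(byte ulaw) {
        int v = ~ulaw & 0xFF;
        int sign = v & 0x80;
        int exponent = (v >> 4) & 0x07;
        int mantissa = v & 0x0F;
        int magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS;
        return (short) (sign != 0 ? -magnitude : magnitude);
    }

    /** Encodes a whole frame, e.g. 160 samples (20 ms at 8 kHz) into 160 bytes. */
    static byte[] encodeFrame(short[] samples) {
        byte[] out = new byte[samples.length];
        for (int i = 0; i < samples.length; i++) {
            out[i] = encode(samples[i]);
        }
        return out;
    }
}
```

Because mu-law is logarithmic, decoding does not return the exact input: small samples are reproduced closely and large ones more coarsely.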

All audio inside the robot is standardized to 16 kHz, 16‑bit PCM. When a provider uses 8 kHz, the robot down‑samples outgoing audio and up‑samples incoming audio. An anti‑aliasing low‑pass filter is applied before down‑sampling, and an anti‑imaging low‑pass filter is applied after up‑sampling.
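The conversion between 16 kHz and 8 kHz can be sketched as a factor-of-two resampler built around one low-pass filter with a 4 kHz cutoff. The filter used here (a 31-tap Hamming-windowed sinc) and the class name Resampler2x are assumptions chosen for brevity; the article does not describe the robot's actual filter design.

```java
/** Hypothetical 2x resampler: 16 kHz to 8 kHz and back, using one shared FIR low-pass. */
final class Resampler2x {
    private static final int TAPS = 31;              // odd, so the filter has a center tap
    private static final int MID = (TAPS - 1) / 2;
    private static final double[] H = designLowPass();

    /** Hamming-windowed sinc with cutoff at one quarter of the 16 kHz rate (4 kHz). */
    private static double[] designLowPass() {
        double cutoff = 0.25;
        double[] h = new double[TAPS];
        double sum = 0;
        for (int n = 0; n < TAPS; n++) {
            int k = n - MID;
            double sinc = (k == 0) ? 2 * cutoff
                    : Math.sin(2 * Math.PI * cutoff * k) / (Math.PI * k);
            double hamming = 0.54 - 0.46 * Math.cos(2 * Math.PI * n / (TAPS - 1));
            h[n] = sinc * hamming;
            sum += h[n];
        }
        for (int n = 0; n < TAPS; n++) {
            h[n] /= sum;                             // unity gain at DC
        }
        return h;
    }

    /** Filter output centered on sample i; samples outside the buffer count as zero. */
    private static double filterAt(double[] x, int i) {
        double acc = 0;
        for (int t = 0; t < TAPS; t++) {
            int j = i + MID - t;
            if (j >= 0 && j < x.length) {
                acc += H[t] * x[j];
            }
        }
        return acc;
    }

    private static short clamp(double v) {
        return (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, Math.round(v)));
    }

    /** Anti-aliasing filter first, then keep every second sample. */
    static short[] downsample16kTo8k(short[] in) {
        double[] x = new double[in.length];
        for (int i = 0; i < in.length; i++) {
            x[i] = in[i];
        }
        short[] out = new short[in.length / 2];
        for (int m = 0; m < out.length; m++) {
            out[m] = clamp(filterAt(x, 2 * m));
        }
        return out;
    }

    /** Insert a zero after every sample, then apply the anti-imaging filter with gain 2. */
    static short[] upsample8kTo16k(short[] in) {
        double[] x = new double[in.length * 2];
        for (int i = 0; i < in.length; i++) {
            x[2 * i] = in[i];
        }
        short[] out = new short[x.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = clamp(2.0 * filterAt(x, i));
        }
        return out;
    }
}
```

This sketch processes each buffer independently, so the first and last few samples of every buffer are attenuated. A streaming implementation would carry filter state across frames, and a production system would more likely rely on a tested resampling library.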

In production, the voice robot is already handling tens of thousands of outbound calls daily for recruitment and directory services, with plans to expand to customer notifications and resume verification. The article concludes that voice calling is a foundational capability that will continue to be optimized for broader SIP provider compatibility and improved call quality.
