
Implementation of Voice Call Functionality in an Intelligent Voice Robot

This article details the architecture and implementation of the voice call module of an intelligent voice robot, covering SIP signaling establishment, RTP session handling, audio encoding/decoding, sampling, and packetization to enable automated outbound calls and multi‑turn voice interactions.

58 Tech

The intelligent voice robot, developed by the AI Lab of 58 Group's TEG platform, provides automated dialing, multi‑turn voice interaction, and intent recognition for scenarios such as sales calls, service promotion, and notifications. Its overall architecture includes access, editing/operation, logic, core services, and a web access platform, with this article focusing on the core services layer's voice call module.

The voice call capability enables the robot to automatically dial numbers, detect call states (answered, busy, no answer, etc.), and establish SIP signaling via a SIP proxy server. Signaling is managed by the SIP INVITE logic, listeners, and microphone data threads, while the SIP voice interaction layer handles the send/receive queues, voice transmission control, sampling, and codec processing.

During a call, the robot selects an opening phrase from a script library, places it into the send queue, resamples it to the required rate, encodes it to the codec required by the SIP provider, and transmits it via the proxy. Incoming audio is decoded, resampled to 16 kHz PCM, and placed into the receive queue for segmentation, speech recognition, intent detection, and response generation.

SIP signaling follows the standard flow: the robot sends an INVITE with SDP to the proxy, receives 100 Trying, then 180 Ringing, and finally 200 OK from the callee. After sending ACK, an RTP session is established for media exchange, and either side may terminate the call with a BYE request.
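The signaling flow above can be modeled as a small state machine. This is a minimal sketch, not the robot's actual code: the state names, event strings, and the `advance` and `run` helpers are hypothetical, and the 486 Busy Here branch is added only to illustrate how a "busy" call state might be detected.

```python
from enum import Enum


class CallState(Enum):
    IDLE = "idle"
    CALLING = "calling"        # INVITE sent, nothing received yet
    PROCEEDING = "proceeding"  # 100 Trying received
    RINGING = "ringing"        # 180 Ringing received
    ANSWERED = "answered"      # 200 OK received, ACK not yet sent
    IN_CALL = "in_call"        # ACK sent, RTP media can flow
    ENDED = "ended"            # BYE exchanged or call rejected


# (current state, event) -> next state. Provisional responses are optional,
# so 180 and 200 are also accepted directly after the INVITE.
TRANSITIONS = {
    (CallState.IDLE, "send INVITE"): CallState.CALLING,
    (CallState.CALLING, "recv 100"): CallState.PROCEEDING,
    (CallState.CALLING, "recv 180"): CallState.RINGING,
    (CallState.CALLING, "recv 200"): CallState.ANSWERED,
    (CallState.PROCEEDING, "recv 180"): CallState.RINGING,
    (CallState.PROCEEDING, "recv 200"): CallState.ANSWERED,
    (CallState.PROCEEDING, "recv 486"): CallState.ENDED,  # callee busy
    (CallState.RINGING, "recv 200"): CallState.ANSWERED,
    (CallState.RINGING, "recv 486"): CallState.ENDED,
    (CallState.ANSWERED, "send ACK"): CallState.IN_CALL,
    (CallState.IN_CALL, "BYE"): CallState.ENDED,  # either side may hang up
}


def advance(state, event):
    """Return the next state, or raise if the event is not valid here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"unexpected event {event!r} in state {state.name}")


def run(events):
    """Replay a list of events starting from IDLE and return the final state."""
    state = CallState.IDLE
    for event in events:
        state = advance(state, event)
    return state
```

Replaying `["send INVITE", "recv 100", "recv 180", "recv 200", "send ACK"]` through `run` ends in `IN_CALL`, the point at which the RTP session is available for media.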

The robot uses the JAIN SIP library for signaling. The SDP carried in the INVITE describes the local IP, ports, supported codecs, and sampling rates.
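To make the SDP contents concrete, here is a minimal sketch that assembles an audio offer as plain text. It does not use JAIN SIP, and it is not the robot's actual offer: the `build_sdp` helper, the `robot` user name, the IP address `192.0.2.10`, and port `40000` are all illustrative assumptions. The payload type numbers 0 (PCMU), 8 (PCMA), and 18 (G729) are the static assignments from the RTP audio/video profile.

```python
def build_sdp(local_ip, rtp_port, codecs):
    """Build a minimal SDP audio offer.

    codecs is a list of (payload_type, encoding_name, clock_rate) tuples.
    """
    payload_types = " ".join(str(pt) for pt, _, _ in codecs)
    lines = [
        "v=0",                                  # protocol version
        f"o=robot 0 0 IN IP4 {local_ip}",       # origin (hypothetical user/ids)
        "s=-",                                  # session name (unused)
        f"c=IN IP4 {local_ip}",                 # where media should be sent
        "t=0 0",                                # unbounded session time
        f"m=audio {rtp_port} RTP/AVP {payload_types}",
    ]
    # One rtpmap line per codec: payload type -> encoding name / clock rate.
    lines += [f"a=rtpmap:{pt} {name}/{rate}" for pt, name, rate in codecs]
    lines.append("a=ptime:20")                  # 20 ms of audio per packet
    return "\r\n".join(lines) + "\r\n"          # SDP lines end with CRLF


offer = build_sdp(
    "192.0.2.10",
    40000,
    [(0, "PCMU", 8000), (8, "PCMA", 8000), (18, "G729", 8000)],
)
print(offer)
```

The callee's answer would normally pick one of the offered payload types, which then determines the codec and sampling rate used on the RTP session.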

Voice transmission requires packetizing PCM frames. The buffer size is calculated as:

PCM Buffer size = sampling_rate * sampling_time * bit_depth / 8 * channel_count (Bytes)

With a 20 ms frame, 16 kHz sampling, 16‑bit depth, and a mono channel, each frame is 16,000 × 0.02 × 16 / 8 × 1 = 640 bytes, and one frame is sent every 20 ms.
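The frame-size formula can be checked with a few lines of code. This is a small sketch under stated assumptions: the helper names `pcm_frame_bytes` and `split_frames` are hypothetical, and padding the final short frame with silence is one possible choice, not something the article specifies.

```python
def pcm_frame_bytes(sample_rate_hz, frame_ms, bit_depth, channels):
    """Size in bytes of one PCM frame, following the formula above."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * (bit_depth // 8) * channels


def split_frames(pcm, frame_bytes):
    """Yield fixed-size frames; the last frame is zero-padded (silence)."""
    for start in range(0, len(pcm), frame_bytes):
        frame = pcm[start:start + frame_bytes]
        if len(frame) < frame_bytes:
            frame = frame + bytes(frame_bytes - len(frame))
        yield frame


frame_size = pcm_frame_bytes(16000, 20, 16, 1)   # 16 kHz, 20 ms, 16-bit, mono
one_second = bytes(16000 * 2)                    # one second of silent PCM
frames = list(split_frames(one_second, frame_size))
print(frame_size, len(frames))
```

At 8 kHz the same call gives 320 bytes per 20 ms frame, which is why the frame size has to be recomputed whenever the sampling rate changes.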

The robot supports multiple codecs (Opus, G.711, G.729) to accommodate various SIP providers, up‑sampling or down‑sampling as needed. Opus covers a wide bitrate range (6 kbps to 510 kbps) and sample rates from 8 kHz to 48 kHz; G.711 provides good quality at 64 kbps; G.729 achieves high compression (8 kbps) with acceptable quality.
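As an illustration of the encoding step, the sketch below implements the μ-law variant of G.711 (PCMU), which maps each 16‑bit sample to one byte; at 8,000 samples per second this gives the 64 kbps figure. The article does not say whether the robot uses μ-law or A-law, so the choice of μ-law is an assumption, and the function names `linear_to_ulaw` and `encode_pcmu` are hypothetical.

```python
import struct

BIAS = 0x84    # bias added before finding the segment
CLIP = 32635   # largest magnitude that still fits after adding the bias


def linear_to_ulaw(sample):
    """Encode one signed 16-bit PCM sample as an 8-bit mu-law value."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS

    # Find the segment (exponent): position of the highest set bit, 7 down to 0.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1

    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are transmitted bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def encode_pcmu(pcm_bytes):
    """Encode little-endian 16-bit mono PCM into mu-law bytes."""
    count = len(pcm_bytes) // 2
    samples = struct.unpack(f"<{count}h", pcm_bytes[:count * 2])
    return bytes(linear_to_ulaw(s) for s in samples)


# 20 ms of 8 kHz silence: 160 samples (320 bytes) become 160 payload bytes.
payload = encode_pcmu(bytes(320))
print(len(payload))
```

Opus and G.729 are far more involved and would normally be handled by an existing codec library instead of hand-written code like this.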

Internally, all audio is standardized to 16 kHz, 16‑bit PCM. When a provider uses 8 kHz, the robot down‑samples outgoing audio and up‑samples incoming audio. An anti‑aliasing low‑pass filter is applied before down‑sampling, and an anti‑imaging low‑pass filter is applied after up‑sampling.
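The 16 kHz to 8 kHz conversion and its reverse can be sketched as follows. This is an unoptimized illustration, not the robot's resampler: the windowed-sinc design, the 31-tap length, the Hamming window, and all function names are assumptions. The same low-pass filter (cutoff at 4 kHz, a quarter of the 16 kHz rate) serves as the anti-aliasing filter before samples are dropped and as the anti-imaging filter after zeros are inserted.

```python
import math


def lowpass_taps(num_taps=31, cutoff=0.25):
    """Windowed-sinc low-pass FIR; cutoff is in cycles per sample at 16 kHz."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]   # normalize to unity gain at DC


TAPS = lowpass_taps()


def fir(samples, taps):
    """Zero-phase (centered) FIR filter; samples outside the input count as 0."""
    half = len(taps) // 2
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(samples):
                acc += t * samples[j]
        out.append(acc)
    return out


def downsample_16k_to_8k(samples):
    """Anti-aliasing filter first, then keep every second sample."""
    return fir(samples, TAPS)[::2]


def upsample_8k_to_16k(samples):
    """Insert a zero after each sample, then apply the anti-imaging filter."""
    stuffed = []
    for s in samples:
        stuffed.extend((s, 0.0))
    # Zero-stuffing halves the signal level, so compensate with a gain of 2.
    return [2.0 * v for v in fir(stuffed, TAPS)]


frame_16k = [1000.0] * 320              # one 20 ms frame at 16 kHz
frame_8k = downsample_16k_to_8k(frame_16k)
restored = upsample_8k_to_16k(frame_8k)
print(len(frame_8k), len(restored))
```

A production system would more likely use a polyphase resampler from an audio library, which avoids computing the samples that are thrown away.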

In production, the voice robot already handles tens of thousands of outbound calls daily for recruitment and directory services, with plans to expand to customer notifications and resume verification. Voice calling is a foundational capability that will continue to be optimized for broader SIP provider compatibility and improved call quality.

Written by

58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.
