Implementation and Practice of MRCP in Meituan Voice Interaction
This article details Meituan’s adoption of the Media Resource Control Protocol (MRCP) to standardize ASR and TTS integration, describing its architecture, key components, high‑availability deployment, and measured performance gains such as up to 55% latency reduction and a 15% increase in outbound call success rates.
Voice bots have become a leading AI application, typically chaining ASR, NLU, dialogue management, and TTS. To integrate these media services with telephone systems, Meituan adopts the Media Resource Control Protocol (MRCP), which standardizes control of speech resources over SIP, RTP and SDP.
What is MRCP
MRCP (Media Resource Control Protocol) defines Request, Response and Event messages for controlling media services such as speech recognition (ASR) and speech synthesis (TTS). In MRCPv2, SIP with SDP sets up the session, RTP carries the audio, and the MRCP messages themselves travel over a separate TCP control channel. An example SPEAK request for TTS is shown below:
MRCP/2.0 380 SPEAK 14321
Channel-Identifier: 43b9ae17@speechsynth
Content-Type: application/ssml+xml
Content-Length: 253

您好,有什么能帮助您?
The body is an SSML document whose text means "Hello, how can I help you?". The server replies with an IN-PROGRESS response and finally a SPEAK-COMPLETE event indicating successful synthesis.
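The second field of an MRCPv2 start line is the total message length in bytes, so a client has to compute a length that also counts its own digits. The sketch below assumes the MRCPv2 framing defined in RFC 6787. The helper names `build_request` and `parse_start_line` and the SSML body are hypothetical, and this is an illustration rather than Meituan's client code.

```python
from typing import Dict


def build_request(method: str, request_id: int,
                  headers: Dict[str, str], body: str = "") -> bytes:
    """Serialize an MRCPv2 request whose length field covers the whole message."""
    body_bytes = body.encode("utf-8")
    head = "".join(f"{name}: {value}\r\n" for name, value in headers.items())
    if body_bytes:
        head += f"Content-Length: {len(body_bytes)}\r\n"
    rest = (head + "\r\n").encode("utf-8") + body_bytes

    # The length field includes its own digits, so iterate to a fixed point.
    length = 0
    while True:
        start = f"MRCP/2.0 {length} {method} {request_id}\r\n".encode("ascii")
        total = len(start) + len(rest)
        if total == length:
            return start + rest
        length = total


def parse_start_line(line: str) -> Dict[str, str]:
    """Classify a start line as request, response, or event."""
    parts = line.strip().split(" ")
    if len(parts) not in (4, 5) or parts[0] != "MRCP/2.0":
        raise ValueError(f"not an MRCPv2 start line: {line!r}")
    if len(parts) == 4:
        # Request: version, length, method, request-id
        return {"kind": "request", "method": parts[2], "request_id": parts[3]}
    if parts[2].isdigit():
        # Response: version, length, request-id, status-code, request-state
        return {"kind": "response", "request_id": parts[2],
                "status": parts[3], "state": parts[4]}
    # Event: version, length, event-name, request-id, request-state
    return {"kind": "event", "event": parts[2],
            "request_id": parts[3], "state": parts[4]}


# Hypothetical SSML body; the channel identifier is the one from the example above.
message = build_request(
    "SPEAK",
    14321,
    {"Channel-Identifier": "43b9ae17@speechsynth",
     "Content-Type": "application/ssml+xml"},
    "<speak>您好,有什么能帮助您?</speak>",
)
```

The parser only inspects the start line. A real client would also read the headers and use Content-Length to find the end of the body.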
Usage Scenarios
MRCP is supported by major voice platforms such as Asterisk and FreeSWITCH. A simple call‑center architecture connects the telephony gateway (FreeSWITCH) to an MRCP‑enabled speech service, allowing the caller’s audio to be streamed to ASR, processed by dialogue logic, and answered via TTS.
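The flow described above can be sketched as a single dialogue turn. All names here (`SpeechRecognizer`, `SpeechSynthesizer`, `handle_turn`, and the stub engines) are hypothetical. In a real deployment the ASR and TTS steps would be MRCP requests issued by the telephony gateway, not in-process Python calls.

```python
from typing import Callable, Protocol


class SpeechRecognizer(Protocol):
    def recognize(self, audio: bytes) -> str: ...


class SpeechSynthesizer(Protocol):
    def synthesize(self, text: str) -> bytes: ...


def handle_turn(audio: bytes,
                asr: SpeechRecognizer,
                dialogue: Callable[[str], str],
                tts: SpeechSynthesizer) -> bytes:
    """One caller turn: audio in, recognized text, reply text, audio out."""
    text = asr.recognize(audio)
    reply = dialogue(text)
    return tts.synthesize(reply)


class StubRecognizer:
    """Stand-in engine that treats the audio bytes as UTF-8 text."""

    def recognize(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class StubSynthesizer:
    """Stand-in engine that returns the reply text as bytes."""

    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


reply_audio = handle_turn(
    b"where is my order",
    StubRecognizer(),
    lambda text: f"You asked: {text}",
    StubSynthesizer(),
)
```

Because `handle_turn` depends only on the two small interfaces, any engine that satisfies them can be swapped in without touching the dialogue logic.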
Meituan’s In‑house ASR/TTS
Since 2018 Meituan has built proprietary ASR and TTS engines. On telephone‑call test sets, ASR word accuracy reaches 94.6%, compared with roughly 89% for leading vendors. The TTS system supports a wide range of voice styles and handles billions of daily requests across delivery, ride‑hailing, and customer‑service scenarios.
Why MRCP
Before MRCP, each ASR/TTS vendor required a custom HTTP/RPC integration, which led to high development cost, inconsistent voice quality, and high latency. MRCP provides a uniform interface so that a single client program can work with any MRCP‑compliant engine, achieving “write once, run everywhere”.
Design Goals
Expose ASR/TTS via a standard protocol compatible with industry‑leading solutions.
Decouple dialogue logic from media processing to enable horizontal scaling and easier internal adoption.
Support external commercial partners by offering a private‑deployed MRCP server.
System Architecture
The AI for Contact Center (AICC) platform sits atop Meituan’s speech‑technology stack. MRCP‑TTS and MRCP‑ASR plugins run inside a UniMRCP‑based MRCP server. The SIP proxy Kamailio provides Layer 7 load balancing and session stickiness, while the internal MGW (a Layer 4 load balancer) offers a stable virtual IP for client access. Figure 3 in the original article compares the traditional HTTP‑based flow with the MRCP‑based flow.
Key Technical Components
MRCP‑TTS and MRCP‑ASR plugins that wrap the internal ASR/TTS engines.
Event/Interface management for channel creation, requests, and responses.
Session management, thread‑pool handling, and configuration loading.
Logging integration with Meituan’s log framework and monitoring via the Raptor platform.
Authentication through Meituan’s unified voice‑service auth service.
Deployment Scheme
High availability is achieved by:
Unified service entry via MGW providing a VIP.
Resource isolation: separate MRCP‑TTS and MRCP‑ASR services.
Cross‑region, multi‑data‑center deployment.
Kamailio hot‑standby with SIP‑level routing to keep the same server for a call’s lifetime.
Layer 7 load balancing for SIP/MRCP traffic, inserting Via and Contact addresses to preserve stickiness.
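Session stickiness means that every SIP message belonging to one call reaches the MRCP server that holds that call's session. A minimal sketch of the idea follows, assuming routing keyed by the SIP Call-ID and hypothetical backend addresses. It illustrates the concept only and does not describe how Kamailio is configured at Meituan.

```python
import hashlib
from typing import Dict, List


class StickyRouter:
    """Pins each call to one backend for the lifetime of the call."""

    def __init__(self, backends: List[str]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._sessions: Dict[str, str] = {}

    def route(self, call_id: str) -> str:
        """Return the pinned backend, choosing one by hash for a new call."""
        backend = self._sessions.get(call_id)
        if backend is None:
            digest = hashlib.sha256(call_id.encode("utf-8")).digest()
            index = int.from_bytes(digest[:8], "big") % len(self._backends)
            backend = self._backends[index]
            self._sessions[call_id] = backend
        return backend

    def add_backend(self, backend: str) -> None:
        """New capacity affects only calls that have not been routed yet."""
        self._backends.append(backend)

    def end_call(self, call_id: str) -> None:
        """Forget the pin once the call has ended."""
        self._sessions.pop(call_id, None)


router = StickyRouter(["10.0.0.1:5060", "10.0.0.2:5060"])  # hypothetical addresses
pinned = router.route("a84b4c76e66710@gateway.example")
router.add_backend("10.0.0.3:5060")
still_pinned = router.route("a84b4c76e66710@gateway.example")
```

Keeping an explicit session table, instead of hashing on every message, is what keeps an in-progress call on its original server when the backend pool changes.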
Practice and Effects
MRCP is now used in many internal scenarios:
Outbound call robots (marketing, notification): more than 1 million synthesis calls per day, with peak concurrency around 1,000.
Inbound call‑center robots: about 10 million synthesis calls per day, with peak concurrency above 1,000.
External partners (e.g., WeiHu Technology, Zhongtong Tianhong) deploy a private MRCP server capable of handling ~600 concurrent streams in an 8‑core, 16 GB container.
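The per‑container figure above lends itself to a rough sizing calculation. The sketch below takes the ~600 concurrent streams per container from the text. The 30% headroom and the minimum of two replicas are assumptions chosen for illustration, and `containers_needed` is a hypothetical helper, not part of any Meituan tooling.

```python
def containers_needed(peak_concurrent: int,
                      per_container: int = 600,
                      headroom_percent: int = 30,
                      min_replicas: int = 2) -> int:
    """Estimate the container count for a given peak of concurrent streams."""
    if peak_concurrent < 0 or per_container <= 0:
        raise ValueError("peak must be >= 0 and per_container must be > 0")
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom_percent must be in [0, 100)")
    # Capacity left per container after reserving headroom (integer math).
    usable = per_container * (100 - headroom_percent) // 100
    if usable == 0:
        raise ValueError("headroom leaves no usable capacity")
    # Ceiling division without floating point.
    required = -(-peak_concurrent // usable)
    return max(min_replicas, required)


# A peak of 1,000 concurrent streams with the assumed defaults.
estimate = containers_needed(1000)
```

With these assumed defaults each container is budgeted at 420 streams, so a peak of 1,000 streams rounds up to three containers.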
Performance improvements include:
End‑to‑end latency reduced by ~55% for inbound calls and ~33% for outbound calls.
Customer dissatisfaction decreased by 0.25–3.92 percentage points; average call handling time shortened by 2.19–5.30 seconds.
Outbound call success rate increased by 15% in an A/B test after switching to MRCP‑TTS with the “MeiFanNan” voice.
Audio quality issues such as mismatched voice timbre and high synthesis delay were eliminated, as demonstrated by before/after recordings in the article.
Conclusion
MRCP and its associated TTS/ASR services have become mature, delivering stable, low‑latency speech interaction for Meituan’s internal and external customers. Future work will extend the protocol to support voiceprint recognition (VPR) and other biometric checks, further enriching Meituan’s AI‑driven voice ecosystem.
References
Media Resource Control Protocol (2022) Wikipedia.
Shi Junbo, Zhan Shubo (2010) MRCPv2 protocol and its application in distributed voice resource solutions.
Zhu, James (2018) MRCP Overview.
UniMRCP Introduction, https://www.unimrcp.org/.
Kamailio Introduction (2022), https://www.kamailio.org/w/.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.