Retrieval‑Based Dialogue System for Customer Service at Meituan

This article details Meituan's retrieval‑based dialogue framework for customer service, covering its five‑layer architecture, offline‑to‑online metric system, text and vector recall strategies, ranking models with pre‑training and contrastive learning, and real‑world deployment results across multiple business scenarios.

Meituan Technology Team

System Overview

The retrieval‑based dialogue system consists of five layers: data & platform, recall, ranking, strategy, and application. Offline pipelines clean and index historical sessions daily, producing a speech‑act index. The offline automatic metrics (BLEU, ROUGE, Recall, MRR) correlate strongly with offline human satisfaction scores. ROUGE serves as the core offline metric, and it also correlates with the online adoption rate measured in A/B tests.
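As a rough illustration of two of the offline metrics named above, the sketch below computes a ROUGE‑2 recall score for one candidate reply and MRR over a set of ranked lists. The function names, the whitespace tokenization, and the choice of the recall variant of ROUGE‑2 are assumptions made for this example; the article does not say which ROUGE variant or tokenizer is used.

```python
from collections import Counter


def bigrams(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2_recall(candidate, reference):
    """Share of reference bigrams that also appear in the candidate."""
    cand = bigrams(candidate.split())
    ref = bigrams(reference.split())
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((cand & ref).values())  # clipped bigram matches
    return overlap / total


def mean_reciprocal_rank(ranked_relevance):
    """ranked_relevance: one list of 0/1 flags per query, in ranked order."""
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)
```

For Chinese text, the whitespace split would need to be replaced by a word segmenter or character‑level tokens.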

Recall Layer

Text Recall

Historical dialogues are indexed in three ways:

Short‑term context: the last utterance, tokenized and stop‑word filtered so that as much information as possible is preserved.

Long‑term context: earlier turns, reduced to their top‑M TF‑IDF keywords to filter out noise.

Robot‑generated context: tags produced by the dialogue platform.

Daily index updates ensure freshness. Experiments comparing half‑month, one‑month, and two‑month histories showed that one month of logs provides the best trade‑off between coverage and latency.
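The long‑term context indexing described above can be sketched as a top‑M TF‑IDF keyword extraction. This is a minimal illustration, not the production pipeline: the function name `top_m_keywords`, the smoothed IDF formula, and the assumption that turns are already tokenized are all choices made for this example.

```python
import math
from collections import Counter


def top_m_keywords(turns, corpus, m=3):
    """Return the m highest TF-IDF tokens from the earlier turns of a dialogue.

    turns:  list of token lists (earlier turns of one dialogue)
    corpus: list of token lists used to estimate document frequencies
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))

    term_freq = Counter(tok for turn in turns for tok in turn)
    # Smoothed IDF (an assumption): tokens present in every document score 0.
    scores = {
        tok: count * math.log((1 + n_docs) / (1 + doc_freq[tok]))
        for tok, count in term_freq.items()
    }
    # Sort by score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
    return ranked[:m]
```

A token such as a greeting that appears in nearly every session receives a near‑zero score and drops out, which is the noise filtering the long‑term index relies on.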

Vector Recall

Three representation models were evaluated: BoW, BERT, and a dual‑encoder (two‑tower) model with shared BERT encoders. The dual‑encoder was selected because it models long‑range dependencies and captures the sequential nature of dialogue.
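To make the two‑tower idea concrete, the sketch below shows only the retrieval step: once a context and all candidate responses have been encoded into vectors, recall reduces to a nearest‑neighbor search. The vectors here are hand‑written stand‑ins for encoder outputs, and `cosine` and `top_k` are hypothetical helper names; a production system would use a BERT encoder and an approximate nearest‑neighbor index instead of a linear scan.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(context_vec, response_index, k=2):
    """Return the k responses whose vectors are closest to the context vector.

    response_index: dict mapping response text -> precomputed vector
    """
    scored = [(cosine(context_vec, vec), text) for text, vec in response_index.items()]
    scored.sort(key=lambda pair: -pair[0])
    return [text for _, text in scored[:k]]
```

Because response vectors do not depend on the incoming context, they can be computed offline and refreshed with the daily index build.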

Positive samples are Context‑Response pairs extracted from logs. Negative samples include:

Predefined negatives sampled by rules.

Batch‑wise negatives (other samples in the same batch).

Hard negatives: Context‑based BM25 (CBM) and Response‑based BM25 (RBM). Experiments show that CBM improves recall while RBM degrades performance, likely because RBM yields false hard negatives.
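The batch‑wise negatives in the list above are commonly trained with a softmax cross‑entropy over the batch: for each context, its own response is the positive and every other response in the batch acts as a negative. The sketch below assumes dot‑product scoring and plain Python lists as vectors; the article does not state the exact loss, so treat this as one plausible formulation.

```python
import math


def in_batch_softmax_loss(context_vecs, response_vecs):
    """Mean cross-entropy where response i is the positive for context i
    and all other responses in the batch serve as negatives."""
    losses = []
    for i, ctx in enumerate(context_vecs):
        logits = [sum(a * b for a, b in zip(ctx, resp)) for resp in response_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)
```

Hard negatives mined with context‑based BM25 could be appended as extra columns of `response_vecs` for each row; that extension is omitted here.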

Ranking Layer

Dialogue Pre‑training

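Domain‑adaptive pre‑training tasks, namely masked language modeling (MLM), next sentence prediction (NSP), and sentence order prediction (SOP), together with a task‑specific Next‑Sentence‑Generation (NSG) task, were added on top of BERT. These tasks improve downstream response selection.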

Negative Sampling for Ranking

Three negative sources are defined:

Exposure list (viewed but not clicked) – performs poorly due to many false negatives.

Recall list (candidates returned by the recall module) – provides hard negatives.

Random utterances – easy negatives.

Combining recall and random negatives yields stable results.
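A minimal sketch of mixing the two useful sources, hard negatives from the recall list and easy negatives from random utterances, is shown below. The function name, the default counts, and the fixed seed are assumptions for illustration; the article does not give the mixing ratio.

```python
import random


def sample_negatives(positive, recall_list, utterance_pool, n_hard=2, n_easy=2, seed=0):
    """Return hard negatives (from recall) followed by easy negatives (random)."""
    rng = random.Random(seed)
    # Hard negatives: recalled candidates other than the true reply.
    hard_pool = [c for c in recall_list if c != positive]
    # Easy negatives: random utterances that are neither the reply nor recalled.
    easy_pool = [u for u in utterance_pool if u != positive and u not in hard_pool]
    hard = rng.sample(hard_pool, min(n_hard, len(hard_pool)))
    easy = rng.sample(easy_pool, min(n_easy, len(easy_pool)))
    return hard + easy
```

Exposure‑list negatives are deliberately left out, in line with the observation that they contain many false negatives.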

Learning‑to‑Rank Paradigms

Two paradigms were explored:

Pointwise: binary cross‑entropy on individual candidates.

Pairwise: RankNet (logistic) loss and hinge loss on (context, positive, negative) triples. Logistic RankNet outperforms hinge loss. Which paradigm works better depends on the scenario: online chat favors pairwise, while merchant IM favors pointwise. Joint modeling, in which a pointwise model is further fine‑tuned with a pairwise objective, gives the best results, although the gain is modest.
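The three losses mentioned above can be written down compactly. The sketch below uses their standard textbook forms on raw model scores (logits); the function names and the margin value of 1.0 are assumptions, and the article does not give the hyperparameters it used.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pointwise_bce(score, label):
    """Binary cross-entropy on one candidate; label is 0 or 1."""
    return softplus(score) - label * score


def ranknet_logistic(pos_score, neg_score):
    """RankNet loss: penalizes the negative scoring close to or above the positive."""
    return softplus(-(pos_score - neg_score))


def pairwise_hinge(pos_score, neg_score, margin=1.0):
    """Hinge loss: zero once the positive leads by at least the margin."""
    return max(0.0, margin - (pos_score - neg_score))
```

One practical difference is that the hinge loss stops producing gradient once the margin is met, while the logistic loss keeps a small gradient for every pair.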

Contrastive Learning

Data augmentation is applied separately to Context and Response:

Context augmentation: dropout, sentence re‑ordering (preserving speaker roles), token shuffling (using Jieba segmentation for Chinese).

Response augmentation: punctuation edits, random deletion or swapping of tokens (20% probability per token for sentences longer than five tokens).
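The response‑side token edits can be sketched as follows. The 20% per‑token probability and the five‑token threshold come from the text; the even split between deletion and swapping, the adjacent‑token swap, and the function name are assumptions made for this example.

```python
import random


def augment_response(tokens, p=0.2, min_len=5, seed=None):
    """Randomly delete or swap tokens in responses longer than min_len tokens."""
    rng = random.Random(seed)
    if len(tokens) <= min_len:
        return list(tokens)  # short responses are left untouched

    out = list(tokens)
    if rng.random() < 0.5:
        # Deletion: drop each token independently with probability p.
        kept = [tok for tok in out if rng.random() >= p]
        return kept if kept else [rng.choice(out)]

    # Swapping: exchange each token with its right neighbor with probability p.
    for i in range(len(out) - 1):
        if rng.random() < p:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out
```

For Chinese responses, `tokens` would typically come from a word segmenter such as Jieba rather than from whitespace splitting.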

Batch‑wise negatives benefit Context augmentation, while pair‑wise negatives benefit Response augmentation. R‑Drop regularization (KL divergence between two dropout passes) further stabilizes predictions, with KL loss outperforming MSE.
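The R‑Drop term can be illustrated with two probability distributions produced by two forward passes of the same input under different dropout masks. The sketch below uses the symmetric KL form from the R‑Drop method and includes an MSE variant for comparison; the function names and the epsilon smoothing are assumptions, and the distributions are passed in directly rather than produced by a model.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def r_drop_kl(p1, p2):
    """Symmetric KL between the outputs of two dropout passes."""
    return 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))


def r_drop_mse(p1, p2):
    """MSE alternative, shown only for comparison with the KL term."""
    return sum((a - b) ** 2 for a, b in zip(p1, p2)) / len(p1)
```

In training, this penalty would be added to the ranking loss with a weighting coefficient, so that the two passes are pushed toward the same prediction.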

Personalization Modeling

Non‑text features are incorporated via a simplified wide‑and‑deep style model that adds a bias term to the text relevance score. Three feature groups are used:

Merchant‑specific: whether the candidate reply originates from the merchant’s own history.

Product‑specific: presence of product or deal cards in the conversation.

Time‑specific: recency of the answer for time‑sensitive queries.

Merchant‑specific features contribute the largest gain; all features improve ranking metrics.
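The "bias term added to the text relevance score" can be pictured as a small linear model over the non‑text features. In the sketch below, the feature names and weights are hypothetical placeholders, not values from the article; in practice the weights would be learned jointly with the ranker.

```python
def personalized_score(text_score, features, weights):
    """Text relevance score plus a linear bias from non-text features."""
    bias = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return text_score + bias


def rerank(candidates, weights):
    """candidates: list of (reply, text_score, features) tuples."""
    scored = [
        (personalized_score(score, feats, weights), reply)
        for reply, score, feats in candidates
    ]
    scored.sort(key=lambda pair: -pair[0])
    return [reply for _, reply in scored]
```

With a positive weight on a merchant‑history feature, a reply taken from the merchant's own past conversations can overtake a generic reply that has a slightly higher text score.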

Offline Evaluation

Recall Benchmarks

Top‑6 BLEU and ROUGE‑2 scores for different recall methods (Table 1) show the hierarchy BM25 < BERT < Dual‑Encoder. Adding hard negatives (CBM) and diverse multi‑vector representations yields additional improvements.

Ranking Benchmarks

Top‑1 BLEU, ROUGE‑2, and Recall (Table 2) demonstrate incremental gains from:

Pairwise learning (logistic RankNet) vs. pointwise.

Dialogue pre‑training (MLM, NSG, NSP).

Contrastive learning (both Context and Response augmentations).

Personalized non‑text features.

Pairwise learning alone does not guarantee improvement; the combination of all techniques yields the highest scores.

Application Scenarios

Merchant IM Reply Recommendation

A smart assistant suggests replies to merchants during live IM, increasing reply rates and reducing latency, especially for merchants without dedicated support staff.

Online Agent CHAT Input Suggestion

Input auto‑completion based on typed prefixes accelerates agent responses, particularly for new agents.

Knowledge‑Base Answer Supply

When merchants are offline, the system recommends answers for custom knowledge‑base intents, lowering manual effort and improving coverage.

Conclusion and Future Directions

The retrieval‑based dialogue framework is mature, with a proven offline‑to‑online metric pipeline and component‑level optimizations. Ongoing research focuses on:

Hybrid retrieval‑generation models to supplement recall.

Multimodal interaction (text + image).

Fully automated conversational agents for specific sub‑scenes.

Key Figures

System Architecture
Metric Pipeline
Recall Benchmark Table
Ranking Benchmark Table