How Evidence Generation Boosts Document-Grounded Dialogue with LLMs

This study introduces DGDE, a document‑grounded dialogue framework that leverages large language model‑generated evidence, combining retrieval, reranking, fine‑tuning, and iterative question correction to markedly improve accuracy, comprehensiveness, coherence, and completeness on the Doc2dial benchmark.


Introduction

Document‑grounded dialogue (DGD) requires retrieving and reasoning over domain‑specific documents to answer user queries. Large language models (LLMs) improve contextual understanding but can generate hallucinations. The paper proposes DGDE (Document‑Grounded Dialogue based on Evidence Generation), a goal‑oriented framework that combines retrieval, reranking, round‑wise LLM fine‑tuning, and iterative evidence generation to improve answer reliability.

Related Work

Existing DGD approaches fall into three categories: (1) representation‑based retrieval (e.g., BM25, DeepCT, DPR); (2) interaction‑based dialogue models (e.g., BERT, BART, T5); and (3) LLM‑driven pipelines (e.g., LangChain, Auto‑GPT). All suffer from limited semantic depth or hallucination issues.

DGDE Method

1. Retrieval & Reranking

The system first retrieves candidate passages with a vector‑space model and then reranks them with a text‑label classifier. The final score for a passage p given a query q is a weighted sum:

score(p, q) = λ·sim_vec(p, q) + (1 − λ)·sim_label(p, q)

where λ ∈ [0, 1] balances vector similarity (sim_vec) against label match (sim_label). The top‑N passages are kept for downstream processing.

Figure 7: Retrieval & Reranking details
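As a rough illustration of this weighted scoring, here is a minimal Python sketch; the encode and label_match callables stand in for the paper's (unspecified) vector model and text‑label classifier, and λ = 0.7 is an arbitrary placeholder.

```python
# Illustrative sketch of hybrid retrieval-reranking, not the paper's code.
import numpy as np

def hybrid_score(sim_vec: float, sim_label: float, lam: float = 0.7) -> float:
    """Weighted sum of vector similarity and label-match score, lam in [0, 1]."""
    return lam * sim_vec + (1.0 - lam) * sim_label

def rerank(passages, query, encode, label_match, lam=0.7, top_n=3):
    """Keep the top-N passages by combined score.

    `encode` maps text to a unit-normalised vector and `label_match`
    returns a [0, 1] label-compatibility score; both are assumptions.
    """
    q_vec = encode(query)
    scored = []
    for p in passages:
        sim_vec = float(np.dot(q_vec, encode(p)))  # cosine sim on unit vectors
        scored.append((hybrid_score(sim_vec, label_match(p, query), lam), p))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [p for _, p in scored[:top_n]]
```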

2. Round‑wise Fine‑tuning & Inference

During fine‑tuning, each dialogue round is treated as an independent prediction target: special tokens encode the round index, and irrelevant turns are filtered out. The loss is summed over all rounds, enabling the model to predict multiple agent responses in a single forward pass.

Figure 8: Round‑wise loss calculation
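To make the round‑wise objective concrete, here is a short PyTorch sketch; the round_ids layout and masking scheme are assumptions made for illustration, not the paper's exact implementation.

```python
# Illustrative round-wise loss: every agent turn is its own prediction
# target; filtered (irrelevant) positions carry the ignore index and the
# per-round losses are summed.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # standard ignore index for cross-entropy

def round_wise_loss(logits: torch.Tensor, labels: torch.Tensor,
                    round_ids: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab); labels, round_ids: (seq_len,)."""
    total = logits.new_zeros(())
    for r in round_ids.unique():
        round_labels = labels.clone()
        round_labels[round_ids != r] = IGNORE_INDEX  # only round r contributes
        total = total + F.cross_entropy(
            logits, round_labels, ignore_index=IGNORE_INDEX, reduction="sum"
        )
    return total  # summed over all rounds, as described above
```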

3. Evidence Generation via Question Correction

The evidence generation loop runs up to a preset maximum number of cycles (hyper‑parameter C_max). Each cycle performs:

1. Prepare the user question q and the full document fragment set P.
2. Retrieve the top m relevant fragments {p_n, …, p_{n+m}} using the retrieval‑reranking module.
3. Feed q and the fragments to the LLM with a prompt template (see Figure 9). The LLM outputs a corrected question q′ that better aligns with the retrieved context.
4. Repeat steps 2–3 until the corrected question converges or C_max is reached.

After the loop terminates, the most frequent evidence fragments (top N by occurrence) are selected and combined with the original question to form the final prompt for answer generation (Figure 11). The full procedure is given as pseudo‑code in Method 1.

Figure 11: Evidence generation workflow
Method 1: Pseudo‑code of DGDE
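A minimal Python sketch of this loop follows; the retrieve function (the retrieval‑reranking module above) and the llm_correct call that rewrites the question are assumed interfaces, not the paper's exact API.

```python
# Illustrative question-correction loop with frequency-based evidence selection.
from collections import Counter

def generate_evidence(question, fragments, retrieve, llm_correct,
                      c_max=3, m=3, top_n=3):
    q = question
    seen = Counter()  # tallies how often each fragment is retrieved
    for _ in range(c_max):
        top_m = retrieve(q, fragments, k=m)  # step 2: retrieval + reranking
        seen.update(top_m)
        q_new = llm_correct(q, top_m)        # step 3: LLM corrects the question
        if q_new == q:                       # converged: stop early
            break
        q = q_new
    # keep the most frequently retrieved fragments as the final evidence
    evidence = [frag for frag, _ in seen.most_common(top_n)]
    return q, evidence
```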

Experiments

Dataset and Metrics

Evaluation uses the Doc2dial benchmark (4,793 annotated dialogues, 14 turns on average, 487 documents across four domains). Responses are scored with BLEU, ROUGE, and METEOR along four dimensions: accuracy, comprehensiveness, coherence, and completeness.
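As a pointer for reproduction, these metrics can be computed with common open‑source packages; the snippet below uses nltk and rouge‑score as stand‑ins, since the paper's exact evaluation scripts are not specified here.

```python
# Illustrative metric computation; METEOR needs nltk data
# (run nltk.download('wordnet') once beforehand).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

reference = "you can renew your license online or by mail".split()
hypothesis = "you may renew the license online or by mail".split()

bleu = sentence_bleu([reference], hypothesis,
                     smoothing_function=SmoothingFunction().method1)
meteor = meteor_score([reference], hypothesis)
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
    " ".join(reference), " ".join(hypothesis))["rougeL"].fmeasure

print(f"BLEU={bleu:.3f}  METEOR={meteor:.3f}  ROUGE-L={rouge_l:.3f}")
```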

Baselines

DGD1/DGD2: Vicuna‑13B‑16k, Qwen‑14B‑Chat, Baichuan2‑13B‑Chat combined with vector models (M3E, Text2Vec).

DGD3/DGD4: Same LLMs combined with keyword‑based retrieval (BPE+BM25, N‑gram+BM25).

DGDE: Proposed method (retrieval + reranking, round‑wise fine‑tuned LLM, evidence generation).

Overall Results

DGDE outperforms all baselines on every metric. Compared with DGD1, accuracy improves by 21.91 %, comprehensiveness by 10.89 %, coherence by 38.98 %, and completeness by 16.13 %. Compared with DGD3, the gains are 12.81 %, 69.83 %, 53.27 %, and 36.97 % respectively.

Ablation Study

Removing individual components shows:

LLM fine‑tuning contributes +11.61 % absolute accuracy.

Evidence generation contributes +27.35 % absolute comprehensiveness.

Both fine‑tuning and evidence generation add ~9 % to coherence and ~14 % to completeness.

Retrieval + reranking yields modest but consistent improvements across all dimensions.

Sub‑task Analyses

Retrieval & Reranking: four vector encoders were tested (Sentence‑BERT, CoROM‑Base, M3E, Text2Vec). DGDE's hybrid scoring shows smoother recall growth as N increases (Figure 12) and transfers well to each encoder (Table 5).

Evidence Generation: varying the number of iteration cycles shows that three cycles give the best trade‑off between answer quality and hallucination (Figure 13).

LLM Fine‑tuning: Vicuna‑13B fine‑tuned on Doc2dial improves BLEU/ROUGE/METEOR for most encoders; gains are limited with Sentence‑BERT.

Retrieval Quantity: using the top 3 retrieved passages yields the best QA performance for all three LLMs (Figure 14).

Resource Consumption

Two Tesla A800 80 GB GPUs were used: one for fine‑tuning (≈37 h for six epochs) and one for inference. DGDE requires three LLM interactions per query, leading to higher latency than baselines (Table 7).

Conclusion

DGDE integrates retrieval‑reranking, round‑wise LLM fine‑tuning, and iterative evidence generation to substantially improve accuracy, comprehensiveness, coherence, and completeness on the Doc2dial benchmark. Ablation results confirm that evidence generation and LLM fine‑tuning are the primary performance drivers, while the retrieval‑reranking component provides consistent auxiliary gains. The framework is transferable to other retrieval‑augmented generation tasks such as information‑retrieval‑based summarization.

Tags: large language models, fine‑tuning, retrieval, document‑grounded dialogue, evidence generation