Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue

The paper introduces the Answer‑Driven Visual State Estimator (ADVSE), which uses answer‑driven focusing attention and conditional visual information fusion to dynamically incorporate answers into visual dialogue, overcoming static encoding limitations and achieving state‑of‑the‑art performance on the GuessWhat?! question‑generation and guessing tasks.

Meituan Technology Team

Goal-oriented visual dialogue is a newly emerging task at the intersection of vision and language: a machine must achieve a specific visual goal through multi-turn conversation, a setting with both research significance and practical value.

Recently, Professor Wang Xiaojie’s team from Beijing University of Posts and Telecommunications, in collaboration with the NLP Center of Meituan AI, had their paper “Answer‑Driven Visual State Estimator for Goal‑Oriented Visual Dialogue” accepted at ACM Multimedia 2020 (ACM MM 2020), a top-tier international multimedia conference.

The paper identifies two major shortcomings in existing multimodal dialogue systems:

Language encoding treats the answer as just another token appended to the question, failing to distinguish opposite answers (e.g., “Yes” vs. “No”), even though the answer heavily influences subsequent visual attention and the direction of the dialogue.

Visual encoding is often static throughout the dialogue, either concatenated with the dynamic language encoding or guided by a simple QA‑based attention, which cannot adapt to different answers.

To address these issues, the authors propose the Answer‑Driven Visual State Estimator (ADVSE), which consists of:

Answer‑Driven Focusing Attention (ADFA): a gating mechanism that polarizes the attention guided by the current question and then maintains or reverses it based on the specific answer, thereby emphasizing the answer’s impact on the dialogue state.

Conditional Visual Information Fusion (CVIF): a module that adaptively fuses global image information with object‑level difference cues under the guidance of the current QA pair, producing an estimated visual state.
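The two modules above can be illustrated with a minimal numpy sketch. This is not the paper’s implementation: the function names (`adfa_update`, `cvif_fuse`), the sharpening factor, the element-wise accumulation rule, and the scalar sigmoid gate are all simplifying assumptions made here for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def adfa_update(prev_att, question_att, answer_is_yes, sharpen=5.0):
    """Answer-Driven Focusing Attention (illustrative sketch).

    Polarize the question-guided attention over image regions, then
    keep it for a 'Yes' answer or reverse it for a 'No' answer, and
    accumulate into the running visual attention state.
    """
    if answer_is_yes:
        # 'Yes': sharpen focus on the regions the question pointed at.
        focused = softmax(sharpen * question_att)
    else:
        # 'No': reverse focus toward the complementary regions,
        # since the answer ruled the attended regions out.
        focused = softmax(sharpen * (1.0 - question_att))
    # Accumulate with the previous attention state and renormalize.
    new_att = prev_att * focused
    return new_att / new_att.sum()

def cvif_fuse(v_global, v_diff, qa_gate_logit):
    """Conditional Visual Information Fusion (illustrative sketch).

    A QA-conditioned scalar gate blends global image information with
    object-level difference information into one estimated visual state.
    """
    g = 1.0 / (1.0 + np.exp(-qa_gate_logit))  # sigmoid gate in (0, 1)
    return g * v_global + (1.0 - g) * v_diff
```

For example, starting from a uniform attention over four regions, a question attending mostly to region 0 keeps that focus after a “Yes” answer but shifts it to the other regions after a “No” answer; in practice the gate logit and attention maps would come from learned networks rather than hand-set values.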

ADVSE is applied to the public GuessWhat?! dataset on both the question-generation (QGen) and guessing (Guesser) tasks. Combined with a hierarchical dialogue‑history encoder, the model achieves state‑of‑the‑art performance on both tasks, with a higher task success rate and a lower guessing error rate than previous methods.

The code for ADVSE‑GuessWhat will be released at https://github.com/zipengxuc/ADVSE-GuessWhat.

In summary, the proposed ADVSE highlights the crucial role of answers in visual dialogue by introducing answer‑driven attention updating and conditional visual fusion, leading to superior results on goal‑oriented visual dialogue benchmarks.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

multimodal AI, natural language processing, attention mechanism, state estimation, goal-oriented, visual dialogue
Written by

Meituan Technology Team

Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
