Alibaba AI Wins Visual Dialogue Challenge with New Recursive Model

In the second Visual Dialogue Challenge, Alibaba’s AI outperformed ten teams, including Microsoft and Seoul University. It achieved 74.57% accuracy, beating the previous record by 16.82% and exceeding human performance. The result comes from its novel recursive exploration dialogue model, which integrates image recognition, relational reasoning, and natural language understanding.

Alibaba Cloud Developer

Jun 28, 2019

Alibaba’s AI recently clinched the championship in the second Visual Dialogue Challenge, outperforming ten teams including Microsoft and Seoul University.

The competition, jointly organized by Georgia Tech, Facebook AI Research and the CVPR conference, is the most authoritative event in visual‑dialogue research.

Participants must answer any question about any image after viewing nearly ten thousand pictures. Alibaba’s AI achieved 74.57% accuracy, beating the previous record by 16.82% and surpassing the human performance of 64.27% on the same dataset.
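The article does not say whether the 16.82% improvement is measured in percentage points or as a relative increase. As a rough sketch, the arithmetic below works out what each reading would imply for the previous record, using only the three figures quoted above. The function name and the two interpretations are illustrative assumptions, not something the organizers or Alibaba have stated.

```python
# Figures quoted in the article, all in percent.
ALIBABA_SCORE = 74.57
HUMAN_SCORE = 64.27
REPORTED_GAIN = 16.82


def implied_previous_record(score, gain, relative):
    """Return the previous record implied by the reported gain.

    relative=False: gain is read as percentage points (score - previous).
    relative=True:  gain is read as a relative increase over the previous score.
    Both readings are assumptions; the article does not specify which applies.
    """
    if relative:
        return score / (1 + gain / 100)
    return score - gain


margin_over_human = round(ALIBABA_SCORE - HUMAN_SCORE, 2)
prev_if_points = round(
    implied_previous_record(ALIBABA_SCORE, REPORTED_GAIN, relative=False), 2
)
prev_if_relative = round(
    implied_previous_record(ALIBABA_SCORE, REPORTED_GAIN, relative=True), 2
)

print(f"Margin over human performance: {margin_over_human} points")
print(f"Previous record if gain is in points: {prev_if_points}%")
print(f"Previous record if gain is relative: {prev_if_relative}%")
```

Under the percentage-point reading the previous record would have been 57.75%, which is below the quoted human score. Under the relative reading it would have been close to the human score already.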

Traditional visual AI focuses on object detection and recognition. It cannot understand logical relationships or reason about complex scenes, so it struggles with questions such as “What color shirt is the boy next to the cat wearing?”

Alibaba’s breakthrough is the “recursive exploration dialogue model”, which integrates image recognition, relational reasoning, and natural language understanding. It learns to mimic human cognition, identifying entities and their relationships, inferring events, modeling context, and generating natural, accurate responses to human queries.

Visual dialogue is an emerging AI research direction that teaches machines to discuss visual content in natural language, moving AI from mere visual perception to deeper understanding and inference.

Future applications include rescue robots that locate survivors in earthquake rubble, tools that help visually impaired users understand photos, and more accurate intent understanding for autonomous vehicles.
