Retrieval-Augmented Affordance Prediction Enables Zero-Shot Fine-Grained Robot Manipulation
The RAAP framework decouples the static contact point from the dynamic action direction and uses retrieval‑augmented inference to achieve zero‑shot, cross‑category fine‑grained robot manipulation with only dozens of training examples per task.
In embodied intelligence, affordance prediction—enabling a robot to infer "where to act" (contact point) and "how to act" (action direction) from visual observations—is a prerequisite for fine‑grained manipulation such as precisely pulling a drawer handle.
Existing approaches fall into two paradigms: retrieval‑based methods avoid large robot datasets but suffer from brittle single‑match behavior and poor coverage of unseen categories; large‑scale trained models learn transferable visual patterns yet often misplace contact points and predict incorrect motion directions, limiting spatial precision.
To overcome these limitations, the Southeast University team proposes RAAP (Retrieval‑Augmented Affordance Prediction). RAAP decomposes affordance into a static contact point and a dynamic action direction and pairs each with a complementary inference mechanism: contact points are transferred via dense feature matching with the top‑1 retrieved reference, while action directions are predicted by a retrieval‑enhanced alignment model that aggregates multiple references through a dual‑weight attention mechanism. The full framework needs only dozens of training samples per task to achieve zero‑shot, cross‑category fine‑grained manipulation.
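To make the contact‑point transfer concrete, here is a minimal sketch (assuming dense per‑pixel feature maps such as diffusion features; the function and variable names are illustrative, not RAAP's code): the feature at the reference's annotated contact pixel is matched to its nearest neighbor in the query feature map by cosine similarity.

```python
import torch
import torch.nn.functional as F

def transfer_contact_point(ref_feats, query_feats, ref_point):
    """Transfer a 2D contact point from a reference image to the query image
    by nearest-neighbor matching in a dense feature space.

    ref_feats, query_feats: (C, H, W) dense feature maps, assumed upsampled to
                            image resolution.
    ref_point: (u, v) pixel coordinates of the annotated contact point.
    Returns the (u, v) query pixel whose feature is most similar.
    """
    C, H, W = query_feats.shape
    u, v = ref_point
    ref_vec = ref_feats[:, v, u]                          # feature at the reference contact pixel
    q = F.normalize(query_feats.reshape(C, -1), dim=0)    # (C, H*W), unit-norm per pixel
    r = F.normalize(ref_vec, dim=0).unsqueeze(1)          # (C, 1)
    sim = (q * r).sum(dim=0)                              # cosine similarity per query pixel
    idx = sim.argmax().item()
    return idx % W, idx // W                              # best-matching (u, v)
```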
RAAP builds a visual affordance memory from the DROID and HOI4D datasets, storing segmented object images, CLIP features, task labels, annotated 2D contact points, and normalized action vectors. At inference, CLIP text and image encoders retrieve the top‑K semantically and visually relevant references. For contact point localization, the top‑1 reference is matched at the pixel level using Stable Diffusion dense features, transferring the reference point to the query image. For dynamic direction prediction, query and reference images share a SigLIP‑2 backbone that extracts patch‑level features. Each reference’s action vector modulates its visual features via FiLM; the resulting reference features are concatenated into a Key‑Value matrix, and the query feature acts as the Query in a cross‑attention module. A Transformer encoder then regresses the final action direction.
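A minimal PyTorch sketch of this direction‑prediction pathway, assuming generic patch‑feature tensors (layer sizes, module names, and the mean‑pooling readout are illustrative choices, not the paper's implementation):

```python
import torch
import torch.nn as nn

class DirectionPredictor(nn.Module):
    """Sketch of the retrieval-enhanced direction model: each reference's action
    vector modulates its patch features via FiLM, the query patches attend over
    all reference tokens, and a Transformer encoder regresses a 2D direction."""
    def __init__(self, dim=768, action_dim=2, n_heads=8):
        super().__init__()
        self.film = nn.Linear(action_dim, 2 * dim)          # action -> per-channel scale/shift
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, action_dim)              # regress a unit 2D action direction

    def forward(self, query_tokens, ref_tokens, ref_actions):
        # query_tokens: (B, Nq, dim)   patch features of the query image (e.g. SigLIP-2)
        # ref_tokens:   (B, K, Nr, dim) patch features of the K retrieved references
        # ref_actions:  (B, K, action_dim) normalized action vectors of the references
        B, K, Nr, D = ref_tokens.shape
        gamma, beta = self.film(ref_actions).chunk(2, dim=-1)        # (B, K, D) each
        ref_mod = ref_tokens * (1 + gamma.unsqueeze(2)) + beta.unsqueeze(2)  # FiLM modulation
        kv = ref_mod.reshape(B, K * Nr, D)                           # concatenate references as Key/Value
        fused, _ = self.cross_attn(query_tokens, kv, kv)             # query attends to references
        out = self.encoder(fused).mean(dim=1)                        # pool encoded tokens
        return torch.nn.functional.normalize(self.head(out), dim=-1) # (B, action_dim)
```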
The dual‑weight attention combines a CLIP cosine‑similarity weight (appearance prior) with the output of a lightweight gating network (semantic relevance). Both weights are normalized and summed, so that visually similar references are emphasized while semantically mismatched ones are suppressed, yielding robust multi‑reference aggregation.
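A rough sketch of how the two weights could be computed and combined, assuming CLIP image features for the query and references (the gating network architecture here is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualWeight(nn.Module):
    """Sketch of the dual-weight scheme: a CLIP cosine-similarity weight
    (appearance prior) and a learned gating weight (semantic relevance),
    each normalized over the K references and then summed."""
    def __init__(self, clip_dim=512):
        super().__init__()
        # Lightweight gating network scoring query/reference CLIP feature pairs.
        self.gate = nn.Sequential(nn.Linear(2 * clip_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, query_clip, ref_clip):
        # query_clip: (B, clip_dim), ref_clip: (B, K, clip_dim)
        sim = F.cosine_similarity(query_clip.unsqueeze(1), ref_clip, dim=-1)  # (B, K)
        w_sim = sim.softmax(dim=-1)                                           # appearance weight
        pair = torch.cat([query_clip.unsqueeze(1).expand_as(ref_clip), ref_clip], dim=-1)
        w_gate = self.gate(pair).squeeze(-1).softmax(dim=-1)                  # semantic weight
        return w_sim + w_gate   # (B, K): up-weights similar refs, down-weights mismatched ones
```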
Predicted 2D affordances are lifted to 3D using camera intrinsics and depth, and the 2D action direction is transformed with local surface normals into a 3D displacement vector. Execution on a real robot employs Cartesian impedance control for safe, compliant interaction, completing the full pipeline from contact localization to motion execution.
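For the 2D‑to‑3D lifting step, the standard pinhole back‑projection looks like the following sketch (the subsequent rotation of the 2D direction into the local tangent plane via the surface normal is omitted here):

```python
import numpy as np

def lift_to_3d(u, v, depth, K):
    """Back-project a 2D contact point to 3D camera coordinates using the depth
    value at that pixel and the camera intrinsic matrix K (standard pinhole model)."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    z = depth[v, u]                 # metric depth at the contact pixel
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])      # 3D contact point in the camera frame
```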
Extensive experiments on DROID, HOI4D, and a real Franka Research 3 arm compare RAAP with RAM (single‑reference retrieval) and A0 (large‑scale affordance model). Measured by mean angular error of the predicted direction, RAAP (K=3) achieves an overall error of 32.55°, more than 50% lower than the baselines, with the largest margins on open/close tasks where other methods exhibit a global direction bias. Ablation studies confirm that removing either the gating weight or the similarity weight degrades performance, and that K=3 offers the best trade‑off between reference diversity and noise.
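For reference, the per‑sample angular error underlying this metric is the angle between the predicted and ground‑truth direction vectors; a minimal NumPy version (a standard definition, assumed rather than quoted from the paper):

```python
import numpy as np

def angular_error_deg(pred, gt):
    """Angle in degrees between a predicted and ground-truth direction vector;
    averaging this over a test set gives the mean angular error reported above."""
    cos = np.dot(pred, gt) / (np.linalg.norm(pred) * np.linalg.norm(gt))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```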
Real‑world robot trials test unseen‑object generalization (same task, different object instances) and cross‑category generalization (trained on opening/closing microwaves, tested on opening/closing cabinets). Each task is evaluated with 20 random‑position attempts, using only DROID/HOI4D training data and no real‑world demonstrations. RAAP outperforms RAM by 15–25 percentage points on unseen‑object drawer tasks and achieves the highest success rates across all pick‑and‑place tasks. In cross‑category tests, RAAP reaches 100% success on cabinet‑closing and consistently leads other methods.
In summary, RAAP provides a unified, retrieval‑augmented framework that decouples static and dynamic affordances, enabling low‑cost deployment of fine‑grained robot manipulation with minimal training data and strong zero‑shot generalization.
