How Keyframe‑Chaining VLA Gives Robots Long‑Term Memory and Faster Reasoning
The article introduces the Keyframe‑Chaining VLA (KC‑VLA) framework, which replaces dense video sampling with semantic keyframe linking to provide robots with global temporal awareness, presents a new long‑term memory benchmark, and demonstrates superior performance in both simulation and real‑world robotic experiments.
Limitations of Existing VLA Input Paradigms
Current Vision‑Language‑Action (VLA) models excel at smooth motion but struggle with non‑Markovian tasks, those that require recalling past states to act correctly. This leads to three major failure modes:
Instant Forgetting: actions performed in the previous second are quickly lost.
Scene Confusion: similar visual frames cause decision ambiguity.
Compute Explosion: processing ultra‑long videos slows inference dramatically.
The root cause is that mainstream VLA models rely on short‑term dense observation and lack long‑range temporal understanding. Researchers from Tongji University propose the Keyframe‑Chaining VLA (KC‑VLA) framework, which extracts and links semantic keyframes instead of memorizing every frame, giving robots a global sense of time.
arXiv: https://arxiv.org/abs/2603.01465
GitHub: https://github.com/TJ-Spatial-Intelligence-Lab/KC-VLA
Temporal Abstraction and Decoupled Action Generation
KC‑VLA separates temporal abstraction from action generation. Instead of processing redundant continuous video, it builds a sparse, non‑uniform input sequence by concatenating selected keyframes with the current visual observation, enabling a flow‑matching action policy that anchors decisions to the global task timeline.
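To make the sparse, non‑uniform input concrete, here is a minimal sketch of a keyframe buffer feeding the policy; the buffer size, tensor shapes, and class names are illustrative assumptions, not the authors' released code.

```python
import torch

class SparseHistoryBuffer:
    """Caches selected keyframes so the policy sees a sparse, non-uniform history."""

    def __init__(self, max_keyframes: int = 8):
        self.max_keyframes = max_keyframes
        self.frames = []      # keyframe embeddings, oldest first
        self.timestamps = []  # step indices, so decisions can anchor to the task timeline

    def add(self, frame_emb: torch.Tensor, step: int):
        self.frames.append(frame_emb)
        self.timestamps.append(step)
        # Drop the oldest keyframe once the buffer is full.
        if len(self.frames) > self.max_keyframes:
            self.frames.pop(0)
            self.timestamps.pop(0)

    def build_input(self, current_obs: torch.Tensor) -> torch.Tensor:
        # Concatenate cached keyframes with the current observation into one
        # irregular token sequence: [k_1, ..., k_m, o_t].
        return torch.stack(self.frames + [current_obs], dim=0)
```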
Lightweight Keyframe Selection Module (KSM)
The KSM acts as an online visual‑stream filter. It employs a two‑stage training and deployment protocol; illustrative sketches of both stages follow the list:
Stage 1 – Multi‑Task Metric Learning & Tri‑modal Negative Sampling: Triplet margin loss with three types of negatives (temporal neighbors, intra‑task phase negatives, inter‑task negatives) forces the visual encoder to learn discriminative, task‑agnostic features.
Stage 2 – Task‑Modulated Query & Greedy Temporal Smoothing: During inference, FiLM generates dynamic query parameters from the task ID, and a greedy temporal‑smoothing mechanism validates high‑confidence frames before adding them to a sparse history buffer.
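As a rough illustration of Stage 1, the snippet below sets up a triplet margin loss averaged over the three negative types named above; the negative‑sampling helpers, margin value, and encoder interface are placeholders rather than the paper's training code.

```python
import torch
import torch.nn.functional as F

def trimodal_triplet_loss(encoder, anchor, positive,
                          temporal_neg, phase_neg, task_neg,
                          margin: float = 0.5) -> torch.Tensor:
    """Triplet margin loss averaged over the three negative types:
    temporal neighbors, intra-task phase negatives, inter-task negatives."""
    z_a, z_p = encoder(anchor), encoder(positive)
    loss = 0.0
    for neg in (temporal_neg, phase_neg, task_neg):
        z_n = encoder(neg)
        # Each negative type pushes the embedding space apart along a
        # different axis of confusability.
        loss = loss + F.triplet_margin_loss(z_a, z_p, z_n, margin=margin)
    return loss / 3.0
```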
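Stage 2 combines two mechanisms: FiLM layers map the task ID to per‑channel scale and shift parameters that modulate the relevance query, and a greedy smoothing rule only commits a frame to the sparse history buffer after its confidence stays high for several consecutive steps. A minimal sketch, with the threshold, patience, and layer sizes invented for illustration:

```python
import torch
import torch.nn as nn

class TaskModulatedScorer(nn.Module):
    """FiLM-style modulation: the task embedding produces scale/shift
    parameters that condition the keyframe-relevance query."""

    def __init__(self, feat_dim: int = 256, num_tasks: int = 32):
        super().__init__()
        self.task_emb = nn.Embedding(num_tasks, 64)
        self.film = nn.Linear(64, 2 * feat_dim)  # -> (gamma, beta)
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frame_feat: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.film(self.task_emb(task_id)).chunk(2, dim=-1)
        modulated = gamma * frame_feat + beta
        return torch.sigmoid(self.score(modulated)).squeeze(-1)  # confidence in [0, 1]

class GreedySmoother:
    """Accept a keyframe only after `patience` consecutive high-confidence
    hits, filtering out one-off spikes in the confidence signal."""

    def __init__(self, threshold: float = 0.9, patience: int = 3):
        self.threshold, self.patience, self.streak = threshold, patience, 0

    def update(self, confidence: float) -> bool:
        self.streak = self.streak + 1 if confidence > self.threshold else 0
        return self.streak >= self.patience
```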
Action Generation from Sparse Semantic History
Indexed keyframes are cached and structurally concatenated with the current observation, forming an irregular input sequence for a flow‑matching policy network. This design resolves non‑Markovian ambiguities by grounding actions in the full temporal context.
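For intuition on the flow‑matching side, below is a generic Euler‑integration sampler of the kind such policies use at inference time; the velocity network, conditioning interface, and step count are assumptions, not KC‑VLA's actual architecture.

```python
import torch

@torch.no_grad()
def sample_action(velocity_net, context: torch.Tensor,
                  action_dim: int = 7, steps: int = 10) -> torch.Tensor:
    """Integrate a learned velocity field from noise to an action.

    `context` is the fused sparse history (keyframes + current observation);
    `velocity_net(x, t, context)` predicts dx/dt at interpolation time t.
    """
    x = torch.randn(1, action_dim)                # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)              # scalar time in [0, 1)
        x = x + velocity_net(x, t, context) * dt  # Euler step along the flow
    return x                                      # final sample is the action
```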
Long‑Term Memory Benchmark
Existing benchmarks (e.g., CALVIN, LIBERO‑Long) lack true non‑Markovian challenges or have short horizons. The authors construct a Memory‑Dependence Benchmark using the ManiSkill simulator, featuring tasks with an average length of 550 steps, strict non‑Markovian constraints, and heavy reliance on historical information.
Spatial Reconfiguration: Reorder shuffled blocks solely from memory.
Temporal Sequencing: Move objects in a precise red‑green‑blue order.
Counting & Latency: Act only after the second flash of a random signal.
Identity Tracking: Track and retrieve a specific identical‑looking block after rapid shuffling.
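The Counting & Latency family shows why these tasks are strictly non‑Markovian: no single frame reveals how many flashes have already occurred. A hypothetical wrapper in the spirit of the benchmark (the environment event flags here are simplified placeholders, not the released ManiSkill task code):

```python
class CountingLatencyTask:
    """Reward is granted only if the agent acts after the *second* flash.
    The flash count lives purely in history, so any single frame is ambiguous."""

    def __init__(self, env):
        self.env = env
        self.flash_count = 0  # hidden state the observation does not expose

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if info.get("signal_flashed"):   # assumed event flag, for illustration
            self.flash_count += 1
        # Acting before the second flash fails; acting after it succeeds.
        if info.get("agent_acted"):
            reward = 1.0 if self.flash_count >= 2 else 0.0
            done = True
        return obs, reward, done, info
```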
Simulation and Real‑World Experiments
In simulation, KC‑VLA achieves a 92.0% overall success rate, far surpassing the strongest long‑window baseline (Long‑term GR00T at 57.0%). Under various visual disturbances, the KSM maintains an F1 score above 95%, and overall success stays between 71.5% and 88.5%.
On a real‑world AgileX Piper robotic arm, fine‑tuned with only 50 expert demonstrations, KC‑VLA attains a 48.75% success rate and 75.3% stage completion, dramatically outpacing Diffusion Policy (3.8% success) and GR00T (6.3% success).