13 Must-Read Agent Papers from Meituan for ICML'26

This article presents a curated list of thirteen recent research papers on generalist agents, along with brief abstracts and links to the PDFs, for the upcoming Meituan ICML'26 sharing sessions. The papers cover visual memory, environment synthesis, value modeling, self‑verification, robustness benchmarks, high‑resolution video generation, long‑horizon world models, and alignment fine‑tuning.

PaperAgent

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

Abstract: Long‑horizon agent reasoning requires compressing ever‑growing interaction histories into a limited context window. Existing memory systems serialize history as text, incurring uniform token‑level overhead that grows linearly with length. MemOCR proposes a multimodal memory agent that allocates memory density adaptively via visual layout, improving long‑context reasoning under tight budgets. It outperforms strong text baselines on multi‑hop and single‑hop QA benchmarks and achieves more efficient context usage under extreme budget constraints.

ScaleEnv: Scaling Environment Synthesis from Scratch for Generalist Interactive Tool‑Use Agent Training

Abstract: Equipping agents with interactive environments and verifiable tasks is crucial for training generalist agents that can adapt to diverse scenarios. ScaleEnv introduces a framework that builds fully interactive environments and verifiable tasks from scratch. Programmatic testing guarantees environment reliability, while tool‑dependency graphs and executable‑action verification ensure task completeness. Experiments on unseen multi‑round tool‑use benchmarks show significant performance gains, highlighting strong generalization.
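
As a rough illustration of how a tool‑dependency graph can support executable‑action verification, the sketch below checks that every action in a task has its prerequisites satisfied and derives one valid reference plan. The tool names, the `TOOL_DEPS` graph, and both helper functions are hypothetical and are not taken from ScaleEnv.

```python
from graphlib import TopologicalSorter

# Hypothetical tool-dependency graph: each tool maps to the set of tools
# whose outputs it needs. Names are illustrative, not from ScaleEnv.
TOOL_DEPS = {
    "search_flights": set(),
    "select_flight": {"search_flights"},
    "pay": {"select_flight"},
    "send_receipt": {"pay"},
}

def is_executable(actions, deps=TOOL_DEPS):
    """Return True if every action's dependencies were executed earlier."""
    done = set()
    for tool in actions:
        if tool not in deps or not deps[tool] <= done:
            return False
        done.add(tool)
    return True

def reference_plan(goal, deps=TOOL_DEPS):
    """Build one valid action sequence that ends by calling `goal`."""
    needed, stack = set(), [goal]
    while stack:  # collect the goal and all of its transitive prerequisites
        tool = stack.pop()
        if tool not in needed:
            needed.add(tool)
            stack.extend(deps[tool])
    subgraph = {tool: deps[tool] for tool in needed}
    # static_order() yields prerequisites before the tools that need them
    return list(TopologicalSorter(subgraph).static_order())
```

A synthesized task could then be kept only if `is_executable(reference_plan(goal))` holds, which is one simple way to rule out tasks that no action sequence can complete.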

V_0: A Generalist Value Model for Any Policy at State Zero

Abstract: Value models in RL‑trained LLMs suffer from a coupling dilemma: they must be trained synchronously with the evolving policy. V_0 decouples value estimation from specific policy parameters by redefining tasks as context‑learning problems, enabling prediction of unseen policy performance. Experiments show V_0 tracks policy evolution better than coupled value models during GRPO training, optimizes cold‑start budget allocation, and approaches the performance‑cost Pareto frontier in inference routing.

Learning to Self‑Verify Makes Language Models Better Reasoners

Abstract: Recent LLMs excel at generating promising reasoning paths for complex tasks but remain weak at verifying their own answers. Introducing self‑verification improves generation performance and yields more efficient reasoning trajectories. The authors propose a multitask reinforcement‑learning framework that jointly optimizes generation and self‑verification as complementary objectives. Experiments demonstrate superior generation and verification capabilities over generation‑only baselines.

AgentNoiseBench: Benchmarking Robustness of Tool‑Using LLM Agents Under Noisy Conditions

Abstract: As LLM‑based agents are deployed in real workflows, existing benchmarks fail to capture robustness under imperfect user instructions and unreliable tool feedback. AgentNoiseBench provides a systematic framework that injects user‑side instruction noise and tool‑side result noise, offering modular noise‑injection pipelines and multidimensional evaluation metrics. Evaluation of 25 tool‑using models reveals tool‑side noise typically causes larger performance drops than user‑side noise.
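
The two noise sources described above can be sketched as small wrappers around the user instruction and the tool call. This is a minimal sketch, not the benchmark's pipeline: character swaps and simulated failures are assumed noise types chosen for illustration, and `add_typos`, `noisy_tool`, and `get_weather` are hypothetical names.

```python
import random

def add_typos(text, rate, rng):
    """User-side noise: swap adjacent characters with probability `rate`."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

def noisy_tool(tool, failure_rate, rng):
    """Tool-side noise: wrap a tool so some calls return an error payload."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            return {"error": "simulated tool failure"}
        return tool(*args, **kwargs)
    return wrapped

def get_weather(city):
    """Hypothetical tool used only for this example."""
    return {"city": city, "temp_c": 21}

rng = random.Random(0)  # fixed seed so a run is reproducible
instruction = add_typos("book a table for two at seven", rate=0.1, rng=rng)
flaky_weather = noisy_tool(get_weather, failure_rate=0.3, rng=rng)
results = [flaky_weather("Beijing") for _ in range(5)]
```

Keeping the two injectors separate makes it possible to vary user‑side and tool‑side noise independently and compare the resulting performance drops.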

AJ‑Bench: Benchmarking Agent‑as‑a‑Judge for Environment‑Aware Evaluation

Abstract: Scaling LLM‑based agent training raises challenges for reliable behavior verification in complex environments. Existing rule‑based validators or LLM‑as‑Judge models struggle to generalize beyond narrow domains. Agent‑as‑a‑Judge interacts actively with environments and tools to gather verifiable evidence. AJ‑Bench evaluates this capability across search, data systems, and GUIs, covering 155 tasks and 516 annotated trajectories. Experiments show stable performance gains over LLM‑as‑Judge baselines while highlighting remaining open challenges.

LUVE: Latent‑Cascaded Ultra‑High‑Resolution Video Generation with Dual Frequency Experts

Abstract: To reconcile coherence and compute cost in ultra‑high‑resolution video generation, LUVE introduces a dual‑frequency expert latent‑cascading framework. It uses a three‑stage architecture: low‑resolution generation for motion consistency, latent up‑sampling for resolution boost with reduced memory, and high/low‑frequency expert fusion for semantic and detail refinement. Experiments demonstrate superior realism and fidelity, and the core idea has been applied to Meituan’s LongCat‑Video model.

Infinite‑World: Scaling Interactive World Models to 1000‑Frame Horizons via Pose‑Free Hierarchical Memory

Abstract: Infinite‑World targets long‑horizon interactive world models for real‑scene video. It tackles pose noise and sparse view‑revisits by (1) compressing history latents into a fixed‑budget pose‑free hierarchical memory, (2) adding uncertainty‑aware action annotation for noisy trajectories, and (3) fine‑tuning with high‑revisit data to enhance loop‑closure. The result is stable visual memory and action response over 1000+ frames.

WildActor: Unconstrained Identity‑Preserving Video Generation

Abstract: WildActor addresses inconsistencies in dynamic long‑shot and extreme‑view‑change video generation. It builds a 1.6M video / 18M multi‑view image dataset (Actor‑18M) to mitigate frontal‑bias, introduces Asymmetric Identity‑Preserving Attention (AIPA) to decouple identity from motion, and employs Identity‑Aware 3D Rotational Positional Encoding (I‑ROPE) to separate spatio‑temporal tokens. Experiments on the new Actor‑Bench show superior full‑body consistency, text alignment, and physical constancy over existing models.

Navigating the Pareto Frontier of Alignment: Spectrum‑Adaptive Fine‑Tuning for LLMs (SAFT)

Abstract: Standard supervised fine‑tuning (SFT) optimizes cross‑entropy, a smooth proxy for accuracy, but can overfit noise and be over‑confident. Direct Fine‑Tuning (DFT) optimizes a smooth approximation of accuracy, improving robustness at the cost of learning efficiency on hard samples. SAFT proposes a lightweight pre‑test: train SFT and DFT on a small subset, compare validation performance, then select geometric interpolation (Geo‑SAFT) for high‑SNR data or harmonic interpolation (Har‑SAFT) for low‑SNR data. This adaptive interpolation yields a better robustness‑efficiency trade‑off than linear interpolation.
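
For reference, the three interpolation families named above can be written as weighted means of two positive quantities. The abstract does not say which quantities SAFT interpolates (for example losses or per‑sample weights), so the symbols $a$, $b$, and $\lambda$ below are placeholders and this is only the generic definition, not the paper's formulation.

```latex
\begin{align*}
  M_{\mathrm{lin}}(a, b; \lambda) &= \lambda a + (1 - \lambda)\, b, \\
  M_{\mathrm{geo}}(a, b; \lambda) &= a^{\lambda}\, b^{\,1 - \lambda}, \\
  M_{\mathrm{har}}(a, b; \lambda) &= \left( \frac{\lambda}{a} + \frac{1 - \lambda}{b} \right)^{-1},
  \qquad a, b > 0,\ \lambda \in [0, 1].
\end{align*}
% Weighted AM-GM-HM inequality:
\[
  M_{\mathrm{har}} \le M_{\mathrm{geo}} \le M_{\mathrm{lin}},
  \quad \text{with equality when } a = b \text{ or } \lambda \in \{0, 1\}.
\]
% Worked example with a = 1, b = 4, lambda = 1/2:
\[
  M_{\mathrm{lin}} = 2.5, \qquad M_{\mathrm{geo}} = 2, \qquad M_{\mathrm{har}} = 1.6 .
\]
```

The ordering shows that, for the same $\lambda$, the geometric and harmonic forms pull the result toward the smaller of the two quantities, with the harmonic form pulling hardest.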

TRIP‑Bench: A Benchmark for Long‑Horizon Interactive Agents in Real‑World Scenarios

Abstract: TRIP‑Bench introduces a travel‑planning benchmark for long‑horizon agents, built from real‑world data with 18 tools and 40+ travel constraints. It tests global constraint maintenance, tool invocation, changing user needs, and iterative plan revisions over up to 15 dialogue rounds, 150+ tool calls, and >200k tokens. Existing models show limited performance on it; the authors propose GTPO, a multi‑round RL method that improves robustness, allowing Qwen2.5‑32B‑Instruct to surpass Gemini‑3‑Pro.

InfVSR: Toward Consistency‑Driven Streaming Generative Video Super‑Resolution

Abstract: InfVSR tackles low inference efficiency, high memory usage, and temporal inconsistency of diffusion‑based video super‑resolution. It converts a pretrained video DiT into a causal streaming architecture, adds a rolling KV cache for smooth local transitions, and injects global semantic anchors via cross‑attention. Training combines block‑wise pixel supervision with cross‑block distribution matching, and distills the diffusion process into single‑step inference. Experiments show SOTA performance, 58× faster inference, and constant memory for long sequences.
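
To make the constant‑memory claim concrete, here is a minimal sketch of a rolling cache that keeps entries for only the most recent blocks. It is an assumption‑laden toy: `RollingKVCache`, `stream`, and the `window` parameter are hypothetical, and strings stand in for the key/value tensors a real video DiT would store.

```python
from collections import deque

class RollingKVCache:
    """Keep key/value entries for only the most recent `window` blocks."""

    def __init__(self, window):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        # deque(maxlen=...) drops the oldest entry automatically when full
        self.keys.append(key)
        self.values.append(value)

    def context(self):
        return list(self.keys), list(self.values)

def stream(blocks, window):
    """Process blocks in order; each sees at most `window` earlier blocks."""
    cache = RollingKVCache(window)
    seen = []
    for block in blocks:
        past_keys, _ = cache.context()
        seen.append(past_keys)  # stand-in for causal attention over the cache
        cache.append(f"k{block}", f"v{block}")  # stand-in for this block's K/V
    return seen, len(cache.keys)

seen, cache_size = stream(range(5), window=2)
```

However long the input is, the cache never holds more than `window` entries, which is why memory stays flat as the sequence grows.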

DRIVE: Distributional and Retrieval‑Augmented Bidding with Value Evaluation

Abstract: Standard Decision Transformers (DT) face three issues in complex bidding: average‑action trap, long‑tail hallucination, and lack of reasoning optimization. DRIVE proposes a generate‑retrieve‑evaluate loop: (1) replace deterministic output with a Gaussian mixture model to avoid policy collapse, (2) add a retrieval mechanism to enhance long‑tail memory and prevent hallucination, (3) employ an IQL critic for real‑time evaluation of generated and historical actions. This framework markedly improves decision robustness.
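
The average‑action trap and the generate‑retrieve‑evaluate loop can be illustrated with a toy example. In the sketch below, the two‑mode bid mixture, its parameters, and the `critic` callable (a stand‑in for the IQL critic) are all assumptions made for illustration and do not come from the paper.

```python
import math
import random

# Hypothetical two-component mixture over bid values: (weight, mean, std)
MIXTURE = [(0.5, 1.0, 0.1), (0.5, 5.0, 0.1)]

def mixture_mean(mixture):
    """The single action a deterministic, mean-regressing head would output."""
    return sum(w * mu for w, mu, _ in mixture)

def density(x, mixture):
    """Probability density of the mixture at x."""
    return sum(
        w * math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        for w, mu, sigma in mixture
    )

def sample_bid(mixture, rng):
    """Generate: pick a component by weight, then sample from its Gaussian."""
    weights = [w for w, _, _ in mixture]
    _, mu, sigma = rng.choices(mixture, weights=weights, k=1)[0]
    return rng.gauss(mu, sigma)

def select_action(generated, retrieved, critic):
    """Evaluate: score generated and retrieved candidates, keep the best."""
    return max(generated + retrieved, key=critic)

rng = random.Random(0)
generated = [sample_bid(MIXTURE, rng) for _ in range(4)]
retrieved = [4.8]  # hypothetical historical action from a retrieval step
best = select_action(generated, retrieved, critic=lambda bid: -abs(bid - 5.0))
```

Here `mixture_mean(MIXTURE)` is 3.0, a bid that almost never occurs under the mixture, whereas sampled bids land near one of the two modes; that gap is the intuition behind replacing a deterministic output with a mixture.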
