Tagged articles

benchmark evaluation

32 articles · Page 1 of 1
Design Hub
Jun 29, 2026 · Artificial Intelligence

When AI Starts Getting Real Work Done, Are We Ready to Evaluate It?

The article analyzes recent AI updates—from DeepSeek's DSpark inference boost and FlashAttention‑4's kernel redesign to Codex UI tweaks and design‑mode tools—arguing that the competition is shifting from answering questions to actually completing tasks, and it highlights three layers of progress, evaluation challenges, and the practical questions we must now ask of AI agents.

AI · DeepSeek · Design Tools
0 likes · 19 min read
PaperAgent
Jun 20, 2026 · Artificial Intelligence

Vertical Domain Agents Gain 88.5% Boost by Adapting the Runtime Interface, Not Retraining

The paper shows that many failures of deterministic LLM agents stem from mismatched model‑environment interfaces, and introduces LIFE‑HARNESS—a four‑layer runtime harness that extracts reusable failure patterns from training trajectories without updating model weights, delivering an average 88.5% relative performance gain across 126 model‑environment settings.

Deterministic Agents · LLM Agents · Life-Harness
0 likes · 8 min read
Machine Learning Algorithms & Natural Language Processing
Jun 17, 2026 · Artificial Intelligence

How OmniVideo-100K Generates High‑Quality Audio‑Video Training Data for Better Multimodal Understanding

The article analyzes why existing audio‑video QA pipelines break narrative continuity, proposes a structured‑script and evidence‑chain approach to automatically build the OmniVideo-100K dataset of 100K high‑quality QA pairs, and shows that fine‑tuning open‑source multimodal models on this data yields consistent accuracy gains across multiple benchmarks.

OmniVideo-100K · audio-video dataset · benchmark evaluation
0 likes · 12 min read
Machine Learning Algorithms & Natural Language Processing
Jun 14, 2026 · Artificial Intelligence

Deep Pre-Alignment (DPA): Tsinghua’s New VLM Architecture Aligns Vision Before Language Understanding

The paper introduces Deep Pre‑Alignment (DPA), a novel Vision‑Language Model architecture that inserts a perceiver VLM to pre‑align visual features with the LLM’s text space, reducing alignment cost, preserving language ability, and delivering consistent multimodal performance gains across multiple benchmarks with minimal inference overhead.

Deep Pre-Alignment · LLM · Multimodal Learning
0 likes · 10 min read
Machine Heart
Jun 11, 2026 · Artificial Intelligence

Agent‑Driven Newton Toolbox: A New Paradigm for Grounded Video Generation

NEWTON introduces an agent‑centric framework that augments existing video generators with a planner, physics‑aware tools, and a verification loop, enabling multi‑round refinement and significantly improving physical consistency on benchmarks without retraining the underlying generator.

agentic AI · benchmark evaluation · physics grounding
0 likes · 8 min read
Data Party THU
Jun 11, 2026 · Artificial Intelligence

Boost 18 LLM Agents Without Retraining Using LIFE‑HARNESS

The article introduces LIFE‑HARNESS, a runtime‑interface adaptation framework that keeps model weights unchanged, extracts reusable failure patterns from a single model's training trace, and achieves an average 88.5% relative performance gain across 18 LLM agents and 7 deterministic environments, with successful transfer to 17 other models.

LLM Agents · Runtime Harness · benchmark evaluation
0 likes · 8 min read
Alibaba Cloud Native
Jun 8, 2026 · Artificial Intelligence

Code Harness vs. Model-Driven Harness: Can Agent Control Be Expressed as Executable Natural Language?

The article reviews the "Natural-Language Agent Harnesses" paper, explains the distinction between code, middleware, and harness layers for LLM agents, introduces NLAH and IHR concepts, and details experimental evaluations that show natural‑language harnesses can match code‑based control while exposing new trade‑offs and risks.

Intelligent Harness Runtime · LLM Agents · Module Ablation
0 likes · 13 min read
Machine Heart
Jun 5, 2026 · Artificial Intelligence

Building More Realistic Mobile Agent Worlds for Large‑Scale Training

The article examines the PhoneWorld project, which reconstructs realistic Android app environments from user interaction traces to create scalable, resettable, and verifiable mock apps, enabling large‑scale training and evaluation of Mobile Agents with demonstrated performance gains across multiple benchmarks.

Large‑Scale Training · Mobile Agent · PhoneWorld
0 likes · 12 min read
PaperAgent
Jun 4, 2026 · Artificial Intelligence

SkillOpt: Enabling Self‑Evolving Agent Skills via Text‑Space Optimization

SkillOpt reframes LLM agent skills as trainable external state, applying a deep‑learning‑style optimizer to systematically improve skill documents, and demonstrates across six benchmarks, seven models, and three execution modes that this approach yields consistent, large gains and robust transferability.

Agent Skills · Self-Evolving Agents · SkillOpt
0 likes · 12 min read
Machine Heart
Jun 1, 2026 · Artificial Intelligence

Thought-Aligner: Enabling Agents to Think Twice Before Acting

Thought-Aligner introduces a lightweight, plug‑in safety layer that corrects unsafe reasoning in AI agents during the millisecond window between thought generation and action execution, dramatically improving behavioral safety while preserving task usefulness across benchmark and real‑world deployments.

AI safety · agent alignment · benchmark evaluation
0 likes · 11 min read
Machine Heart
May 31, 2026 · Artificial Intelligence

Microsoft’s SkillOpt Turns Agent Skill Docs into Trainable Parameters for Self‑Evolving AI

Microsoft’s newly open‑sourced SkillOpt framework treats an agent’s skill document as external weights, applying a rollout‑reflect‑edit‑gate training loop with textual learning rates and rejected‑edit buffers, enabling self‑evolving skills that achieve optimal or tied‑optimal results across 52 model‑benchmark‑environment combinations.

AI agents · Microsoft · SkillOpt
0 likes · 12 min read
SuanNi
May 27, 2026 · Artificial Intelligence

Can Agent Skills Be Trained Like Neural Networks? SkillOpt Demonstrates Success

SkillOpt treats an agent’s Skill document as a trainable external state, applying classic deep‑learning tools such as epochs, batch size, learning rate and validation gating, and in experiments across 52 benchmark units it lifts GPT‑5.5 performance by an average of 23.5 points while enabling cross‑model and cross‑environment transfer with no additional inference cost.

Agent Skill · Deep Learning Optimization · LLM
0 likes · 11 min read
AI Engineering
May 26, 2026 · Artificial Intelligence

Training Only the Skill Document While Keeping Model Weights Frozen (SkillOpt)

Microsoft Research introduces SkillOpt, a method that freezes large‑model weights and instead trains a natural‑language skill document as the sole learnable parameter, using a rollout‑reflect‑edit‑gate loop, achieving optimal results across 52 benchmark‑model‑environment combinations and demonstrating strong transferability.

LLM Agents · SkillOpt · benchmark evaluation
0 likes · 9 min read
Machine Learning Algorithms & Natural Language Processing
May 24, 2026 · Artificial Intelligence

The First Visual‑Language Parallel Thinking Framework: Unpacking Its Core Mechanisms

The paper introduces Visual Para-Thinker, a parallel‑thinking framework for large‑scale visual‑language models that uses visual‑centered block and scan path partitions, Path‑aware Attention and Learnable Parallel Rotary Position Embedding, and demonstrates consistent gains across counting, visual search, hallucination and grounding benchmarks.

LPRoPE · Multimodal AI · Pa-Attention
0 likes · 11 min read
Machine Heart
May 18, 2026 · Artificial Intelligence

ICML 2026: From Single‑Threaded Thinking to Native Parallel Reasoning in Agents

The paper introduces Native Parallel Reasoner (NPR), a framework that lets language agents generate and maintain multiple reasoning paths using a three‑stage self‑distillation and parallel reinforcement‑learning training paradigm, achieving up to 4.6× speedup and significant accuracy gains across eight reasoning benchmarks.

AI reasoning · Large Language Models · Native Parallel Reasoner
0 likes · 18 min read
PaperAgent
May 13, 2026 · Artificial Intelligence

One-for-All Multi-Agent Collaboration: Adaptive Cross-Task Topology Design

The paper introduces OFA-MAS, a one‑for‑all multi‑agent system that learns a universal topology designer using task‑aware graph encoding and a Mixture‑of‑Experts generator, achieving superior performance, OOD generalization, robustness, and efficiency across six major benchmarks.

LLM · Mixture of Experts · Multi-Agent Systems
0 likes · 14 min read
Machine Heart
May 12, 2026 · Artificial Intelligence

DECS Cuts Overthinking in Models: Halve Inference Tokens and Raise Accuracy

DECS, a novel training framework introduced by researchers from Fudan, Shanghai Jiao Tong, and the Shanghai AI Lab, theoretically exposes the flaws of length‑penalty rewards and, through token‑level reward decoupling and dynamic batch scheduling, reduces inference token counts by over 50% while improving accuracy across multiple benchmarks.

DECS · Large Language Models · benchmark evaluation
0 likes · 9 min read
Machine Heart
May 5, 2026 · Artificial Intelligence

Agent-World: Scaling Real-World Environments for Co‑Evolving Agents and Their Worlds

Agent-World introduces a universal training arena that automatically mines real‑world data from the internet to build over 1,900 diverse environments and 19,800 tools, then generates long‑horizon tasks through graph‑based and programmatic synthesis, creating a self‑evolving loop where agents are evaluated, diagnosed, and the environment is refined, achieving state‑of‑the‑art results on 23 benchmarks.

AI agents · Agent-World · Large‑Scale Training
0 likes · 14 min read
Bighead's Algorithm Notes
Apr 14, 2026 · Artificial Intelligence

How Self‑Supervised HINTS Extracts Human Insights from Time Series to Boost Forecast Accuracy

The paper introduces HINTS, a two‑stage self‑supervised framework that leverages Friedkin‑Johnsen opinion dynamics to mine latent human‑driven factors from time‑series residuals, integrates them via attention into state‑of‑the‑art predictors, and demonstrates consistent accuracy gains and interpretability across nine benchmark and real‑world datasets.

Attention Mechanism · Friedkin-Johnsen model · Time Series Forecasting
0 likes · 17 min read
SuanNi
Apr 3, 2026 · Artificial Intelligence

How GEMS Lets a 6B Open‑Source Model Beat Top Closed‑Source Image Generators

The article presents the GEMS (Agent‑Native Multimodal Generation with Memory and Skills) framework, detailing its multi‑agent loop, hierarchical memory compression, on‑demand skill modules, and extensive benchmark results that show a lightweight 6B model surpassing larger proprietary systems on complex image‑generation tasks.

GEMS · Memory compression · Multimodal AI
0 likes · 14 min read
SuanNi
Mar 20, 2026 · Artificial Intelligence

How XSKILL Lets Multimodal AI Agents Learn Without Updating Parameters

XSKILL introduces a dual‑stream framework that separates task‑level skills stored as Markdown and action‑level experiences stored as JSON, enabling multimodal large language model agents to continuously improve by extracting, summarizing, and reusing knowledge from past trajectories without modifying model parameters, achieving significant gains across visual tool, multimodal search, and integrated benchmarks.

Agent framework · Multimodal AI · benchmark evaluation
0 likes · 12 min read
Instant Consumer Technology Team
Dec 18, 2025 · Artificial Intelligence

How a Multi‑Agent Framework Boosts Graph Chain‑of‑Thought Reasoning Efficiency

The paper introduces GLM, a multi‑agent Graph‑CoT framework with an optimized LLM serving architecture that dramatically improves accuracy, reduces token consumption, lowers latency, and increases throughput across diverse domains, as demonstrated by extensive GRBench evaluations.

LLM Optimization · benchmark evaluation · graph reasoning
0 likes · 10 min read
AntTech
Oct 14, 2025 · Artificial Intelligence

How Ring-1T Achieves Trillion-Scale Deep Thinking and Competitive Benchmarks

The Ring-1T model, a trillion-parameter AI system released as open source, leverages advanced reinforcement learning techniques, extensive benchmark evaluations, and custom training frameworks to deliver balanced performance across math, code, reasoning, and creative tasks while highlighting current limitations and future development plans.

AI Model · benchmark evaluation · deep reasoning
0 likes · 8 min read
DataFunTalk
Jun 17, 2025 · Artificial Intelligence

MiniMax M1: Open‑Source LLM That Rivals Gemini 2.5 Pro in Long‑Context Benchmarks

MiniMax’s newly released open‑source M1 model, built on the Lightning Attention‑enhanced MiniMax‑01 base, delivers up to 1 million token context, achieves near‑state‑of‑the‑art performance on MRCR and other long‑context benchmarks, and showcases impressive multilingual translation, code completion, and creative applications.

Lightning Attention · MiniMax · benchmark evaluation
0 likes · 11 min read
AI Frontier Lectures
Apr 6, 2025 · Artificial Intelligence

Can Multi‑Round Thinking Boost LLM Accuracy Without Extra Training?

A new study from the a‑m‑team introduces “Think Twice”, a test‑time multi‑round reasoning technique that, without additional training or model changes, repeatedly prompts large language models to self‑correct, yielding notable accuracy gains across benchmarks such as AIME, MATH‑500, GPQA‑Diamond and LiveCodeBench, while also producing shorter, more confident answers.

Artificial Intelligence · LLM · Multi-round reasoning
0 likes · 6 min read
Alibaba Cloud Big Data AI Platform
Mar 29, 2025 · Artificial Intelligence

How DistilQwen2.5‑R1 Boosts Small‑Model Reasoning with Innovative Knowledge Distillation

The article introduces the DistilQwen2.5‑R1 series, which leverages a novel knowledge‑distillation pipeline—including CoT data evaluation, improvement, and validation—to transfer deep reasoning abilities from large models like DeepSeek‑R1 to compact models, achieving superior performance across math, code, and scientific benchmarks and providing open‑source checkpoints and deployment guides for practical use.

AI inference · Knowledge Distillation · Large Language Models
0 likes · 17 min read
Baobao Algorithm Notes
Jun 28, 2024 · Artificial Intelligence

What Makes Gemma 2 a Competitive Open‑Source LLM? Architecture, Training, and Evaluation Insights

The article provides a detailed technical overview of Gemma 2, covering its decoder‑only transformer design, novel attention mechanisms, logit soft‑capping, RMSNorm, knowledge‑distillation training on trillions of tokens, extensive pre‑training infrastructure, and benchmark evaluations that demonstrate its competitiveness against larger proprietary models.

AI · Gemma 2 · benchmark evaluation
0 likes · 14 min read
AntTech
Apr 17, 2024 · Artificial Intelligence

LLMRG: Improving Recommendations through Large Language Model Reasoning Graphs

LLMRG introduces a novel framework that leverages large language models to construct personalized reasoning graphs, integrating chain reasoning, self‑verification, divergent extension, and knowledge‑base self‑improvement, thereby enhancing recommendation accuracy, interpretability, and performance across multiple benchmark datasets without additional user or item information.

AI · Large Language Models · Recommendation Systems
0 likes · 9 min read