Tagged articles
73 articles
FunTester
May 17, 2026 · Artificial Intelligence

How a Rubric‑Driven Agent Achieves More Stable Outputs

The article explains why vague expectations cause unstable Agent results, introduces Rubric as a concrete, pre‑written scoring standard for Generator‑Critic workflows, details how to design clear Yes/No criteria, organize them into Must/Should/Nice‑to‑have layers, and iteratively refine the Rubric for reliable AI output.

AI Evaluation · Agent · Critic
0 likes · 8 min read
Old Zhang's AI Learning
May 16, 2026 · Artificial Intelligence

Can Your PC Run Large Language Models? Meet BenchLoop, the Local Benchmarking Tool

BenchLoop is a CLI‑plus‑Web application that lets you reproducibly benchmark locally‑run LLMs across seven suites—including speed, tool‑calling, coding and agent tasks—while recording hardware details, scoring results with a weighted formula, and optionally publishing them to a public leaderboard.

AI Evaluation · BenchLoop · LLM benchmarking
0 likes · 14 min read
AI Engineering
May 7, 2026 · Artificial Intelligence

Can Large Language Models Rebuild Complex Systems? ProgramBench’s Harsh Verdict

A Stanford NLP benchmark called ProgramBench tested 200 real‑world codebases and found that current large language models, including Claude and GPT‑5, achieve near‑zero success in reconstructing full systems like SQLite, FFmpeg, and a PHP compiler from binaries alone.

AI Evaluation · ProgramBench · code generation benchmark
0 likes · 4 min read
AI Explorer
May 2, 2026 · Artificial Intelligence

How a New AI Probe Can Reverse‑Engineer LLM Parameter Counts

Researcher Li Bojie’s “Uncompressible Knowledge Probe” uses random, black‑box API queries to gauge how much irreducible knowledge a large language model retains, allowing an indirect estimate of its effective parameter count and prompting a broader debate on model evaluation and transparency.

AI Evaluation · LLM · black-box testing
0 likes · 5 min read
Old Zhang's AI Learning
Apr 29, 2026 · Artificial Intelligence

Top 10 Open‑Source LLM Benchmarks: Scores, Rankings, and What They Test

This article walks through ten mainstream open‑source large‑model benchmarks—SWE‑bench Verified and Pro, MMLU‑Pro, GPQA Diamond, HLE, AIME, HMMT, olmOCR‑bench, Terminal‑Bench 2.0, and EvasionBench—explaining their data, evaluation metrics, current leading models, and the capability dimensions they reveal.

AI Evaluation · LLM benchmarks · MMLU-Pro
0 likes · 20 min read
ZhiKe AI
Apr 28, 2026 · Artificial Intelligence

Demystifying DeepSeek‑V4 Benchmarks with Real‑World Data

This article breaks down DeepSeek‑V4's six core capability categories—knowledge, reasoning, programming, math, long‑context, and agent—showing how each benchmark works, presenting concrete scores that place V4 first or second against leading models, and explaining the hidden efficiency gains that make V4 up to 13.7× cheaper to run.

AI Evaluation · Benchmark · DeepSeek-V4
0 likes · 14 min read
AI Tech Publishing
Apr 27, 2026 · Artificial Intelligence

Why Build Your Own AI Evaluation Harness? 7 OpenAI‑Inspired Recommendations

The article explains why generic AI testing platforms fall short, outlines how to design a testable AI system from day one, and presents seven practical recommendations—from using Codex or Claude Code to manage regression and iteration test sets, to leveraging entropy diagnostics and custom domain‑expert UX.

AI Evaluation · Evaluation Framework · OpenAI
0 likes · 8 min read
SuanNi
Apr 21, 2026 · Artificial Intelligence

How Qwen3.6‑35B‑A3B Matches Dense Models with Only 3 B Active Parameters

The article analyzes Qwen3.6‑35B‑A3B’s MoE architecture, showing how its 3 B active parameters (of 35 B total) outperform larger dense models across programming, agent, and multimodal benchmarks, and examines the flagship Qwen3.6‑Max‑Preview’s substantial gains in world knowledge, instruction following, and third‑party rankings.

AI Evaluation · Benchmark · Mixture of Experts
0 likes · 5 min read
SuanNi
Apr 19, 2026 · Artificial Intelligence

Why Multimodal Video Models Still Miss the Mark: Inside the New Video‑MME‑v2 Benchmark

The Video‑MME‑v2 benchmark uses a rigorous three‑layer evaluation, non‑linear scoring, and a meticulously curated 800‑video dataset to show that current multimodal video models, despite high leaderboard scores, still struggle with genuine video understanding.

AI Evaluation · Video-MME · large language models
0 likes · 10 min read
IT Services Circle
Apr 14, 2026 · Artificial Intelligence

What Is RAG? A Complete Guide to Retrieval‑Augmented Generation for AI Engineers

This article explains Retrieval‑Augmented Generation (RAG), covering why large language models need external knowledge, the full offline‑and‑online workflow, document chunking, embedding evolution, vector database choices, multi‑path retrieval, evaluation metrics, hallucination types, and practical strategies to mitigate them.

AI Evaluation · Embedding · RAG
0 likes · 55 min read
Machine Heart
Apr 13, 2026 · Artificial Intelligence

Why the Top Video Model Scores Only 49: Introducing Video‑MME‑v2 by Nanjing University

The new Video‑MME‑v2 benchmark reveals that despite saturated high scores on existing video‑understanding tests, the strongest commercial model (Gemini‑3‑Pro) reaches only 49.4 points versus a human expert’s 90.7, highlighting the benchmark’s layered ability system, group‑level non‑linear scoring, and the nuanced impact of "Thinking" features.

AI Evaluation · large models · multimodal benchmark
0 likes · 11 min read
PMTalk Product Manager Community
Apr 10, 2026 · Artificial Intelligence

Why AI Product Evaluation Is Hard and How to Build a Scientific Assessment Framework

The article analyzes the unique challenges of evaluating AI products—output uncertainty, subjective criteria, over‑fitting risk, high cost, and vague metrics—compares traditional testing with AI testing, proposes a five‑step evaluation workflow, defines concrete metrics such as pass rate and efficiency gain, and illustrates the process with a real‑world sales‑script generation case study, concluding with five key success factors and future trends.

AI Evaluation · Automation · Case Study
0 likes · 13 min read
SuanNi
Mar 23, 2026 · Artificial Intelligence

Can LLMs Predict Real‑World War Outcomes? A Deep Dive into the 2026 Middle East Conflict

A research team from MBZUAI and the University of Maryland constructed an 11‑point timeline of the 2026 Middle East escalation, fed contemporaneous news to leading large language models, and evaluated their strategic reasoning, economic impact forecasts, and political signal interpretation, revealing both strengths and limitations of AI under extreme uncertainty.

AI Evaluation · Geopolitics · LLM
0 likes · 12 min read
Sohu Tech Products
Mar 19, 2026 · Artificial Intelligence

Testing GLM‑5 Turbo: From AutoClaw Integration to a Browser‑Based War3 Clone

This article walks through a hands‑on evaluation of the GLM‑5 Turbo model, detailing its integration with AutoClaw for rapid Feishu bot deployment, comparing its performance against a baseline model on OpenClaw data‑dashboard tasks, and showcasing a fully client‑side War3‑style RTS built in a single HTML file.

AI Evaluation · Agent Engine · AutoClaw
0 likes · 23 min read
Aikesheng Open Source Community
Mar 9, 2026 · Artificial Intelligence

Why Traditional AI Benchmarks Fail and How SCALE Redefines SQL LLM Evaluation

The article examines the shortcomings of conventional AI evaluation methods, introduces the concept of an "unknown" risk in production settings, and presents SCALE—a continuously updated, high‑fidelity benchmark that stresses large‑model SQL capabilities with real‑world incident data and mixed objective‑subjective scoring.

AI Evaluation · Model Selection · SQL benchmark
0 likes · 11 min read
AI Engineering
Mar 6, 2026 · Artificial Intelligence

Anthropic Adds a Full Evaluation Framework to Skill Creator

Anthropic's latest Skill Creator update introduces a code‑free evaluation framework that lets non‑engineer skill authors run tests, benchmark regressions, and optimize trigger descriptions, while supporting parallel multi‑agent execution and A/B comparisons to keep skills reliable as models evolve.

AI Evaluation · Anthropic · Benchmarking
0 likes · 8 min read
PaperAgent
Mar 6, 2026 · Artificial Intelligence

BeyondSWE: Rethinking Code Agent Benchmarks with Real‑World Multi‑Repo Challenges

BeyondSWE expands code‑agent evaluation beyond single‑repo bug fixing by introducing four realistic scenarios, scaling to 246 repositories and 500 samples, revealing a sharp performance drop for top models and highlighting the nuanced impact of search‑augmented agents like SearchSWE.

AI Evaluation · BeyondSWE · SearchSWE
0 likes · 6 min read
Aikesheng Open Source Community
Mar 2, 2026 · Artificial Intelligence

Why Traditional AI Benchmarks Fail and How SCALE Redefines SQL Model Evaluation

The article argues that conventional AI evaluation metrics miss critical unknown risks, outlines three key challenges in AI model selection for database tasks, introduces the SCALE benchmark with real‑world incident data, and explains its mixed evaluation framework that combines objective, subjective, and performance‑driven assessments to guide tech leaders toward reliable SQL‑focused AI solutions.

AI Evaluation · Model Selection · Performance Testing
0 likes · 10 min read
Machine Learning Algorithms & Natural Language Processing
Feb 12, 2026 · Artificial Intelligence

Fast Generation, Weak Intelligence? The Harsh Reality of Diffusion Models for Agents

A comprehensive evaluation shows that while diffusion language models achieve higher generation speed through parallel decoding, they suffer from severe causal reasoning and formatting deficiencies, lagging far behind autoregressive models on embodied and tool‑calling agent tasks.

AI Evaluation · Autoregressive Models · Tool Calling
0 likes · 8 min read
DataFunTalk
Feb 12, 2026 · Artificial Intelligence

DeepSeek’s New Model V4? Exploring 1M‑Token Context and Updated Knowledge

DeepSeek quietly launched its latest model, which reportedly supports a context window of up to 1 million tokens, extends its knowledge cutoff to May 2025, adopts a more enthusiastic response style, and still operates as a pure‑text system, while early tests showcase impressive coding and reasoning capabilities.

AI Evaluation · DeepSeek · knowledge cutoff
0 likes · 5 min read
DataFunTalk
Jan 21, 2026 · Artificial Intelligence

Why Traditional Coding Benchmarks Miss the Mark: Inside OctoCodingBench’s Process‑Level Evaluation

The article examines the rapid progress of AI coding agents, critiques existing benchmarks that only measure final correctness, and introduces OctoCodingBench—a new suite that simulates real‑world constraints, records full interaction traces, and evaluates both task success and strict process compliance across multiple languages.

AI Evaluation · LLM-as-judge · coding agents
0 likes · 10 min read
Programmer DD
Jan 12, 2026 · Artificial Intelligence

5 Counterintuitive Lessons for Evaluating AI Agents Effectively

This article shares five surprising, high‑impact lessons from Anthropic on building robust AI agent evaluation suites, covering early failure‑case collections, recognizing clever “failures,” focusing on outcomes over process, choosing the right success metrics, and the irreplaceable value of human review.

AI Evaluation · Anthropic · agent testing
0 likes · 10 min read
PaperAgent
Jan 10, 2026 · Artificial Intelligence

How to Build Robust Evaluations for AI Agents: A Complete Roadmap

Anthropic’s new blog reveals a comprehensive framework for evaluating AI agents, detailing evaluation structures, metrics like pass@k and pass^k, types of scorers, multi‑round testing, and a step‑by‑step roadmap for designing, maintaining, and integrating automated assessments into agent development pipelines.

AI Evaluation · AI agents · Evaluation Framework
0 likes · 15 min read
Architecture and Beyond
Jan 10, 2026 · Artificial Intelligence

How to Systematically Test and Evaluate Industry AI Agents

This guide explains how to systematically evaluate industry‑specific AI agents by testing the combined model and engineering stack, building domain‑expert‑driven datasets, designing reproducible testing systems, managing assets, controlling costs, and applying both traditional and LLM‑based methods to ensure reliable, stable performance.

AI Evaluation · LLM testing · agent testing
0 likes · 20 min read
HyperAI Super Neural
Dec 18, 2025 · Artificial Intelligence

GPT-5 Leads as OpenAI Unveils FrontierScience: Dual‑Track Reasoning and Research Benchmark

OpenAI's FrontierScience benchmark, released on Dec 16, 2025, evaluates expert‑level scientific reasoning and research tasks, showing GPT‑5.2 scoring 77% on Olympiad and 25% on Research, outperforming other models while highlighting strengths in closed‑form problems and gaps in open‑ended research tasks.

AI Evaluation · Benchmark · FrontierScience
0 likes · 10 min read
Alibaba Cloud Developer
Dec 17, 2025 · Artificial Intelligence

How to Build a Scalable AI Evaluation Platform for Rapid Product Iteration

This article outlines the challenges of AI product testing, proposes a comprehensive evaluation framework covering business goals, product effectiveness, performance, safety, and cost, and details the design of a modular, end‑to‑end testing platform that supports both reference‑based and reference‑free assessments while enabling continuous quality improvement.

AI Evaluation · Quality Engineering · Testing framework
0 likes · 19 min read
AI Frontier Lectures
Dec 9, 2025 · Artificial Intelligence

CrossVid: The New Benchmark Exposing AI’s Struggle with Cross‑Video Reasoning

CrossVid is an open‑source benchmark that evaluates multimodal large language models on cross‑video reasoning, offering 5,331 videos and 9,015 high‑quality QA pairs across four reasoning dimensions, and revealing that even the strongest models achieve only about 50% accuracy compared with human performance.

AI Evaluation · cross-video reasoning · video understanding
0 likes · 9 min read
Xiaohongshu Tech REDtech
Dec 4, 2025 · Artificial Intelligence

CrossVid: A New Benchmark Reveals the Limits of Multimodal LLMs in Cross‑Video Reasoning

CrossVid is an open‑source benchmark that evaluates multimodal large language models on cross‑video reasoning tasks, providing 5,331 videos, 9,015 QA pairs, four high‑level dimensions and ten specific tasks, and exposing significant performance gaps between current models and humans.

AI Evaluation · cross-video reasoning · multimodal LLM
0 likes · 9 min read
Meituan Technology Team
Nov 27, 2025 · Artificial Intelligence

AMO‑Bench: A New High‑Difficulty, Original Math Reasoning Benchmark for LLMs

AMO‑Bench, released by Meituan's LongCat team, is a 50‑question, IMO‑level math reasoning benchmark that combines original, high‑difficulty problems with automated scoring, exposing the current limits of top large language models whose best accuracy hovers around 52 % and offering a more discriminative evaluation tool for future model improvements.

AI Evaluation · AMO-Bench · Benchmark
0 likes · 12 min read
Radish, Keep Going!
Nov 15, 2025 · Artificial Intelligence

Can Google’s New Model Finally Crack Handwritten History and Symbolic Reasoning?

A historian’s experiment with a secret Google AI model shows near‑expert transcription of 18th‑century ledgers and multi‑step reasoning that may signal a breakthrough in both handwritten OCR and symbolic inference, sparking a heated debate on Hacker News about true understanding versus advanced pattern matching.

AI Evaluation · Gemini 3 · Hacker News debate
0 likes · 12 min read
HyperAI Super Neural
Nov 6, 2025 · Industry Insights

Three 22‑Year‑Old Dropouts Disrupt AI Recruiting, Landing $10 B Valuation in Two Years

Mercor, founded by three 22‑year‑old college dropouts, raised a $350 million Series C round that lifted its valuation to $10 billion within two years, built an AI‑powered recruiting platform serving OpenAI, Meta, Google and others, launched the APEX benchmark to measure economic value of AI models, and survived intense work‑culture pressures, a legal dispute, and rapid team changes.

AI Evaluation · AI recruiting · APEX benchmark
0 likes · 18 min read
Bighead's Algorithm Notes
Oct 30, 2025 · Artificial Intelligence

FinSearchComp: ByteDance’s Expert‑Level Financial Search and Reasoning Benchmark for Real‑World Scenarios

FinSearchComp is the first fully open‑source benchmark that evaluates large‑language‑model agents' search and reasoning abilities in realistic financial workflows, featuring 635 expert‑annotated questions across three task types, built with 70 finance experts, and revealing that web‑enabled models with financial plugins markedly outperform API‑only models.

AI Evaluation · Benchmark · FinSearchComp
0 likes · 12 min read
Baidu Tech Salon
Oct 24, 2025 · Artificial Intelligence

How Wenxin X1.1 Tops China’s LLMs on the New SuperCLUE-CPIF Benchmark

The recently released SuperCLUE-CPIF benchmark shows Baidu’s Wenxin X1.1 achieving the highest score among Chinese large language models, surpassing competitors like DeepSeek‑V3.2‑Exp‑Thinking and Hunyuan‑T1, with notable advantages in precise instruction following and complex task handling.

AI Evaluation · Benchmark · Wenxin X1.1
0 likes · 4 min read
DataFunTalk
Oct 22, 2025 · Artificial Intelligence

Introducing VitaBench: A Real-World Benchmark for Complex LLM Agents

VitaBench is a newly released, highly realistic benchmark that evaluates large‑language‑model agents across three everyday scenarios—food ordering, restaurant dining, and travel planning—by quantifying reasoning, tool‑use, and interaction complexities, revealing a significant performance gap in current models.

AI Evaluation · Benchmark · LLM agents
0 likes · 13 min read
Ops Development & AI Practice
Sep 16, 2025 · Artificial Intelligence

Why the “Bash Only” Benchmark Is the Toughest Test for AI Code Agents

This article examines the design philosophy behind the “Bash Only” category of the SWE‑bench benchmark, explaining how its minimal‑agent approach isolates LLM reasoning by restricting interactions to a plain Bash shell, making it a rigorous, reproducible test of true software‑engineering intelligence.

AI Evaluation · Bash Only · Benchmark
0 likes · 7 min read
Data Party THU
Jul 28, 2025 · Artificial Intelligence

AI’s Shift from Gold Medals to Cost‑Effective Quantitative Success

Terence Tao highlights that AI is transitioning from achieving headline‑making qualitative milestones, like winning IMO‑level contests, to a phase where quantitative metrics—resource costs, success rates, and scalability—must be transparently reported, urging standardized benchmarks and careful comparison between lightweight and heavyweight AI systems.

AI Evaluation · artificial intelligence · cost efficiency
0 likes · 8 min read
AntTech
Jul 3, 2025 · Artificial Intelligence

How Ant Group’s AI Multimodal Evaluation Transforms Image, Speech, and Video Quality Testing

In a QECon 2025 talk, Ant Group’s AI team detailed a comprehensive multimodal evaluation framework that leverages large‑model metrics, custom pipelines, and benchmark datasets to assess image generation, speech recognition, and video quality, while also contributing to industry standards and academic research.

AI Evaluation · image assessment · large models
0 likes · 16 min read
Alibaba Cloud Developer
Jun 26, 2025 · Artificial Intelligence

How to Build a Multi‑Dimensional Evaluation Framework for AI‑Powered Data Analysis Platforms

This article outlines the design of a scientific, quantifiable, multi‑dimensional evaluation system for the DataV‑Note intelligent analysis platform, addressing the lack of unified standards and accuracy challenges in AI‑driven data reporting, and proposes concrete metrics, model architecture, and future automation plans.

AI Evaluation · Model Design · data analysis
0 likes · 13 min read
Alibaba Cloud Developer
Jun 23, 2025 · Artificial Intelligence

How to Systematically Conduct Large Model Evaluation in Real-World Scenarios

This guide walks readers through a complete, business‑oriented workflow for evaluating large language models—from requirement analysis and test‑set design to metric definition, execution, result aggregation, and report generation—while addressing common challenges such as data imbalance, annotation quality, and automation.

AI Evaluation · Benchmarking · Reporting
0 likes · 24 min read
DataFunTalk
Jun 18, 2025 · Artificial Intelligence

Can LLMs Really Beat Human Olympiad Programmers? Insights from LiveCodeBench Pro

This article examines the LiveCodeBench Pro benchmark, revealing that while large language models achieve impressive scores on knowledge‑ and logic‑heavy coding problems, they still fall short of human experts on high‑difficulty, observation‑intensive tasks, especially without external tool support.

AI Evaluation · Benchmark · Code Generation
0 likes · 11 min read
Architect
Jun 12, 2025 · Artificial Intelligence

Why Large Reasoning Models Collapse Under Complex Tasks: Insights from Apple’s Study

Apple’s research reveals that large reasoning models, despite sophisticated self‑reflection mechanisms, experience a complete performance collapse when problem complexity exceeds a threshold, highlighting fundamental limits in their ability to achieve generalized reasoning.

AI Evaluation · Token efficiency · large reasoning models
0 likes · 7 min read
Eric Tech Circle
May 6, 2025 · Artificial Intelligence

How to Deploy Qwen3-30B-A3B Locally and Unlock Its Full AI Potential

This article walks through the complete process of installing the Qwen3-30B-A3B large language model on a personal computer using LM Studio, evaluates its reasoning, creative, multilingual, and coding abilities with detailed prompts, and shares practical tips for optimizing local deployment and prompt design.

AI Evaluation · LM Studio · Prompt engineering
0 likes · 12 min read
AIWalker
Mar 31, 2025 · Artificial Intelligence

VBench-2.0: A Next‑Generation Benchmark for Intrinsic Faithfulness in AI Video Generation

VBench-2.0 expands the original VBench suite by introducing six fine‑grained dimensions—Human Fidelity, Controllability, Creativity, Physics, Commonsense, and more—to evaluate not only the visual quality of generated videos but also their intrinsic faithfulness to physical laws, common sense, and narrative coherence, providing open‑source tools, prompts, and human‑aligned metrics for the research community.

AI Evaluation · Benchmark · Intrinsic Faithfulness
0 likes · 12 min read
Nightwalker Tech
Mar 28, 2025 · Artificial Intelligence

Comprehensive Evaluation of GPT-4o Multimodal Image Generation Capabilities

This article presents a thorough assessment of GPT‑4o’s new image generation features, detailing multiple test scenarios—from simple portrait creation and style transfer to UI design, product rendering, and educational illustrations—comparing its output with Claude‑3.7‑Sonnet, highlighting strengths in realism and weaknesses in Chinese text handling.

AI Evaluation · GPT-4o · image generation
0 likes · 16 min read
Model Perspective
Mar 23, 2025 · Artificial Intelligence

How to Quantify AI’s Role in Mathematical Modeling with a Contribution Index

This article proposes an AI Contribution Index for mathematical modeling, explains its weighted‑average construction, provides concrete formulas and examples, and discusses broader applications and philosophical implications of quantifying AI involvement across various stages of problem solving.

AI Evaluation · AI contribution · mathematical modeling
0 likes · 6 min read
AntTech
Feb 26, 2025 · Artificial Intelligence

Ant Group’s 18 Accepted Papers at AAAI 2025: Summaries and Highlights

This article presents concise English summaries of the 18 Ant Group papers accepted at AAAI 2025, covering topics such as privacy‑preserving large‑model tuning, knowledge‑graph integration, AI‑generated image detection, multi‑task learning, generative retrieval, role‑playing evaluation, and video hallucination mitigation.

AAAI 2025 · AI Evaluation · Generative Retrieval
0 likes · 29 min read
AIWalker
Jan 18, 2025 · Artificial Intelligence

How InternLM 3.0 Achieves High Performance with Just 4T Tokens of Training Data

Shanghai AI Laboratory’s InternLM 3.0 upgrade demonstrates that a refined 4‑trillion‑token dataset can boost a large‑language model’s performance beyond that of open‑source peers trained on 18 trillion tokens, cutting training cost by over 75% while merging regular dialogue with deep reasoning capabilities.

AI Evaluation · InternLM · data efficiency
0 likes · 9 min read
Alimama Tech
Dec 25, 2024 · Artificial Intelligence

WiS Platform: Evaluating LLM Multi-Agent Systems via Game-Based Analysis

The WiS Platform provides a game‑based environment for benchmarking large language models in multi‑agent settings, measuring reasoning, deception and collaboration through dynamic scenarios, offering fair experimental design, real‑time competition, visualizations, detailed metrics, and open‑source tools, with GPT‑4o outperforming other models such as Qwen2.5‑72B‑Instruct.

AI Evaluation · Defense Strategies · Game-Based Testing
0 likes · 8 min read
Volcano Engine Developer Services
Dec 10, 2024 · Artificial Intelligence

Introducing FullStack Bench: Multi‑Language Code LLM Benchmark & SandboxFusion

The article presents FullStack Bench, a newly open‑sourced, multi‑language code‑LLM evaluation dataset covering over 11 real‑world programming scenarios and 16 languages, along with the SandboxFusion execution environment, and reports comprehensive benchmark results that highlight the superiority of closed‑source models over most open‑source alternatives.

AI Evaluation · Benchmark · Code LLM
0 likes · 11 min read
DataFunSummit
Dec 3, 2024 · Artificial Intelligence

Applying Large Language Models to NPC Role‑Playing and Game Localization at Tencent

This article details Tencent's practical exploration of large language model deployment in overseas game scenarios, covering the design of customized NPC role‑playing models, multilingual localization pipelines, data construction, training, evaluation frameworks, multi‑agent improvement loops, and insights from a comprehensive Q&A session.

AI Evaluation · NPC AI · Tencent
0 likes · 17 min read
Kuaishou Tech
Sep 20, 2024 · Artificial Intelligence

Building an LLM-Based Agent Platform for Enterprise Commercialization: Strategies, Architecture, and Practical Insights

This article details the strategic development and technical architecture of SalesCopilot, an LLM-driven agent platform designed for enterprise commercialization, highlighting the implementation of RAG and agent technologies, addressing practical challenges, and sharing key insights for building scalable AI applications.

AI Evaluation · AI agents · Enterprise AI
0 likes · 15 min read
CSS Magic
Sep 14, 2024 · Artificial Intelligence

Why OpenAI’s New o1 Model Outperforms Its Rivals

The article examines OpenAI’s newly released o1 model, highlighting its superior performance in complex reasoning tasks such as math, programming, and science, and explains how model‑level chain‑of‑thought optimization and product‑level UI design give it an edge over competitors like Claude.

AI Evaluation · ChatGPT · OpenAI
0 likes · 8 min read
Java Tech Enthusiast
Jul 16, 2024 · Artificial Intelligence

LLMs Misjudge Simple Number Comparison: 9.11 vs 9.9

Recent tests reveal that popular large language models, including GPT‑4o, Gemini Advanced, and Claude 3.5, often claim 9.11 is larger than 9.9 because their tokenizers split the numbers. Rephrasing the question, using zero‑shot chain‑of‑thought prompts, or treating the values as floating‑point numbers can correct the mistake, and Chinese models show the same error to varying degrees.

AI Evaluation · LLM · Prompt engineering
0 likes · 7 min read
DataFunSummit
Jul 6, 2024 · Artificial Intelligence

Synergy Between Large Language Models and Knowledge Graphs: Recent Advances, Evaluation, and Future Integration

This article reviews the rapid progress of large language models and their complementary relationship with knowledge graphs, covering comparative strengths, knowledge extraction and completion, evaluation benchmarks, deployment benefits, complex reasoning support, and prospects for interactive fusion toward more reliable and explainable AI systems.

AI Evaluation · Knowledge Graphs · knowledge extraction
0 likes · 12 min read
DataFunSummit
Apr 13, 2024 · Artificial Intelligence

Understanding and Mitigating Hallucinations in Large Language Model Industry Q&A with Knowledge Graphs

This article examines why large language models often produce hallucinations in industry question‑answering, defines the phenomenon, explores its data and training origins, proposes evaluation metrics, and presents practical strategies—including high‑quality fine‑tuning data, honest refusal mechanisms, advanced decoding methods, and external knowledge‑graph augmentation—to reduce hallucinations and improve reliability.

AI Evaluation · Knowledge Graph · hallucination
0 likes · 21 min read
Huolala Tech
Apr 3, 2024 · Artificial Intelligence

How Huolala Built an End‑to‑End AI Evaluation Platform for Logistics

This article explains how Huolala designed and implemented a one‑stop AI evaluation platform—Lala Zhiping—to select and assess large language models for logistics scenarios, detailing its business background, architecture, configurable workflow, data isolation, permission system, and future development plans.

AI Evaluation · Data Isolation · System Architecture
0 likes · 11 min read
21CTO
Feb 20, 2024 · Artificial Intelligence

Which LLM Dominates Coding? GPT‑4 vs CodeLlama vs Mixtral vs Gemini

This article presents a head‑to‑head evaluation of four leading large language models—GPT‑4, CodeLlama 70B, CodeLlama 7B, and Mixtral 8x7B—across eight coding‑related tasks, revealing GPT‑4 as the overall winner while highlighting the trade‑offs of smaller models and emerging competitors like Google Gemini.

AI Evaluation · CodeLlama · Coding Assistant
0 likes · 9 min read
Rare Earth Juejin Tech Community
Jan 3, 2024 · Artificial Intelligence

Llama 2: Open Foundation and Fine‑Tuned Chat Models – Ghost Attention, RLHF Results, and Safety Evaluation

This article summarizes the Llama 2 series, describing the Ghost Attention technique for maintaining system‑message consistency across multi‑turn dialogs, presenting RLHF and human evaluation results, and discussing extensive safety pre‑training, benchmark assessments, and model release details.

AI Evaluation · Ghost Attention · Llama-2
0 likes · 20 min read
DataFunTalk
Aug 9, 2023 · Artificial Intelligence

Key Technologies for Domain‑Specific Large Models: Insights from the World AI Conference

This report, based on Professor Xiao Yanghua’s presentation at the World AI Conference, examines why vertical domains need general large models, outlines their key capabilities such as open‑world understanding, combinatorial innovation, evaluation, complex instruction execution, task planning, and symbolic reasoning, and discusses current limitations and optimization strategies for domain‑specific deployment.

AI Evaluation · Model Optimization · Vertical AI
0 likes · 17 min read
Baidu Tech Salon
Aug 8, 2023 · Artificial Intelligence

Tsinghua University Report Ranks Baidu Wenxin Yiyan First Among Chinese Large Language Models

A Tsinghua University evaluation of seven large language models found Baidu’s Wenxin Yiyan topping the domestic rankings with the highest overall score across 20 metrics—especially Chinese semantic understanding and safety—surpassing ChatGPT and tying GPT‑4, while also demonstrating rapid training, inference speed, and broad industry adoption.

AI Evaluation · Baidu Wenxin · Chinese NLP
0 likes · 4 min read
php Courses
Aug 2, 2023 · Artificial Intelligence

Stanford and UC Berkeley Study Finds Significant Decline in GPT-4 Capabilities Across Math, Coding, and Visual Reasoning

A joint Stanford and UC Berkeley study reveals that GPT‑4’s performance on mathematics, code generation, and visual‑reasoning tasks sharply declined between March and June 2023, with accuracy dropping from 97.6% to 2.4% on a prime‑checking benchmark and executable code rates falling from 52% to 10%.

AI Evaluation · GPT-4 · machine learning
0 likes · 3 min read
DataFunTalk
Mar 27, 2023 · Artificial Intelligence

GPT-4 Shows Early Signs of Artificial General Intelligence: Insights from the "Sparks of AGI" Paper

A recent 154‑page Microsoft paper titled "Sparks of Artificial General Intelligence: Early Experiments with GPT‑4" argues that GPT‑4, despite being an early prototype, already exhibits many capabilities—multimodal reasoning, programming, mathematics, and human‑like interaction—suggesting it may be an early form of AGI, though experts highlight significant limitations and ongoing debates.

AI Evaluation · Artificial General Intelligence · GPT-4
0 likes · 15 min read
IT Services Circle
Feb 7, 2023 · Artificial Intelligence

ChatGPT’s Bug‑Fixing Ability Reaches State‑of‑the‑Art on the QuixBugs Benchmark

Researchers from Germany and the UK evaluated ChatGPT against Codex, CoCoNut, and standard automated program repair (APR) approaches on the QuixBugs benchmark, finding that ChatGPT correctly fixed 31 of 40 bugs and outperformed the other three. The result sparked mixed reactions about its impact on software engineering and OpenAI’s broader strategies.

AI Evaluation · ChatGPT · QuixBugs
0 likes · 8 min read
DataFunTalk
Dec 28, 2021 · Artificial Intelligence

Evaluation Framework and Methodology for OPPO XiaoBu AI Assistant

This article presents a comprehensive evaluation framework for OPPO's XiaoBu AI assistant, covering evaluation concepts, objectives, five key elements, sampling methods, dimension selection, annotation scoring, report generation, and a detailed Q&A that illustrates practical metrics and processes for voice and search services.

AI Evaluation · OPPO · Testing Methodology
0 likes · 23 min read
DataFunSummit
Dec 27, 2021 · Artificial Intelligence

Evaluation Framework and Methodology for OPPO XiaoBu AI Assistant

This article presents a comprehensive evaluation framework for OPPO's XiaoBu AI assistant, covering the concept and purpose of evaluation, the five key evaluation elements, data sampling strategies, dimension and rule selection, annotation scoring, reporting guidelines, and detailed procedures for assessing wake‑up, ASR, NLU, and TTS performance.

AI Evaluation · Reporting · Voice Assistant
0 likes · 20 min read