Tagged articles

AI model evaluation

11 articles · Page 1 of 1

Jun 13, 2026 · Artificial Intelligence

When Claude Went Offline, a Chinese Model Picked Up the Slack

The sudden suspension of Anthropic's Claude models sparked a surge of discussion in China's AI community and led to the rapid release of GLM‑5.2. The article compares GLM‑5.2 with Claude Opus 4.8 in detail for developers, covering its extended context window, coding‑plan subscription, and mixed performance on engineering tasks.

AI model evaluation · Claude Opus · GLM-5.2

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

Apr 7, 2026 · Artificial Intelligence

From One Test for All to Personalized Exams: USTC’s First Survey on Computerized Adaptive Testing (TPAMI 2026)

The article reviews the first USTC survey published in TPAMI 2026, which analyzes Computerized Adaptive Testing (CAT) from a machine‑learning perspective, detailing its measurement models, selection algorithms, question‑bank construction, test‑control issues, and its emerging role in evaluating both students and AI models.

AI model evaluation · Computerized Adaptive Testing · Educational Measurement

0 likes · 9 min read

Machine Learning Algorithms & Natural Language Processing

Mar 10, 2026 · Artificial Intelligence

How Much Has GPT‑5.4 Improved? Hands‑On Test of Its Three Core Capabilities and Computer Control

After GPT‑5.4’s March release, the author benchmarks it against Claude Opus 4.6 and Gemini 3.1 Pro and evaluates its knowledge‑work, native computer‑control, and programming abilities through three hands‑on tasks (data analysis, code‑base inspection, and a complex math‑modeling contest), revealing strong gains but still notable limitations.

AI model evaluation · GPT-5.4 · benchmark

0 likes · 11 min read

PaperAgent

Dec 14, 2025 · Artificial Intelligence

GPT‑5.2 vs Gemini 3 Pro: Coding Tests, NeurIPS 2025 Paper Insights, and RAG Refactor

The article evaluates GPT‑5.2 and Gemini 3 Pro on real‑world coding tasks, analyzes trends from the 6000 papers presented at NeurIPS 2025, and demonstrates how to extract and refactor the tree‑building component of the open‑source RAPTOR RAG system into an independent module.

AI model evaluation · GPT-5.2 · Gemini 3 Pro

0 likes · 5 min read

Aikesheng Open Source Community

Nov 21, 2025 · Artificial Intelligence

Gemini 3 Pro Leads SQL Benchmarks with Deep Understanding, High‑Quality Optimization, and Balanced Dialect Conversion

The SCALE evaluation shows Gemini 3 Pro topping the SQL benchmark leaderboard, ranking first in SQL understanding, second in optimization, and sixth in dialect conversion. It highlights the model's strengths in execution accuracy and syntax error detection, along with areas needing improvement such as execution‑plan prediction and large‑SQL handling.

AI model evaluation · Dialect Conversion · Gemini 3 Pro

0 likes · 12 min read

Xiaolong Cloud Tech Team

Nov 19, 2025 · Artificial Intelligence

Gemini 3 Pro Review: 20%+ Coding Gains and Front‑End SOTA Performance

Google’s newly released Gemini 3 Pro shows more than a 20% improvement in coding ability over Gemini 2.5 Pro, achieves state‑of‑the‑art results across multiple benchmarks, and can generate complete React apps, visual designs, physics simulations, and simple games, though its new Antigravity IDE currently suffers from login issues.

AI model evaluation · Gemini 3 Pro · Google AI

0 likes · 5 min read

AI Tech Publishing

Nov 17, 2025 · Artificial Intelligence

Frontier AI Models in RL Environments Reveal an Agent Capability Hierarchy

The article evaluates nine cutting‑edge AI models on 150 simulated workplace tasks, showing that even the strongest models complete fewer than 40% of tasks, and uses these results to propose a hierarchical framework of agentic capabilities ranging from tool use to common‑sense reasoning.

AI model evaluation · Reinforcement Learning · Tool Use

0 likes · 19 min read

Sohu Tech Products

Sep 10, 2025 · Artificial Intelligence

Can Kimi K2 Replace Claude’s Brain? A Deep Dive into AI‑Powered Code Agents

This article evaluates whether the domestically developed Kimi K2 model can serve as a cost‑effective alternative brain for Claude Code. It details step‑by‑step integration and performance tests covering task accuracy, advanced feature compatibility, memory retrieval, parallel development with Git Worktree, and hook automation, then summarizes the model's strengths, limitations, and overall success.

AI model evaluation · Claude Code · Cost Optimization

0 likes · 18 min read

Aikesheng Open Source Community

Aug 20, 2025 · Artificial Intelligence

GPT‑5 Models Ranked: Which Variant Excels at SQL Tasks?

An in‑depth August 2025 benchmark evaluates GPT‑5’s mini, nano, and chat variants on SQL understanding, optimization, and dialect conversion, revealing gpt‑5‑mini’s balanced performance, gpt‑5‑nano’s strong code‑generation accuracy, and gpt‑5‑chat’s theoretical strengths but practical shortcomings, guiding scenario‑specific model selection.

AI model evaluation · Artificial Intelligence · GPT-5

0 likes · 9 min read

Ops Development & AI Practice

Apr 4, 2025 · Artificial Intelligence

Decoding LLM Endpoint Features: Quantization, Tokens, and Tool Support Explained

This article breaks down the key endpoint features of large language models—such as quantization, max token limits, streaming cancellation, tool support, and reasoning ability—explaining what each term means, why it matters, and how to choose models wisely for different applications.

AI model evaluation · Endpoint Features · LLM

0 likes · 11 min read

Java Tech Enthusiast

Feb 26, 2025 · Artificial Intelligence

Claude 3.7 Sonnet: How It Crushes Coding, Physics Simulations, and Logic Puzzles

Claude 3.7 Sonnet demonstrates unprecedented programming speed, realistic physics simulation, advanced reasoning on misleading benchmarks, and strong productivity tools. Together with Anthropic's $3.5 billion funding round, this makes it a standout AI model in both technical capability and market impact.

AI model evaluation · Claude 3.7 · Logic Reasoning

0 likes · 11 min read
