Machine Learning Algorithms & Natural Language Processing
Apr 7, 2026 · Artificial Intelligence

From One Test for All to Personalized Exams: USTC’s First Survey on Computerized Adaptive Testing (TPAMI 2026)

The article reviews the first USTC survey published in TPAMI 2026, which analyzes Computerized Adaptive Testing (CAT) from a machine‑learning perspective, detailing its measurement models, selection algorithms, question‑bank construction, test‑control issues, and its emerging role in evaluating both students and AI models.

AI model evaluation · Computerized Adaptive Testing · Educational Measurement
0 likes · 9 min read
Machine Learning Algorithms & Natural Language Processing
Mar 10, 2026 · Artificial Intelligence

How Much Has GPT‑5.4 Improved? Hands‑On Test of Its Three Core Capabilities and Computer Control

After GPT‑5.4’s March release, the author benchmarks it against Claude Opus 4.6 and Gemini 3.1 Pro and evaluates its knowledge‑work, native computer‑control, and programming abilities through three hands‑on tasks (data analysis, code‑base inspection, and a complex math‑modeling contest), finding strong gains but notable remaining limitations.

AI model evaluation · GPT-5.4 · benchmark
0 likes · 11 min read
Aikesheng Open Source Community
Nov 21, 2025 · Artificial Intelligence

Gemini 3 Pro Leads SQL Benchmarks with Deep Understanding, High‑Quality Optimization, and Balanced Dialect Conversion

The SCALE evaluation shows Gemini 3 Pro topping the SQL benchmark leaderboard, ranking No. 1 in SQL understanding, No. 2 in optimization, and No. 6 in dialect conversion. It highlights the model’s strengths in execution accuracy and syntax error detection, and identifies execution‑plan prediction and large‑SQL handling as areas needing improvement.

AI model evaluation · Gemini-3-Pro · SCALE Framework
0 likes · 12 min read
AI Tech Publishing
Nov 17, 2025 · Artificial Intelligence

Frontier AI Models in RL Environments Reveal an Agent Capability Hierarchy

The article evaluates nine cutting‑edge AI models on 150 simulated workplace tasks, showing that even the strongest models complete fewer than 40% of tasks, and uses these results to propose a hierarchical framework of agentic capabilities ranging from tool use to common‑sense reasoning.

AI model evaluation · agentic capabilities · common sense reasoning
0 likes · 19 min read
Sohu Tech Products
Sep 10, 2025 · Artificial Intelligence

Can Kimi K2 Replace Claude’s Brain? A Deep Dive into AI‑Powered Code Agents

This article evaluates whether the domestically developed Kimi K2 model can serve as a cost‑effective alternative brain for Claude Code. It details the step‑by‑step integration and tests performance across task accuracy, advanced feature compatibility, memory retrieval, parallel development with Git Worktree, and hook automation, then concludes with the model’s strengths, limitations, and overall success.

AI model evaluation · Claude Code · Kimi K2
0 likes · 18 min read
Aikesheng Open Source Community
Aug 20, 2025 · Artificial Intelligence

GPT‑5 Models Ranked: Which Variant Excels at SQL Tasks?

An in‑depth August 2025 benchmark evaluates GPT‑5’s mini, nano, and chat variants on SQL understanding, optimization, and dialect conversion, revealing gpt‑5‑mini’s balanced performance, gpt‑5‑nano’s strong code‑generation accuracy, and gpt‑5‑chat’s theoretical strengths but practical shortcomings, guiding scenario‑specific model selection.

AI model evaluation · Artificial Intelligence · GPT-5
0 likes · 9 min read
Java Tech Enthusiast
Feb 26, 2025 · Artificial Intelligence

Claude 3.7 Sonnet: How It Crushes Coding, Physics Simulations, and Logic Puzzles

Claude 3.7 Sonnet demonstrates unprecedented programming speed, realistic physics simulation, advanced reasoning on misleading benchmarks, and strong productivity tools. Anthropic also secures a $3.5 billion funding round, and the article presents the model as a standout in both technical capability and market impact.

AI model evaluation · Claude 3.7 · Logic Reasoning
0 likes · 11 min read