Tagged articles
21 articles
Old Zhang's AI Learning
May 20, 2026 · Artificial Intelligence

Qwen 3.7‑Max vs Claude 4.7: 7 In‑Depth Tests Reveal a Smooth, Powerful Model

The author evaluates Alibaba’s newly released Qwen 3.7‑Max across seven rigorous tasks—including reading comprehension, HTML fireworks generation, 3D particle visualizations, PDF‑to‑PPT conversion, Excel data analysis, GitHub trending scraping, and complex video generation—showing it often surpasses GPT‑5.5‑level models and rivals Claude 4.7, especially in long‑duration agent tasks.

AI Benchmark · Claude 4.7 · Model Evaluation
0 likes · 9 min read
DataFunTalk
May 19, 2026 · Artificial Intelligence

Qwen 3.7 Max Preview Lands: Rapid Dual‑Model Iteration Keeps China’s Lead in Text and Vision

The Qwen 3.7‑Max and Qwen 3.7‑Plus preview models debut with top‑15 global rankings in Arena as the only Chinese models on the text and vision leaderboards, while a timeline analysis shows the Qwen series accelerating from a release every 4 to 6 months to one every 2 to 3 months and introducing dense and MoE variants of up to 235 B parameters.

AI Benchmark · Chinese AI · Model Iteration
0 likes · 6 min read
Machine Heart
May 2, 2026 · Artificial Intelligence

Why GPT‑5.5 and Claude Opus 4.7 Score Below 1% on ARC‑AGI‑3 While Humans Achieve 100%

The ARC‑AGI‑3 benchmark shows that GPT‑5.5 (0.43%) and Claude Opus 4.7 (0.18%) fail to solve any of the 135 novel environments, whereas a six‑year‑old human solves them all, and the analysis attributes the gap to three concrete failure modes and differing compression abilities of the two models.

AI Benchmark · ARC-AGI-3 · Claude Opus 4.7
0 likes · 10 min read
Machine Heart
Apr 21, 2026 · Artificial Intelligence

Kimi K2.6 Unveils 300‑Agent Swarm, Ending the Single‑Agent Era

The newly released Kimi K2.6 model expands the Agent Swarm to coordinate up to 300 agents, delivers significant gains in coding speed, long‑context understanding, and benchmark performance that surpasses GPT‑5.4, Claude Opus and Gemini, while showcasing end‑to‑end front‑end generation demos.

AI Benchmark · Agent Swarm · Coding Assistant
0 likes · 9 min read
Machine Heart
Apr 20, 2026 · Artificial Intelligence

Does OpenClaw Remember You? Cambridge Launches ATM‑Bench for Long‑Term Memory

Cambridge's new ATM‑Bench evaluates AI assistants' ability to retrieve personal memories spanning years across multimodal data, revealing that leading agents like OpenClaw, Codex, and Claude Code achieve under 40% accuracy and struggle despite extensive toolchains, highlighting a fundamental long‑term memory challenge.

AI Benchmark · ATM-Bench · Claude Code
0 likes · 8 min read
AI Engineering
Apr 1, 2026 · Artificial Intelligence

Holo3 AI Model Beats GPT‑5.4 at One‑Tenth the Cost for Computer Use

H Company’s new Holo3 series delivers a visual language model that outperforms GPT‑5.4 on the OSWorld‑Verified benchmark with a 78.85% score while costing only about one‑tenth as much, offering both a flagship API‑only version and an open‑source lightweight variant optimized for GUI agents.

AI Benchmark · GUI Agent · Holo3
0 likes · 4 min read
PaperAgent
Mar 21, 2026 · Artificial Intelligence

Can AI Truly Be Creative? Inside the CreativeBench Benchmark

This article examines the CreativeBench benchmark, which redefines machine creativity by measuring both the quality and novelty of generated solutions, explains its combinatorial and exploratory task designs, details the self‑evolving task construction process, and discusses key findings and the EvoRePE enhancement method.

AI Benchmark · EvoRePE · Large Language Models
0 likes · 18 min read
SuanNi
Mar 20, 2026 · Artificial Intelligence

How SkillCraft Shows AI Agents Can Cut Compute Costs by Up to 80%

SkillCraft, a new benchmark from Oxford and partner institutions, evaluates whether AI agents can autonomously combine basic tools into reusable skills, revealing that stronger models dramatically improve task success rates while slashing compute consumption by up to 80%, and exposing the limits of hierarchical skill nesting and cross‑model skill sharing.

AI Benchmark · Compute Efficiency · SkillCraft
0 likes · 15 min read
Alibaba Cloud Developer
Mar 16, 2026 · Artificial Intelligence

HeartBench: Building the First Chinese AI Humanization Benchmark

This article details the creation of HeartBench, a Chinese benchmark for evaluating large language models' emotional and social intelligence, describing its background, design principles, data pipeline, evaluation methods, multi‑stage versioning, blind‑test validation, and lessons for building transferable AI assessment frameworks.

AI Benchmark · Emotion AI · Humanization
0 likes · 25 min read
Old Zhang's AI Learning
Jan 27, 2026 · Artificial Intelligence

Qwen3‑Max‑Thinking Boosts Performance with Test‑Time Scaling—Why It Still Isn’t Open‑Source

Alibaba’s new Qwen3‑Max‑Thinking model adds inference‑time scaling and adaptive tool use, delivering large gains on math, coding, and agent benchmarks while remaining closed‑source, and it offers drop‑in OpenAI‑compatible API access at the cost of higher latency and token usage.

AI Benchmark · Adaptive Tool Use · OpenAI API Compatibility
0 likes · 7 min read
PaperAgent
Dec 23, 2025 · Artificial Intelligence

CATArena: A Competitive Benchmark That Turns Agent Scoring into Evolutionary Learning

CATArena introduces a tournament‑style evaluation framework where AI agents iteratively code, compete, and improve across classic board games, using three‑dimensional quantitative scores to measure strategy programming, global learning, and generalization, and reveals how different LLM‑based agents learn and adapt over multiple rounds.

AI Benchmark · Agent Evaluation · CATArena
0 likes · 8 min read
Wuming AI
Sep 6, 2025 · Artificial Intelligence

Can Qwen3-Max-Preview Outperform Claude? A Deep Dive into China’s New 1‑T LLM

The article reviews Alibaba's 1‑trillion‑parameter Qwen3‑Max‑Preview model, comparing its benchmark scores, hallucination rate, math and coding accuracy, and SVG generation quality against Claude, Kimi K2, and DeepSeek, while providing usage links and real‑world user impressions.

AI Benchmark · Qwen3 · SVG generation
0 likes · 4 min read
Programmer DD
Apr 29, 2025 · Artificial Intelligence

Why Qwen3 Is Redefining Open‑Source LLMs: Mixed‑Inference Power and Unmatched Performance

Qwen3, Alibaba’s latest open‑source large language model, introduces a pioneering mixed‑inference architecture that blends top‑tier reasoning and non‑reasoning capabilities, delivering record‑breaking benchmark scores, multilingual support for 119 languages, cost‑effective deployment, and a 128K context window, now accessible via Ollama and OpenRouter.

AI Benchmark · Qwen3 · large language model
0 likes · 5 min read
Baidu Geek Talk
Apr 16, 2025 · Industry Insights

What Do the Latest AIIA FactTesting Benchmarks Reveal About China’s Large Language Models?

At the AIIA’s 14th plenary meeting in Nanjing, the FactTesting benchmark released its Q1 2025 results, evaluating over 200 large models and highlighting Baidu’s Wenxin 4.5 and Wenxin X1 as leaders in basic and reasoning capabilities, while outlining the expanded multimodal and agent testing roadmap for the year.

AI Benchmark · China AI · FactTesting
0 likes · 5 min read
Top Architect
Mar 9, 2025 · Artificial Intelligence

Alibaba Unveils Qwen QwQ-32B: A Compact Open‑Source LLM Rivaling DeepSeek

Alibaba has released the open‑source Qwen QwQ‑32B model, a 32‑billion‑parameter LLM that matches DeepSeek‑R1's performance while being deployable on consumer‑grade GPUs, and the announcement is accompanied by extensive promotional offers for AI‑related products and services.

AI Benchmark · Alibaba · Qwen
0 likes · 7 min read
Java Tech Enthusiast
Mar 8, 2025 · Artificial Intelligence

QwQ-32B Large Language Model Overview and Performance

Alibaba’s new QwQ‑32B large‑language model, with 32 billion parameters, delivers performance comparable to or surpassing the 671‑billion‑parameter DeepSeek‑R1 across math, coding, and general benchmarks, and is available via HuggingFace, ModelScope, and a DashScope API demo with example Python code.

AI Benchmark · Python API · large language model
0 likes · 5 min read
Alibaba Cloud Big Data AI Platform
Mar 7, 2025 · Artificial Intelligence

How QwQ-32B Outperforms OpenAI o1-mini and Deploys in One Click on Alibaba Cloud

Alibaba Cloud's newly released QwQ-32B model delivers benchmark‑level performance rivaling top open‑source LLMs, integrates agent capabilities, and can be deployed with a single click through the PAI‑Model Gallery, offering a cost‑effective solution for developers seeking advanced AI inference.

AI Benchmark · Alibaba Cloud · LLM
0 likes · 5 min read
AI Algorithm Path
Feb 22, 2025 · Artificial Intelligence

Elon Musk Unveils Grok 3, Claiming the World’s Most Powerful AI Model

The article details the launch of Grok 3 by Elon Musk’s xAI, highlighting its massive GPU infrastructure, benchmark dominance over GPT‑4o, multiple model variants, pricing for Premium+ users, upcoming API and voice features, and the team’s plan to open‑source Grok 2 once the new model stabilises.

AI Benchmark · AI pricing · Elon Musk
0 likes · 6 min read
Architects' Tech Alliance
Feb 10, 2025 · Industry Insights

What Makes DeepSeek’s New V3 Model Rival GPT‑4o? A Deep Dive into Large‑Scale AI

This article explains what defines a large AI model, compares parameter scales of GPT‑3, GPT‑4 and M6, and analyzes DeepSeek’s recent releases—V3, R1, and Janus‑Pro—highlighting their benchmark performance, reinforcement‑learning techniques, and cost efficiency versus leading proprietary models.

AI Benchmark · DeepSeek · Model Scaling
0 likes · 5 min read
JavaEdge
Dec 1, 2024 · Artificial Intelligence

Exploring the Limits and Benchmarks of Qwen’s QwQ‑32B‑Preview AI Model

QwQ‑32B‑Preview, an experimental AI model from the Qwen team, showcases strong reasoning in math and programming while facing challenges like language switching, inference loops, safety concerns, and variable capabilities across domains, with benchmark scores ranging from 50% to over 90% on tests such as GPQA, AIME, MATH‑500, and LiveCodeBench.

AI Benchmark · LLM · Model Evaluation
0 likes · 7 min read
Java Tech Enthusiast
Jul 12, 2024 · Artificial Intelligence

Why Alibaba’s Qwen‑2 Is Outperforming Global LLMs and What It Means for AI

After OpenAI halted API access in China, Alibaba’s Tongyi Qwen‑2 quickly rose to the top of global open‑source LLM leaderboards, surpassing Meta’s Llama‑3 and other contenders, with detailed benchmark scores, performance gains over previous versions, and implications for China’s AI ecosystem.

AI Benchmark · Alibaba · China AI
0 likes · 5 min read