Tagged articles
36 articles
Old Zhang's AI Learning
May 6, 2026 · Artificial Intelligence

GPT-5.5 Instant Arrives: Smarter, Clearer, More Personalized AI

OpenAI has silently replaced the default ChatGPT model with GPT‑5.5 Instant, delivering a 52.5% drop in hallucinations, 30% shorter responses, deeper personalization via memory sources, and higher benchmark scores across a range of professional tasks, while rolling out new pricing and usage tiers.

AI benchmarks · ChatGPT · GPT-5.5
0 likes · 11 min read
SuanNi
May 5, 2026 · Artificial Intelligence

Anthropic Co‑Founder Predicts 60% Chance AI Will Self‑Develop the Next‑Gen Model by End‑2028

Jack Clark’s Import AI analysis forecasts that, based on accelerating benchmark scores such as SWE‑Bench and METR, there is a 60% probability that by the end of 2028 AI systems will be able to autonomously design and train the next generation of more capable models, reshaping research, economics, and alignment challenges.

AI Alignment · AI benchmarks · AI economics
0 likes · 15 min read
Machine Learning Algorithms & Natural Language Processing
May 5, 2026 · Artificial Intelligence

Will AI Achieve Recursive Self‑Improvement by 2028? Anthropic’s 60% Forecast

Anthropic co‑founder Jack Clark predicts a 60% chance that by the end of 2028 AI systems will be capable of recursive self‑improvement, citing rapid progress on benchmarks such as CORE‑Bench, PostTrainBench, SWE‑Bench, METR, and emerging capabilities in kernel design, agentic coding, and AI‑to‑AI management.

AI Alignment · AI automation · AI benchmarks
0 likes · 25 min read
Machine Heart
May 5, 2026 · Artificial Intelligence

Anthropic Cofounder Predicts 60% Chance AI Will Self‑Evolve by 2028

Jack Clark, Anthropic’s co‑founder, argues that based on a sweep of public AI benchmarks—including CORE‑Bench, PostTrainBench, MLE‑Bench, SWE‑Bench and METR—there is roughly a 60% probability that recursive self‑improvement will emerge by the end of 2028, raising profound technical and alignment challenges.

AI Alignment · AI automation · AI benchmarks
0 likes · 23 min read
DataFunTalk
Apr 30, 2026 · Artificial Intelligence

How GenericAgent Cuts Token Costs by 10× While Boosting AI Agent Performance

The technical report on GenericAgent, a self‑evolving LLM‑based agent, shows that by maximizing context information density and using a minimal atomic toolset with hierarchical memory, it achieves up to ten‑fold token savings, 100% task accuracy, and progressive efficiency gains across multiple benchmarks.

AI benchmarks · GenericAgent · LLM
0 likes · 15 min read
ArcThink
Apr 27, 2026 · Artificial Intelligence

Why GPT‑5.5 Is a True Generational Leap: Deep Dive vs. Claude Opus 4.7

GPT‑5.5, the first fully retrained base model since GPT‑4.5, delivers an 11.7‑point jump on ARC‑AGI‑2, wins 9 of 10 shared benchmarks, shows superior agent and ultra‑long‑context performance, yet incurs higher latency and token pricing, while Claude Opus 4.7 excels on deep‑reasoning tasks, marking a multi‑pole era for frontier AI.

AI benchmarks · Claude Opus 4.7 · GPT-5.5
0 likes · 16 min read
PaperAgent
Apr 26, 2026 · Artificial Intelligence

ICLR 2026 Outstanding Papers Reveal the Real Test for LLMs

The ICLR 2026 Outstanding Paper awards spotlight two studies—one proving Transformers are mathematically succinct and another showing that all major LLMs lose about 39% performance in multi‑turn conversations, exposing a reliability gap missed by single‑turn benchmarks.

AI benchmarks · ICLR 2026 · LLM evaluation
0 likes · 7 min read
AI Waka
Apr 22, 2026 · Artificial Intelligence

Why Enterprise AI Must Prioritize Augmented Intelligence Over Pure Automation

The article analyzes how current AI benchmarks overstate model capabilities, reveals performance gaps in real‑world professional tasks, and argues that effective enterprise AI requires augmented intelligence through governance engineering, context management, and human‑in‑the‑loop design rather than full automation.

AI benchmarks · Context Window · augmented intelligence
0 likes · 23 min read
Lao Guo's Learning Space
Apr 20, 2026 · Artificial Intelligence

Claude Opus 4.7: Programming Power Peaks but Faces ‘Dumbing‑Down’ Criticism

Anthropic’s Claude Opus 4.7 launches with record‑breaking programming benchmarks, a new xhigh effort mode and a free 1 M‑token context window, yet an AMD audit reveals a steep drop in real‑world engineering accuracy, reduced cache TTL and a shift to usage‑based pricing that has sparked community backlash.

1M token context · AI benchmarks · Claude Opus 4.7
0 likes · 10 min read
Machine Learning Algorithms & Natural Language Processing
Apr 17, 2026 · Artificial Intelligence

Claude Opus 4.7’s Visual and Long‑Context Leap: Near‑Full Vision and 1M‑Token Tasks Redefine Knowledge Work

Claude Opus 4.7, announced as Anthropic’s most capable publicly available model, dramatically improves visual reasoning, long‑context task handling and instruction following, delivering up to a 2.4‑fold boost on benchmarks such as XBOW, SWE‑bench and structural biology, while also introducing new security guardrails and token‑usage costs.

AI benchmarks · Anthropic · Claude Opus 4.7
0 likes · 11 min read
DataFunTalk
Apr 8, 2026 · Artificial Intelligence

Claude Mythos Preview Crushes Benchmarks and Reveals 27‑Year‑Old Zero‑Day

Anthropic's Claude Mythos Preview outperforms GPT‑5.4, Gemini 3.1 Pro and Opus 4.6 across dozens of AI benchmarks, autonomously discovers thousands of software vulnerabilities, exploits them without human guidance, and raises serious alignment and security concerns for the industry.

AI benchmarks · Anthropic · Claude Mythos
0 likes · 15 min read
PaperAgent
Mar 6, 2026 · Artificial Intelligence

Which Frontier AI Model Leads 2026? GPT‑5.4 vs Opus 4.6 vs Gemini 3.1 Pro

A detailed 2026 benchmark comparison shows GPT‑5.4 excelling in knowledge work and native computer use, Gemini 3.1 Pro dominating inference at the lowest price, and Opus 4.6 leading software‑engineering tasks, while highlighting distinct pricing tiers, context‑window sizes, and the need for multi‑model routing.

AI benchmarks · GPT-5.4 · Gemini 3.1 Pro
0 likes · 12 min read
AI Explorer
Mar 6, 2026 · Artificial Intelligence

GPT-5.4 Unveiled: 1M‑Token Context Window and Native Computer Control

OpenAI's GPT-5.4 launch introduces three model tiers, a 1 million‑token context window, native computer‑use abilities, higher factual accuracy and a new Tool Search feature, reshaping enterprise AI capabilities and intensifying competition with Anthropic and Google.

AI benchmarks · Computer Use · Context Window
0 likes · 9 min read
Machine Learning Algorithms & Natural Language Processing
Feb 26, 2026 · Artificial Intelligence

Grok 4.20 Returns: Inside Its Multi‑Agent Design and Real‑World Benchmarks

The article examines the surprise launch of Grok 4.20, detailing its four‑agent architecture, how it cuts hallucinations by about 65%, and presents third‑party benchmark rankings that place it first in Search Arena and fourth in Text Arena, while also showcasing user‑tested code‑generation and creative capabilities.

AI benchmarks · Code Generation · Grok 4.20
0 likes · 7 min read
ShiZhen AI
Feb 20, 2026 · Artificial Intelligence

Gemini 3.1 Pro Doubles Reasoning Scores, Beats Claude and GPT on ARC‑AGI‑2

Google’s Gemini 3.1 Pro achieves a 148% jump to 77.1% on the ARC‑AGI‑2 benchmark, scores a perfect 100% on AIME 2025, outperforms Claude Opus 4.6 and GPT‑5.2 on abstract reasoning, while offering 1 M‑token context, real‑time code demos, and immediate platform rollout.

AI benchmarks · AIME 2025 · ARC-AGI-2
0 likes · 7 min read
AI Engineering
Feb 20, 2026 · Artificial Intelligence

Gemini 3.1 Pro Doubles Reasoning Power and Outperforms Claude Opus 4.6

Google's Gemini 3.1 Pro achieves a 77.1% ARC‑AGI‑2 score—more than double its predecessor—leads in multiple benchmark categories, cuts inference cost by half compared to top rivals, and demonstrates advanced multimodal and programming capabilities through real‑world demos.

AI benchmarks · ARC-AGI-2 · Claude Opus 4.6
0 likes · 9 min read
Wuming AI
Feb 20, 2026 · Artificial Intelligence

Gemini 3.1 Pro: How Google Boosted Reasoning Scores and What It Means for Developers

Google's Gemini 3.1 Pro preview raises reasoning benchmark scores dramatically, offers new pricing tiers, and is already integrated into Gemini API, CLI, Vertex AI, and consumer apps, while community demos showcase SVG animation, real‑time dashboards, 3D simulations, and heat‑transfer analysis.

AI benchmarks · Gemini 3.1 Pro · Google AI
0 likes · 5 min read
Node.js Tech Stack
Feb 18, 2026 · Artificial Intelligence

Claude Sonnet 4.6 Unveiled: The New ‘Super‑Worker’ Model with Epic Computer‑Use Leap

Anthropic’s Claude Sonnet 4.6, released on Chinese New Year, boosts computer‑use ability, supports a 1 million‑token context window, adds dynamic web‑search filtering, and improves benchmark scores (OSWorld 72.5%, SWE‑bench 79.6%, GPQA 89.9%) while keeping the same price, earning high praise from industry leaders.

1M token context · AI benchmarks · Anthropic
0 likes · 8 min read
ShiZhen AI
Feb 17, 2026 · Artificial Intelligence

Sonnet 4.6 Nears Opus Performance While Retaining Sonnet Pricing

Anthropic released Sonnet 4.6 just 12 days after Opus 4.6, delivering near‑Opus capabilities across coding, computer use, long‑context reasoning, and agent planning with a 1 M‑token window, while keeping the lower Sonnet price, prompting mixed community debate and rapid ecosystem adoption.

AI benchmarks · Agent planning · Anthropic
0 likes · 12 min read
AI Engineering
Feb 12, 2026 · Artificial Intelligence

GLM-5 Unveiled: 744B‑Parameter Model Takes on Claude in Complex Tasks

GLM-5, the new 744‑billion‑parameter open‑source LLM, expands on GLM‑4.5 with GlmMoeDsa architecture, achieves higher HLE benchmark scores than Claude Opus 4.5, demonstrates strong long‑context and agent capabilities, supports vLLM/SGLang, runs on various Chinese chips, and can directly generate Office documents.

AI benchmarks · Chinese chips · Claude
0 likes · 5 min read
PaperAgent
Feb 11, 2026 · Artificial Intelligence

Unlocking Agentic Reasoning: A Deep Dive into the New LLM Paradigm

This comprehensive review dissects the emerging Agentic Reasoning paradigm for large language models, outlining its three‑layer architecture, core capabilities, optimization modes, benchmark suites, and real‑world applications across mathematics, science, embodied AI, healthcare, and autonomous web exploration.

AI benchmarks · Agentic Reasoning · Autonomous Agents
0 likes · 10 min read
AI Info Trend
Jan 7, 2026 · Artificial Intelligence

MiroThinker 1.5: 30B Model Beats 1T‑Scale LLMs via Interactive Scaling

Released by the MiroMind team, MiroThinker 1.5 demonstrates that a 30‑billion‑parameter model can match or surpass the performance of 1‑trillion‑parameter LLMs by leveraging Interactive Scaling, achieving top rankings on multiple search benchmarks, dramatically lower inference cost, and open‑source availability for developers.

AI benchmarks · MiroThinker · interactive scaling
0 likes · 6 min read
Design Hub
Dec 12, 2025 · Artificial Intelligence

GPT-5.2 Unveiled: A Cutting-Edge AI Super-Assistant Built for Real-World Work

OpenAI's newly released GPT-5.2 claims to outperform human experts on about 70% of real tasks, achieve a perfect score on the AIME 2025 competition, and deliver dramatic efficiency gains—up to 390× cost reduction—while showcasing impressive examples such as one‑shot ocean shader generation, a full 3D engine built in a single file, and visual‑perception scores rivaling top models.

AI benchmarks · Agent AI · Design Automation
0 likes · 8 min read
ShiZhen AI
Dec 6, 2025 · Artificial Intelligence

OpenAI’s Daily Users Plunge 12 M as Gemini 3 Threatens; GPT‑5.2 Rushed for Dec 9

Amid a 6% (≈12 million) daily‑active‑user decline triggered by Google’s Gemini 3 launch, OpenAI’s leadership issued a “red‑alert”, accelerated the release of GPT‑5.2 to Dec 9, halted ad and Pulse projects, and outlined strategic risks, competitive benchmarks, and the future “Garlic” roadmap.

AI Industry Analysis · AI benchmarks · GPT-5.2
0 likes · 15 min read
Instant Consumer Technology Team
Nov 21, 2025 · Artificial Intelligence

Gemini 3 Pro Unleashed: From Instant Webpage Replication to Record‑Breaking AI Benchmarks

The author puts Google’s Gemini 3 Pro through a series of real‑world tests—replicating popular homepages, generating weather cards, creating interactive games and 3D animations, and measuring benchmark scores—showing dramatic improvements over Gemini 2.5 Pro and highlighting its multimodal reasoning, code generation, and API availability.

AI benchmarks · Code Generation · Gemini 3
0 likes · 7 min read
21CTO
Nov 5, 2025 · Artificial Intelligence

Why OpenAI Is Building a New Indian Language Benchmark (IndQA) and What It Means for AI

OpenAI acknowledges that existing multilingual AI benchmarks like MMMLU are saturated and insufficient for cultural nuance, so it is launching IndQA—a comprehensive test covering 12 Indian languages and ten cultural domains—to better evaluate models' understanding and reasoning across diverse regional contexts.

AI benchmarks · IndQA · Indian languages
0 likes · 4 min read
Baidu Tech Salon
Oct 10, 2025 · Artificial Intelligence

Navigating the 2025 AI Model Boom: Practical Evaluation Strategies

This article examines the rapid surge of large AI models in 2024‑2025, critiques the reliability of public leaderboards, and presents a business‑focused evaluation framework—including dataset construction, metric selection, automation, and LLM‑as‑judge techniques—to help developers choose the right model for real‑world applications.

AI Performance · AI benchmarks · LLM-as-judge
0 likes · 17 min read
AI Info Trend
Aug 12, 2025 · Artificial Intelligence

OpenAI’s First Open‑Source Weights: Inside gpt‑oss‑120B & 20B Models

OpenAI has unveiled its first open‑weight models in over five years—gpt‑oss‑120B and gpt‑oss‑20B—detailing their MoE architecture, quantization techniques, benchmark performance, licensing, and the industry’s mixed reactions, while hinting at future open‑source AI developments.

AI benchmarks · GPT-OSS · Industry analysis
0 likes · 6 min read
Data Party THU
Aug 11, 2025 · Artificial Intelligence

What Makes GPT‑5 the Most Powerful AI Model Yet? A Deep Dive into Its Architecture and Benchmarks

The article analyzes GPT‑5’s unified system, advanced reasoning models, and impressive benchmark gains across programming, creative writing, and health domains, highlighting its new router, Verbosity API, and record‑setting performance on tasks such as Aider polyglot, AIME 2025, and HealthBench.

AI benchmarks · AI reasoning · GPT-5
0 likes · 7 min read
DataFunTalk
Feb 2, 2025 · Artificial Intelligence

DeepSeek Releases Janus‑Pro‑7B Multimodal Model, Beats DALL‑E 3 and Stable Diffusion on Benchmarks

DeepSeek's newly released Janus‑Pro‑7B multimodal model, open‑sourced overnight, outperforms DALL‑E 3 and Stable Diffusion on GenEval and DPG‑Bench, showcases a unified autoregressive architecture with a SigLIP‑L visual encoder, and has sparked massive user adoption and market reactions worldwide.

AI benchmarks · DeepSeek
0 likes · 9 min read
NewBeeNLP
Aug 22, 2024 · Artificial Intelligence

How to Fine‑Tune GPT‑4o for Free: Costs, Steps, and Real‑World Benchmarks

OpenAI has launched low‑cost fine‑tuning for GPT‑4o, offering free daily training tokens, a simple dashboard workflow, and early benchmark results that show significant performance gains, while the community debates the merits of fine‑tuning versus prompt‑caching for efficient AI applications.

AI benchmarks · Fine-tuning · GPT-4o
0 likes · 6 min read
Alibaba Cloud Infrastructure
Jul 30, 2022 · Cloud Computing

Highlights from the First China Computing Conference: Cloud Computing as the Foundation of the Digital Economy

The inaugural China Computing Conference in Jinan featured keynote speeches by Alibaba Cloud leaders emphasizing cloud computing as the backbone of the digital economy, showcased breakthrough immersion liquid‑cooling technology, the Zhenduan heterogeneous computing platform with record‑breaking AI benchmark results, and announced a series of innovative cloud‑native solutions and awards.

AI benchmarks · Alibaba Cloud · Digital Economy
0 likes · 6 min read