Tagged articles
38 articles
Java Tech Enthusiast
May 19, 2026 · Artificial Intelligence

Why Microsoft Is Dropping the Superior Claude Code for Its Own Copilot CLI

Microsoft is forcing thousands of engineers to abandon the higher‑scoring Claude Code AI coding assistant in favor of GitHub Copilot CLI by June 30, citing cost savings, internal security requirements, and a six‑week migration window despite Claude Code’s better benchmark performance and larger context window.

AI coding assistants · Claude Code · GitHub Copilot
0 likes · 8 min read
SuanNi
May 16, 2026 · Artificial Intelligence

Can a 4B Small Model Replace Top‑Tier Closed‑Source LLMs? Microsoft’s Terminus‑4B Cuts Token Use by 30%

Microsoft’s research shows that a 4‑billion‑parameter small model, Terminus‑4B, can act as an execution sub‑agent for terminal tasks, trimming token consumption by about 30% while preserving performance on demanding SWE‑Bench benchmarks, demonstrating a practical alternative to costly large models.

AI programming · RL training · SWE-bench
0 likes · 7 min read
21CTO
May 15, 2026 · Artificial Intelligence

Why Microsoft Is Dropping Claude Code Despite Its Superior Performance

Microsoft will revoke internal Claude Code licenses and force engineers to switch to GitHub Copilot CLI, citing cost savings and ecosystem control, even though benchmark data shows Claude Code outperforms Copilot on SWE‑bench, multi‑file refactoring, and large‑context tasks.

AI coding assistants · Anthropic · Claude Code
0 likes · 6 min read
Old Meng AI Explorer
May 2, 2026 · Artificial Intelligence

Mastering Claude Code: A Complete Beginner‑to‑Expert Guide

This article provides a comprehensive walkthrough of Claude Code, covering its core capabilities, cross‑platform installation, login methods, basic commands, advanced features like hooks and MCP, a detailed comparison with GitHub Copilot, best‑practice prompts, and FAQs, enabling developers to boost productivity with an AI‑driven terminal assistant.

AI coding assistant · CLI · Claude Code
0 likes · 16 min read
Old Zhang's AI Learning
Apr 29, 2026 · Artificial Intelligence

Top 10 Open‑Source LLM Benchmarks: Scores, Rankings, and What They Test

This article walks through ten mainstream open‑source large‑model benchmarks—SWE‑bench Verified and Pro, MMLU‑Pro, GPQA Diamond, HLE, AIME, HMMT, olmOCR‑bench, Terminal‑Bench 2.0, and EvasionBench—explaining their data, evaluation metrics, current leading models, and the capability dimensions they reveal.

AI Evaluation · LLM benchmarks · MMLU-Pro
0 likes · 20 min read
JavaGuide
Apr 27, 2026 · Artificial Intelligence

DeepSeek V4 Slashes Prices by 75% – Real‑World Claude Code Test with 4M Tokens

DeepSeek V4’s pricing fell 75% overnight, making the V4‑Pro and V4‑Flash models dramatically cheaper than competing AI services; the article details the new rates, compares them with other providers, shows two Claude Code case studies consuming nearly 4 million tokens, and explains how domestic Ascend 950 hardware enables the discount.

AI pricing · Ascend 950 · Claude Code
0 likes · 13 min read
Machine Heart
Apr 26, 2026 · Artificial Intelligence

Surpassing Claude Mythos and GPT‑5.5: Stanford’s New LLM‑as‑a‑Verifier Agent Framework

Stanford, Berkeley and Nvidia introduce LLM‑as‑a‑Verifier, a verification framework that scales verification compute, uses fine‑grained score tokens, repeated checks and criteria decomposition to boost agent performance, eliminate scoring ties and achieve SOTA results on Terminal‑Bench, surpassing Claude Mythos and GPT‑5.5 while improving safety in long‑horizon tasks.

Agent Verification · LLM · LLM-as-a-Verifier
0 likes · 8 min read
Java Web Project
Apr 25, 2026 · Artificial Intelligence

Why GPT-5.5’s Silent Release Signals Real Engineering Power

OpenAI’s April 23, 2026 launch of GPT-5.5 delivers record‑high scores on SWE‑Bench Pro (58.6%) and Terminal‑Bench 2.0 (82.7%), adds persistent multi‑file context, dynamic reasoning time, and token efficiency, while real‑world case studies show substantial productivity gains across engineering teams.

AI Engineering · Codex · GPT-5.5
0 likes · 13 min read
Architecture Digest
Apr 23, 2026 · Artificial Intelligence

Exciting News: IntelliJ IDEA Now Integrated with Codex AI Assistant

JetBrains IDEs from version 2025.3 embed the Codex AI assistant powered by GPT‑5.4, offering faster, context‑aware code generation, project analysis, environment setup, and refactoring, with real‑world demos showing how it can download projects, configure tools, and even build a full mini‑program with minimal manual coding.

AI Assistant · Codex · GPT-5.4
0 likes · 7 min read
ZhiKe AI
Apr 21, 2026 · Artificial Intelligence

Open-Source Kimi K2.6 Beats GPT‑5.4 and Claude Opus 4.6 in Code Generation

Kimi K2.6, an open‑source Chinese LLM, outperforms GPT‑5.4 and Claude Opus 4.6 on SWE‑Bench Pro code tests, delivers 13‑hour uninterrupted coding, runs 300 parallel agents, and costs only one‑twentieth of comparable closed‑source models, while offering a trillion‑parameter MoE architecture and Apache 2.0 licensing.

AI model benchmarks · Apache 2.0 · Kimi K2.6
0 likes · 9 min read
ShiZhen AI
Apr 10, 2026 · Artificial Intelligence

Anthropic Advisor Strategy: Sonnet Runs, Opus Guides – Scores Up, Costs Down

Anthropic’s new Advisor Strategy lets the low‑cost Sonnet (or Haiku) model handle full agent tasks while invoking the powerful Opus model only for difficult decision points, delivering a 2.7‑point score boost on SWE‑bench with roughly 12% lower cost, and can be added with a single API call.

AI agents · Anthropic · Claude
0 likes · 8 min read
Coder Circle
Apr 8, 2026 · Industry Insights

GLM‑5.1 Enables 8‑Hour Continuous Operation and Leads SWE‑bench; Tencent Unveils First Open‑Config AI Browser

The AI daily briefing highlights GLM‑5.1’s 8‑hour continuous reasoning, its top SWE‑bench performance, and a 10% price hike, while Tencent’s QBotClaw becomes the first domestic browser that lets users freely configure large‑model APIs, signaling a shift toward open AI ecosystems in China.

AI ecosystem · AI pricing · GLM-5.1
0 likes · 6 min read
Machine Learning Algorithms & Natural Language Processing
Mar 30, 2026 · Artificial Intelligence

Meituan’s Fully Discrete Multimodal Base (LongCat-Next) Shows All Physical Signals Can Converge to Tokens

LongCat-Next, a 3‑billion‑parameter multimodal model released by Meituan, adopts a purely discrete, token‑based architecture (DiNA) with next‑token prediction. It outperforms same‑size rivals on OmniDocBench‑EN and CharXivRQ, matches QwenVL on visual tasks, avoids catastrophic forgetting, and achieves a SWE‑Bench score of 43.0, as demonstrated through benchmarks and experiments in receipt extraction, OCR, audio dialect reasoning, and image generation.

DiNA · LongCat-Next · OmniDocBench
0 likes · 10 min read
Machine Learning Algorithms & Natural Language Processing
Mar 28, 2026 · Artificial Intelligence

Do All Physical Signals Reduce to a Single Discrete Token? LongCat‑Next Explained

LongCat‑Next, Meituan’s new 3‑billion‑parameter foundation model, adopts a pure‑discrete DiNA architecture with next‑token prediction, converting vision, audio and text into unified tokens; it surpasses same‑size multimodal models on OmniDocBench‑EN, CharXivRQ and SWE‑Bench, avoids catastrophic forgetting, and introduces dNaViT, RVQ compression and a dual‑path detokenizer for high‑fidelity generation.

DiNA · LongCat-Next · Multimodal
0 likes · 10 min read
ShiZhen AI
Mar 28, 2026 · Artificial Intelligence

GLM-5.1 Now Open to All: Performance vs Claude Opus, Pricing & Setup Guide

GLM-5.1 is now available to all Coding Plan subscribers, including the $10/month Lite tier, scoring 45.3 on SWE‑bench—just 5.4% below Claude Opus 4.6’s 47.9—while offering 20+ tool integrations and a manual switch from the default GLM‑4.7 model.

AI coding model · Claude Opus · GLM-5.1
0 likes · 7 min read
AI Insight Log
Mar 16, 2026 · Artificial Intelligence

Cursor’s Own Large‑Model Benchmark Shakes Up SWE‑bench Rankings

Although SWE‑bench scores for top coding models now differ by only a tenth of a point, Cursor’s newly released CursorBench reveals dramatic ranking changes, highlights three fundamental flaws in public benchmarks, and introduces token‑efficiency as a crucial evaluation dimension.

AI Coding · CursorBench · SWE-bench
0 likes · 8 min read
Node.js Tech Stack
Feb 20, 2026 · Frontend Development

Is Frontend Dead Again? Gemini 3.1 Pro’s Leap in Reasoning and Code Generation

Google’s Gemini 3.1 Pro dramatically improves core reasoning scores (77.1% on ARC‑AGI‑2, 80.6% on SWE‑bench) and can generate interactive SVG, complex data‑driven visualizations, and creative‑coding layouts, prompting a reassessment of which front‑end tasks AI can replace and which still require architectural expertise.

AI code generation · Gemini 3.1 Pro · Google AI
0 likes · 6 min read
AI Insight Log
Feb 16, 2026 · Artificial Intelligence

DeepSeek V4 Benchmark Leak Fuels Talk of a New Coding King

A leaked SWE‑Bench score of 83.7% for DeepSeek V4 sparked claims it outperforms Claude Opus 4.5 and GPT‑5.2, but the data was later debunked as fabricated while official hints confirm a 1‑million‑token context model and a mid‑February 2026 release.

AI benchmarking · AI industry · DeepSeek
0 likes · 7 min read
AI Insight Log
Feb 15, 2026 · Artificial Intelligence

Open-Source MiniMax M2.5 Hits New Year Eve: Top Coding Scores and Ultra‑Low Cost

The MiniMax M2.5 model, released open‑source on Feb 13, achieves an 80.2% SWE‑Bench Verified score that surpasses GPT‑5.2, Claude Opus 4.6 and Google Gemini 3 Pro, runs 37% faster than its predecessor, costs only $1 per hour, and demonstrates SOTA agent abilities in browsing and tool use, marking a major leap for Chinese large‑language models.

AI Coding · M2.5 · MiniMax
0 likes · 7 min read
AI Engineering
Feb 12, 2026 · Artificial Intelligence

MiniMax M2.5: 230B‑Parameter Model Activates 10B, Near Claude Sonnet for One‑Tenth the Cost

MiniMax’s new open‑source M2.5 model, built on a 230 billion‑parameter mixture‑of‑experts architecture that activates only 10 billion parameters per inference, delivers performance comparable to Claude Opus 4.6 across benchmarks, while costing roughly one‑tenth as much, and is already handling a large share of the company’s internal tasks.

AI agents · Claude Opus · MiniMax M2.5
0 likes · 6 min read
AI Insight Log
Feb 5, 2026 · Artificial Intelligence

GPT-5.3-Codex vs Claude Opus 4.6: Is the 15% Terminal Coding Boost the Real Game‑Changer for Developers?

The article objectively compares OpenAI's GPT‑5.3‑Codex and Anthropic's Claude Opus 4.6 across Terminal‑Bench 2.0 and SWE‑Bench, revealing a 15% terminal‑coding edge for Codex, modest gains in pure code generation, and a strategic split between specialist and generalist AI approaches.

AI model comparison · Claude Opus 4.6 · GPT-5.3-Codex
0 likes · 9 min read
AI Insight Log
Feb 2, 2026 · Artificial Intelligence

Is Claude Sonnet 5 (Fennec) Really Coming? Leaked Specs Suggest Performance May Beat Opus 4.5

A leaked Google Vertex AI log reveals a new model ID claude‑sonnet‑5@20260203, hinting at a Feb 3, 2026 release of Claude Sonnet 5 (code‑named “Fennec”) that reportedly scores over 82% on SWE‑Bench, outperforms Opus 4.5, keeps the same pricing, and introduces a “Dev Team” mode with parallel sub‑agents for coding tasks.

AI model · Claude Sonnet 5 · Fennec
0 likes · 5 min read
JD Tech Talk
Jan 9, 2026 · Artificial Intelligence

How JoyCode Agent Scored 74.6% Pass@1 on SWE‑bench Verified with a Patch‑Test Co‑generation Loop

JoyCode Agent leverages a patch‑test co‑generation and iterative validation framework to achieve a 74.6% Pass@1 score on the SWE‑bench Verified benchmark, reducing resource consumption by 30‑50% and introducing a closed‑loop multi‑agent pipeline that integrates testing, patch generation, trajectory compression, similarity retrieval, and decision arbitration.

LLM · Multi-Agent · SWE-bench
0 likes · 41 min read
JD Tech Talk
Jan 9, 2026 · Artificial Intelligence

How JoyCode Agent Reached 74.6% Pass@1 on SWE‑bench Verified with a Patch‑Test Co‑Generation Loop

This technical report details JoyCode Agent’s end‑to‑end pipeline that couples patch generation with fail‑to‑pass and pass‑to‑pass test creation, uses trajectory compression, CSR similarity retrieval, and multi‑agent iterative retries to achieve a 74.6% Pass@1 score on the SWE‑bench Verified benchmark while cutting compute costs by 30‑50%.

AI code repair · Automated Testing · Multi-Agent System
0 likes · 38 min read
DataFunTalk
Dec 24, 2025 · Artificial Intelligence

Can MiniMax M2.1 Match Top Coding AIs? A Hands‑On Benchmark Review

This article evaluates MiniMax M2.1’s new coding capabilities across multiple benchmarks, including SWE‑bench, Java satellite‑control projects, full‑stack attack visualizations, and a one‑click mobile‑OS simulation, comparing its performance to Claude Sonnet 4.5 and Opus 4.5.

AI coding assistant · M2.1 · MiniMax
0 likes · 8 min read
AI Insight Log
Dec 11, 2025 · Artificial Intelligence

GPT-5.2 Released: How It Outperforms Claude 4.5 and Gemini 3 Pro

OpenAI’s GPT‑5.2 launch introduces three specialized modes, achieves a record 55.6% score on SWE‑Bench Pro, demonstrates strong front‑end generation, adds a /compact API for long‑context efficiency, offers tiered pricing with cache discounts, and improves safety for younger users.

AI Safety · AI benchmarking · GPT-5.2
0 likes · 6 min read
JD Tech Talk
Nov 3, 2025 · Artificial Intelligence

How JoyCode Agent Achieves 74.6% Pass@1 on SWE‑bench Verified with Patch‑Test Co‑generation

JoyCode Agent reaches a 74.6% pass rate on the authoritative SWE‑bench Verified benchmark, ranking in the global top‑3, and is now open‑source, showcasing a high‑efficiency, test‑driven, iterative approach to automated code repair that dramatically reduces token consumption while improving success rates.

Artificial Intelligence · Automated Code Repair · Benchmarking
0 likes · 44 min read
JD Cloud Developers
Nov 3, 2025 · Artificial Intelligence

How JoyCode Agent Scored 74.6% Pass@1 on SWE‑Bench Verified: Inside the Patch‑Test Co‑generation Pipeline

JoyCode Agent leverages a multi‑agent, patch‑and‑test co‑generation framework with iterative validation, failure attribution, and experience‑driven retries to achieve a 74.6% Pass@1 rate on the SWE‑Bench Verified benchmark, dramatically reducing computational resources while delivering high‑quality code patches.

AI code generation · Multi-Agent System · SWE-bench
0 likes · 34 min read
DataFunTalk
Sep 12, 2025 · Artificial Intelligence

How Shunyu Yao is Shaping the Second Half of AI with Agents

Shunyu Yao, a Princeton‑trained AI researcher who rose through Tsinghua’s elite Yao class and OpenAI, is known for pioneering works like Tree of Thoughts, SWE‑bench, and ReAct, and now focuses on building general‑purpose agents and exploring the “second half” of AI development.

AI research · React · SWE-bench
0 likes · 12 min read
Data Party THU
Jul 31, 2025 · Industry Insights

How mini‑SWE‑agent Solves 65% of SWE‑bench Bugs with Only 100 Lines of Code

The mini‑SWE‑agent, a lightweight open‑source software‑engineering AI built by the original SWE‑bench team, achieves about 65% bug‑fix success on the SWE‑bench benchmark using roughly 100 lines of Python, thanks to its minimal dependencies, shell‑based execution, linear history, and support for various container environments, offering a fast, extensible alternative to the full‑featured SWE‑agent.

AI Agent · LLM · SWE-bench
0 likes · 8 min read
DataFunTalk
Jun 17, 2025 · Artificial Intelligence

Kimi-Dev-72B Sets New Open‑Source SOTA on SWE‑bench Verified (60.4% Score)

Kimi-Dev-72B, an open-source 72-billion-parameter code model from Moonshot AI, achieved a record 60.4% score on the SWE-bench Verified benchmark, surpassing larger models, and incorporates BugFixer/TestWriter dual roles, extensive mid-stage training on billions of GitHub data, and reinforcement-learning-driven self-play, with code available on Hugging Face and GitHub.

Reinforcement Learning · SWE-bench · ai
0 likes · 7 min read
Infra Learning Club
Apr 4, 2025 · Artificial Intelligence

Testing Augment Code: A Powerful New Rival to Cursor

The article evaluates Augment Code, an AI‑powered coding assistant with 200K token context, persistent memory, multimodal input, and top SWE‑bench scores, walks through its installation, explores its use on vLLM and PagedAttention, demonstrates adding a new model and auto‑generating a WeChat mini‑program, and compares its capabilities and speed to Cursor.

AI coding assistant · Augment Code · Cursor
0 likes · 8 min read
Continuous Delivery 2.0
Jul 3, 2024 · Artificial Intelligence

Applying Large Language Models to Software Engineering: Challenges, Cross‑File Editing Issues, Bug‑Fixing Evaluation, and SWE‑Bench Results

This article examines the practical challenges of using large language models in software development, including handling long contexts, cross‑file editing, bug‑fixing evaluation methods, and presents benchmark results from SWE‑Bench and its Lite subset to assess model capabilities.

Cross-File Editing · LLM · SWE-bench
0 likes · 7 min read