Tagged articles

AI performance

28 articles · Page 1 of 1

Jun 1, 2026 · Artificial Intelligence

When CPUs Hide GPU Bottlenecks: How Btune 2.0’s Automated Latency Analysis Breaks the Performance Black Box

The article examines how hidden CPU‑GPU coordination issues can cripple AI inference performance, illustrates a real‑world XPU migration case in which a kernel lock in the halolet component throttled throughput, and shows how Btune 2.0’s automated latency analysis and AI agent pinpoint and resolve such bottlenecks.

AI performance · Btune 2.0 · CPU-GPU bottleneck

0 likes · 10 min read

Machine Heart

May 4, 2026 · Artificial Intelligence

Mega MoE vs SonicMoE: Which Will Lead the Next AI Speed Race?

SonicMoE, a new ultra‑fast Mixture‑of‑Experts model from Tri Dao and Ion Stoica’s team, achieves peak throughput on Nvidia Blackwell GPUs, outperforms DeepSeek’s DeepGEMM, and introduces algorithmic redesigns that decouple activation memory from expert granularity while fusing I/O‑aware kernels for up to double the speed of existing MoE frameworks.

AI performance · Blackwell · GPU Acceleration

0 likes · 12 min read

Java Tech Enthusiast

Apr 10, 2026 · Industry Insights

Why Claude’s Performance Is Dropping: Data‑Driven Insights into AI Model Degradation

Since early 2024, Claude users have reported shallower reasoning, frequent failures, and soaring token costs, and an analysis of 6,852 logs reveals a 67% drop in thinking depth, disabled plan mode, and an 80‑fold increase in API expenses, highlighting a concerning industry‑wide trend of silent AI model downgrades.

AI model degradation · AI performance · Anthropic

0 likes · 9 min read

Lao Guo's Learning Space

Apr 8, 2026 · Artificial Intelligence

2026 Qwen Model Comparison: Choose the Right Qwen for Your Mac Studio

An in‑depth 2026 comparative review of Alibaba’s Qwen series (Qwen2.5, Qwen3, Qwen3.5) evaluates architecture, performance, speed and VRAM usage on Mac Studio, ranks each variant, and provides concrete model‑selection guidance for different memory configurations, highlighting the MoE‑based Qwen3.5 as the optimal choice.

AI performance · Mac Studio · MoE

0 likes · 9 min read

DataFunSummit

Mar 21, 2026 · Artificial Intelligence

How Slidebatching Revolutionizes LLM Inference Scheduling for Faster, More Efficient AI Services

The article examines the memory and latency challenges of 175‑billion‑parameter LLM inference, introduces the xLLM framework’s Slidebatching and PD‑separation scheduling strategies, and details how these techniques achieve up to 35% system‑throughput gains and 52% SLO compliance improvements in real‑world multi‑priority workloads.

AI performance · LLM · PD separation

0 likes · 15 min read

AI Insight Log

Mar 14, 2026 · Artificial Intelligence

Opus 4.6 Unlocks Full 1M‑Token Context—GPT‑5.4 Slumps to 36% Accuracy

Anthropic opened its million‑token context window for Claude Opus 4.6, showing a 78.3% MRCR v2 accuracy while competing models like GPT‑5.4 and Gemini 3.1 Pro fall below 40%, and the release also removes pricing premiums, expands media limits six‑fold, and requires no code changes, dramatically improving Claude Code workflows.

AI performance · Anthropic · Claude Opus

0 likes · 8 min read

DataFunTalk

Oct 16, 2025 · Artificial Intelligence

Apple’s M5 Chip Powers a New AI Surge Across MacBook, iPad, and Vision Pro

Apple quietly updated its website to launch a 14‑inch MacBook Pro, iPad Pro, and Vision Pro equipped with the new M5 chip, which delivers more than four times the AI throughput, 45% faster graphics, and 30% higher unified memory bandwidth while keeping prices unchanged, and introduces features such as 120 Hz external display support and enhanced spatial computing.

AI performance · Apple · M5 chip

0 likes · 12 min read

AntTech

Oct 13, 2025 · Artificial Intelligence

How dInfer Accelerates Diffusion LLM Inference Over 10× Faster Than Fast‑dLLM

Ant Group's open‑source dInfer framework dramatically speeds up diffusion language model inference—achieving more than a ten‑fold boost over Fast‑dLLM, surpassing autoregressive baselines, and delivering 1011 tokens per second on HumanEval—by tackling computational cost, KV‑cache invalidation, and parallel decoding challenges through modular system‑level innovations.

AI performance · Diffusion Language Model · Inference Optimization

0 likes · 11 min read

Baidu Tech Salon

Oct 10, 2025 · Artificial Intelligence

Navigating the 2025 AI Model Boom: Practical Evaluation Strategies

This article examines the rapid surge of large AI models in 2024‑2025, critiques the reliability of public leaderboards, and presents a business‑focused evaluation framework—including dataset construction, metric selection, automation, and LLM‑as‑judge techniques—to help developers choose the right model for real‑world applications.

AI benchmarks · AI performance · Dataset Construction

0 likes · 17 min read

Architects' Tech Alliance

Sep 30, 2025 · Artificial Intelligence

How KV Cache and CachedAttention Revolutionize LLM Inference Efficiency

This article explains how key‑value (KV) caching and the new CachedAttention technique dramatically reduce large‑language‑model inference costs by reusing stored attention data across dialogue turns, leveraging a three‑tier memory hierarchy of HBM, DRAM, and SSD to overcome bandwidth and capacity bottlenecks.

AI performance · CachedAttention · KV cache

0 likes · 8 min read

Instant Consumer Technology Team

Sep 28, 2025 · Artificial Intelligence

Why Chinese AI Agents Lead at Home but Lag Abroad – Key Findings from the 2025 Enterprise AI Agent Report

The 2025 Enterprise AI Agent Research Report reveals that domestic Chinese agents excel in localized tasks and data precision, while international agents dominate in generalization, speed, and iterative efficiency, highlighting six critical adoption metrics and showcasing diverse industry case studies that illustrate the current AI Agent landscape and future opportunities.

AI adoption · AI agents · AI case studies

0 likes · 20 min read

AI Algorithm Path

Aug 20, 2025 · Artificial Intelligence

DeepSeek V3.1 Open‑Source: Unlocking a New Era of Long‑Context AI

DeepSeek V3.1, a 685‑billion‑parameter open‑source model, supports up to 128,000 tokens, delivers mixed‑architecture capabilities, matches top‑tier closed systems in benchmarks, and its rapid community adoption signals a shift toward democratized AI development and new industry dynamics.

AI performance · DeepSeek · Long Context

0 likes · 6 min read

AI Frontier Lectures

Jul 29, 2025 · Industry Insights

SpecForge: Open‑Source Framework Boosts Large‑Model Speculative Sampling by 2.18×

SpecForge, an open‑source training framework built on Eagle3, enables end‑to‑end speculative sampling for ultra‑large language models, integrates tightly with the SGLang inference engine, offers online and offline training modes, supports advanced parallelism strategies, and demonstrates up to 2.18× inference speedup on benchmark tests, with all code and pretrained drafts available on GitHub and Hugging Face.

AI performance · Open-source · Speculative Sampling

0 likes · 9 min read

IT Services Circle

Jul 22, 2025 · Artificial Intelligence

Why Kimi K2 Overtook DeepSeek to Become the Top Open‑Source AI Model

Kimi K2 has surged to the global open‑source #1 spot, ranking fifth overall and rivaling top closed‑source models, thanks to strong multi‑turn dialogue, programming, and complex‑prompt abilities, extensive community adoption, and a refined DeepSeek V3‑based architecture.

AI performance · DeepSeek-V3 · Kimi K2

0 likes · 8 min read

Alibaba Cloud Big Data AI Platform

Jul 16, 2025 · Artificial Intelligence

ChunkFlow: Accelerating Long‑Context Model Fine‑Tuning Up to 4.5× Faster

The paper introduces ChunkFlow, an efficient training framework for variable‑length and ultra‑long sequence datasets that powers Qwen models, achieving up to 4.53× speedup over Megatron‑LM and more than 2× overall performance gains by reorganizing data into fixed‑size chunks and employing a state‑aware scheduler.

AI performance · ChunkFlow · GPU efficiency

0 likes · 7 min read

Tencent Technical Engineering

Jul 11, 2025 · Artificial Intelligence

How DeepSeek Achieved 15,800+ Tokens/s: Full‑Stack Inference Optimizations

This article details the Angel‑HCF team's end‑to‑end DeepSeek inference optimizations—including PD separation, multi‑layer MTP, EP and DP parallelism, hardware‑aware kernels, and load‑balancing strategies—that boost throughput to over 15,800 tokens per second while keeping per‑token latency under 50 ms.

AI performance · DeepSeek · GPU Utilization

0 likes · 13 min read

Instant Consumer Technology Team

Jul 11, 2025 · Artificial Intelligence

Why NVLink Boosts Multi‑GPU Inference: Tensor Parallelism Explained

A recent migration of a multimodal image inference system from an internal network to a cloud environment revealed that NVLink bridges dramatically improve multi‑GPU inference speed by reducing inter‑GPU communication overhead, while tensor‑parallel and data‑parallel strategies each have distinct trade‑offs for model deployment.

AI performance · Data Parallel · GPU inference

0 likes · 11 min read

DataFunSummit

Jul 5, 2025 · Artificial Intelligence

Boosting Large Model Training: Optimizing Performance with the Verl Framework

Join the DataFun Summit 2025 on July 12 to hear Tencent FinTech senior researcher Gong Dihong discuss how redesigning the Verl training system, integrating Megatron and SGLang, and applying new synchronization and offloading techniques dramatically speeds up large‑model reinforcement‑learning training.

AI performance · Megatron · Training Optimization

0 likes · 4 min read

Efficient Ops

May 29, 2025 · Artificial Intelligence

DeepSeek R1 0528 Update: New Features, Performance Gains Over OpenAI o3

DeepSeek quietly launched the R1 0528 model, which early testers report matches OpenAI’s o3 in benchmarks and style, while adding deeper chain‑of‑thought reasoning, better writing output, and extended thinking windows, and the announcement is followed by a promotion for the GOPS Global Ops Conference.

AI performance · Chain-of-Thought · DeepSeek

0 likes · 3 min read

Alibaba Cloud Infrastructure

May 14, 2025 · Artificial Intelligence

How Mooncake’s KVCache Boosts Large‑Model Inference Efficiency and Cost

Mooncake, an open‑source large‑model inference platform, introduces a KVCache‑centric architecture that dramatically improves throughput, reduces latency and cuts inference costs by up to 20%, while integrating with frameworks like SGLang and vLLM and leveraging Alibaba Cloud’s eRDMA and GPUDirect technologies for scalable, high‑performance deployments.

AI performance · Alibaba Cloud · KVCache

0 likes · 7 min read

Ops Development & AI Practice

Apr 2, 2025 · Artificial Intelligence

How Cache‑Augmented Generation (CAG) Supercharges LLM Inference

Cache‑Augmented Generation (CAG) speeds up large language model text generation by caching the Transformer attention layer’s key‑value states, dramatically reducing the quadratic compute cost of autoregressive decoding while keeping the model’s knowledge unchanged.

AI performance · CAG · Cache‑augmented generation

0 likes · 9 min read

AI Frontier Lectures

Mar 17, 2025 · Artificial Intelligence

Can Diffusion Models Outrun Traditional LLMs? Mercury Coder’s Speed & Architecture

The article analyzes Mercury Coder, a diffusion‑based language model that generates text and code in parallel, compares its speed and quality against traditional autoregressive LLMs like GPT‑4o‑mini using a ball‑collision benchmark, and discusses the underlying score‑entropy training, current limitations, and future multimodal potential.

AI performance · Benchmark · Diffusion Models

0 likes · 8 min read

macrozheng

Jan 20, 2025 · Artificial Intelligence

How Redis’s New Multithreaded Query Engine Boosts Vector Search for Real‑Time AI Apps

Redis has introduced a multithreaded query engine that dramatically lowers latency and multiplies throughput for vector‑based retrieval, enabling real‑time RAG applications to approach the 100 ms response target while scaling vertically to billions of documents.

AI performance · Benchmark · RAG

0 likes · 6 min read

Baobao Algorithm Notes

Jan 7, 2025 · Artificial Intelligence

How Efficient Is DeepSeek V3? Calculating Its MFU Around 37%

This article derives DeepSeek V3's training Model FLOPs Utilization (MFU) using publicly available data, showing an MFU of roughly 37%—about a 60% improvement over V2—and provides detailed formulas, parameter settings, and a reproducible Python script.

AI performance · DeepSeek · MFU

0 likes · 8 min read

Alibaba Cloud Big Data AI Platform

Aug 29, 2024 · Artificial Intelligence

How PAI-ChatLearn Accelerates Large‑Scale LLM Alignment Training

PAI-ChatLearn is an open‑source framework that abstracts and decouples alignment training for large language models, offering flexible resource scheduling, multi‑backend support, and significant speedups—up to 208% for 70B models—while supporting RLHF, DPO, and custom training flows.

AI performance · ChatLearn · LLM alignment

0 likes · 11 min read

Baidu Intelligent Cloud Tech Hub

Dec 29, 2022 · Artificial Intelligence

Boost Swin Transformer Speed: Profiling, Mixed Precision, and Operator Fusion Techniques

This article details how to use NVIDIA profiling tools, mixed‑precision training, operator fusion, kernel optimizations, and INT8 quantization to identify and eliminate performance bottlenecks in Swin Transformer models, achieving up to 2.85× training speedup and up to 7.34× inference acceleration on modern GPUs.

AI performance · GPU Optimization · Operator fusion

0 likes · 23 min read

DataFunTalk

Mar 17, 2022 · Artificial Intelligence

Optimizing Distributed Machine Learning Training on Google Vertex AI: Fast Socket and Reduction Server

This article explains how Google Vertex AI tackles the memory‑wall challenge of large‑scale distributed training by introducing Fast Socket, a high‑performance NCCL network stack, and a Reduction Server that halves gradient‑aggregation traffic, delivering significant speedups and cost reductions for AI workloads.

AI performance · Fast Socket · NCCL

0 likes · 19 min read

Programmer DD

Dec 17, 2020 · Artificial Intelligence

Can Huang’s Law Double AI Performance Every Two Years? NVIDIA GTC 2020 Insights

At NVIDIA’s GTC China 2020, chief scientist Bill Dally highlighted “Huang’s Law,” the prediction that GPU-driven AI performance doubles every two years, introduced projects such as MAGNet, optical interconnects, and the Legate programming model, and discussed the broader implications for AI ecosystem development and industry adoption.

AI performance · GPU · Huang's Law

0 likes · 8 min read
