Tagged articles
27 articles
Machine Heart
May 4, 2026 · Artificial Intelligence

Mega MoE vs SonicMoE: Which Will Lead the Next AI Speed Race?

SonicMoE, a new ultra‑fast Mixture‑of‑Experts model from Tri Dao and Ion Stoica’s team, achieves peak throughput on Nvidia Blackwell GPUs, outperforms DeepSeek’s DeepGEMM, and introduces algorithmic redesigns that decouple activation memory from expert granularity while fusing I/O‑aware kernels for up to double the speed of existing MoE frameworks.

AI Performance · Blackwell · GPU Acceleration
12 min read
Java Tech Enthusiast
Apr 10, 2026 · Industry Insights

Why Claude’s Performance Is Dropping: Data‑Driven Insights into AI Model Degradation

Since early 2024, Claude users have reported shallower reasoning, frequent failures, and soaring token costs, and an analysis of 6,852 logs reveals a 67% drop in thinking depth, disabled plan mode, and an 80‑fold increase in API expenses, highlighting a concerning industry‑wide trend of silent AI model downgrades.

AI Performance · AI model degradation · Anthropic
9 min read
Lao Guo's Learning Space
Apr 8, 2026 · Artificial Intelligence

2026 Qwen Model Comparison: Choose the Right Qwen for Your Mac Studio

An in‑depth 2026 comparative review of Alibaba’s Qwen series (Qwen2.5, Qwen3, Qwen3.5) evaluates architecture, performance, speed and VRAM usage on Mac Studio, ranks each variant, and provides concrete model‑selection guidance for different memory configurations, highlighting the MoE‑based Qwen3.5 as the optimal choice.

AI Performance · Mac Studio · MoE
9 min read
DataFunSummit
Mar 21, 2026 · Artificial Intelligence

How Slidebatching Revolutionizes LLM Inference Scheduling for Faster, More Efficient AI Services

The article examines the memory and latency challenges of 1750‑billion‑parameter LLM inference, introduces the xLLM framework’s Slidebatching and PD‑separation scheduling strategies, and details how these techniques achieve up to 35% system‑throughput gains and 52% SLO compliance improvements in real‑world multi‑priority workloads.

AI Performance · LLM · PD separation
15 min read
AI Insight Log
Mar 14, 2026 · Artificial Intelligence

Opus 4.6 Unlocks Full 1M‑Token Context—GPT‑5.4 Slumps to 36% Accuracy

Anthropic opened its million‑token context window for Claude Opus 4.6, showing a 78.3% MRCR v2 accuracy while competing models like GPT‑5.4 and Gemini 3.1 Pro fall below 40%, and the release also removes pricing premiums, expands media limits six‑fold, and requires no code changes, dramatically improving Claude Code workflows.

AI Performance · Anthropic · Claude Opus
8 min read
DataFunTalk
Oct 16, 2025 · Artificial Intelligence

Apple’s M5 Chip Powers a New AI Surge Across MacBook, iPad, and Vision Pro

Apple quietly updated its website to launch a 14‑inch MacBook Pro, iPad Pro, and Vision Pro equipped with the new M5 chip, which delivers more than four times the AI throughput, 45% faster graphics, and 30% higher unified memory bandwidth while keeping prices unchanged, and introduces features such as 120 Hz external display support and enhanced spatial computing.

AI Performance · Apple · M5 chip
12 min read
AntTech
Oct 13, 2025 · Artificial Intelligence

How dInfer Accelerates Diffusion LLM Inference Over 10× Faster Than Fast‑dLLM

Ant Group's open‑source dInfer framework dramatically speeds up diffusion language model inference—achieving more than a ten‑fold boost over Fast‑dLLM, surpassing autoregressive baselines, and delivering 1011 tokens per second on HumanEval—by tackling computational cost, KV‑cache invalidation, and parallel decoding challenges through modular system‑level innovations.

AI Performance · Diffusion Language Model · Inference Optimization
11 min read
Baidu Tech Salon
Oct 10, 2025 · Artificial Intelligence

Navigating the 2025 AI Model Boom: Practical Evaluation Strategies

This article examines the rapid surge of large AI models in 2024‑2025, critiques the reliability of public leaderboards, and presents a business‑focused evaluation framework—including dataset construction, metric selection, automation, and LLM‑as‑judge techniques—to help developers choose the right model for real‑world applications.

AI Performance · AI benchmarks · LLM-as-judge
17 min read
Architects' Tech Alliance
Sep 30, 2025 · Artificial Intelligence

How KV Cache and CachedAttention Revolutionize LLM Inference Efficiency

This article explains how key‑value (KV) caching and the new CachedAttention technique dramatically reduce large‑language‑model inference costs by reusing stored attention data across dialogue turns, leveraging a three‑tier memory hierarchy of HBM, DRAM, and SSD to overcome bandwidth and capacity bottlenecks.

AI Performance · CachedAttention · KV cache
8 min read
Instant Consumer Technology Team
Sep 28, 2025 · Artificial Intelligence

Why Chinese AI Agents Lead at Home but Lag Abroad – Key Findings from the 2025 Enterprise AI Agent Report

The 2025 Enterprise AI Agent Research Report reveals that domestic Chinese agents excel in localized tasks and data precision, while international agents dominate in generalization, speed, and iterative efficiency, highlighting six critical adoption metrics and showcasing diverse industry case studies that illustrate the current AI Agent landscape and future opportunities.

AI Performance · AI adoption · AI agents
20 min read
AI Algorithm Path
Aug 20, 2025 · Artificial Intelligence

DeepSeek V3.1 Open‑Source: Unlocking a New Era of Long‑Context AI

DeepSeek V3.1, a 685‑billion‑parameter open‑source model, supports a context window of up to 128,000 tokens, delivers mixed‑architecture capabilities, and matches top‑tier closed systems in benchmarks; its rapid community adoption signals a shift toward democratized AI development and new industry dynamics.

AI Performance · DeepSeek · large language model
6 min read
AI Frontier Lectures
Jul 29, 2025 · Industry Insights

SpecForge: Open‑Source Framework Boosts Large‑Model Speculative Sampling by 2.18×

SpecForge, an open‑source training framework built on Eagle3, enables end‑to‑end speculative sampling for ultra‑large language models, integrates tightly with the SGLang inference engine, offers online and offline training modes, supports advanced parallelism strategies, and demonstrates up to 2.18× inference speedup on benchmark tests, with all code and pretrained draft models available on GitHub and Hugging Face.

AI Performance · Inference Acceleration · Speculative Sampling
9 min read
IT Services Circle
Jul 22, 2025 · Artificial Intelligence

Why Kimi K2 Overtook DeepSeek to Become the Top Open‑Source AI Model

Kimi K2 has surged to the global open‑source #1 spot, ranking fifth overall and rivaling top closed‑source models, thanks to strong multi‑turn dialogue, programming, and complex‑prompt abilities, extensive community adoption, and a refined DeepSeek V3‑based architecture.

AI Performance · DeepSeek-V3 · Kimi K2
8 min read
Alibaba Cloud Big Data AI Platform
Jul 16, 2025 · Artificial Intelligence

ChunkFlow: Accelerating Long‑Context Model Fine‑Tuning Up to 4.5× Faster

The paper introduces ChunkFlow, an efficient training framework for variable‑length and ultra‑long sequence datasets that powers Qwen models, achieving up to 4.53× speedup over Megatron‑LM and more than 2× overall performance gains by reorganizing data into fixed‑size chunks and employing a state‑aware scheduler.

AI Performance · ChunkFlow · Distributed Training
7 min read
Tencent Technical Engineering
Jul 11, 2025 · Artificial Intelligence

How DeepSeek Achieved 15,800+ Tokens/s: Full‑Stack Inference Optimizations

This article details the Angel‑HCF team's end‑to‑end DeepSeek inference optimizations—including PD separation, multi‑layer MTP, EP and DP parallelism, hardware‑aware kernels, and load‑balancing strategies—that boost throughput to over 15,800 tokens per second while keeping per‑token latency under 50 ms.

AI Performance · DeepSeek · GPU utilization
13 min read
Instant Consumer Technology Team
Jul 11, 2025 · Artificial Intelligence

Why NVLink Boosts Multi‑GPU Inference: Tensor Parallelism Explained

A recent migration of a multimodal image inference system from an internal network to a cloud environment revealed that NVLink bridges dramatically improve multi‑GPU inference speed by reducing inter‑GPU communication overhead, while tensor‑parallel and data‑parallel strategies each have distinct trade‑offs for model deployment.

AI Performance · Data Parallel · GPU inference
11 min read
DataFunSummit
Jul 5, 2025 · Artificial Intelligence

Boosting Large Model Training: Optimizing Performance with the Verl Framework

Join the DataFun Summit 2025 on July 12 to hear Tencent FinTech senior researcher Gong Dihong discuss how redesigning the Verl training system, integrating Megatron and SGLang, and applying new synchronization and offloading techniques dramatically speeds up large‑model reinforcement‑learning training.

AI Performance · Megatron · Training Optimization
4 min read
Efficient Ops
May 29, 2025 · Artificial Intelligence

DeepSeek R1 0528 Update: New Features, Performance Gains Over OpenAI o3

DeepSeek quietly launched the R1 0528 model, which early testers report matches OpenAI’s o3 in benchmarks and style, while adding deeper chain‑of‑thought reasoning, better writing output, and extended thinking windows, and the announcement is followed by a promotion for the GOPS Global Ops Conference.

AI Performance · DeepSeek · Model Update
3 min read
Alibaba Cloud Infrastructure
May 14, 2025 · Artificial Intelligence

How Mooncake’s KVCache Boosts Large‑Model Inference Efficiency and Cost

Mooncake, an open‑source large‑model inference platform, introduces a KVCache‑centric architecture that dramatically improves throughput, reduces latency and cuts inference costs by up to 20%, while integrating with frameworks like SGLang and vLLM and leveraging Alibaba Cloud’s eRDMA and GPUDirect technologies for scalable, high‑performance deployments.

AI Performance · Alibaba Cloud · Distributed Systems
7 min read
Ops Development & AI Practice
Apr 2, 2025 · Artificial Intelligence

How Cache‑Augmented Generation (CAG) Supercharges LLM Inference

Cache‑Augmented Generation (CAG) speeds up large language model text generation by caching the Transformer attention layer’s key‑value states, dramatically reducing the quadratic compute cost of autoregressive decoding while keeping the model’s knowledge unchanged.

AI Performance · CAG · Cache‑augmented generation
9 min read
AI Frontier Lectures
Mar 17, 2025 · Artificial Intelligence

Can Diffusion Models Outrun Traditional LLMs? Mercury Coder’s Speed & Architecture

The article analyzes Mercury Coder, a diffusion‑based language model that generates text and code in parallel, compares its speed and quality against traditional autoregressive LLMs like GPT‑4o‑mini using a ball‑collision benchmark, and discusses the underlying score‑entropy training, current limitations, and future multimodal potential.

AI Performance · Benchmark · Mercury
8 min read
Baobao Algorithm Notes
Jan 7, 2025 · Artificial Intelligence

How Efficient Is DeepSeek V3? Calculating Its MFU Around 37%

This article derives DeepSeek V3's training Model FLOPs Utilization (MFU) using publicly available data, showing an MFU of roughly 37%—about a 60% improvement over V2—and provides detailed formulas, parameter settings, and a reproducible Python script.

AI Performance · DeepSeek · MFU
8 min read
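For readers unfamiliar with the metric, MFU is the ratio of the FLOPs a training run usefully performs to the FLOPs the hardware could have delivered in the same time. The sketch below is not the article's script; it uses the common 6 × N × D approximation for training FLOPs (N = activated parameters, D = training tokens), which ignores attention FLOPs, and every input value is a hypothetical placeholder rather than a DeepSeek figure.

```python
def model_flops_utilization(active_params, tokens, gpu_hours, peak_flops_per_gpu):
    """Approximate training MFU with the 6 * N * D FLOPs estimate.

    active_params      -- parameters used per token (activated params for MoE)
    tokens             -- number of training tokens
    gpu_hours          -- total GPU-hours consumed by the run
    peak_flops_per_gpu -- peak FLOP/s of one GPU at the training precision
    """
    # Forward + backward cost is roughly 6 FLOPs per parameter per token.
    training_flops = 6.0 * active_params * tokens
    # FLOPs the cluster could have delivered at peak over the same GPU time.
    available_flops = gpu_hours * 3600.0 * peak_flops_per_gpu
    return training_flops / available_flops


# Hypothetical inputs chosen only to illustrate the arithmetic.
mfu = model_flops_utilization(
    active_params=10e9,          # 10B activated parameters (assumed)
    tokens=1e12,                 # 1T training tokens (assumed)
    gpu_hours=100_000,           # assumed GPU-hours
    peak_flops_per_gpu=500e12,   # 500 TFLOP/s peak (assumed)
)
print(f"MFU = {mfu:.1%}")
```

Because the 6 × N × D estimate leaves out attention compute, a more careful calculation that adds those terms will report a somewhat higher MFU for the same run.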
Baidu Intelligent Cloud Tech Hub
Dec 29, 2022 · Artificial Intelligence

Boost Swin Transformer Speed: Profiling, Mixed Precision, and Operator Fusion Techniques

This article details how to use NVIDIA profiling tools, mixed‑precision training, operator fusion, kernel optimizations, and INT8 quantization to identify and eliminate performance bottlenecks in Swin Transformer models, achieving up to 2.85× training speedup and up to 7.34× inference acceleration on modern GPUs.

AI Performance · GPU Optimization · Operator fusion
23 min read
DataFunTalk
Mar 17, 2022 · Artificial Intelligence

Optimizing Distributed Machine Learning Training on Google Vertex AI: Fast Socket and Reduction Server

This article explains how Google Vertex AI tackles the memory‑wall challenge of large‑scale distributed training by introducing Fast Socket, a high‑performance NCCL network stack, and a Reduction Server that halves gradient‑aggregation traffic, delivering significant speed‑up and cost‑reduction for AI workloads.

AI Performance · Cloud AI · Fast Socket
19 min read
Programmer DD
Dec 17, 2020 · Artificial Intelligence

Can Huang’s Law Double AI Performance Every Two Years? NVIDIA GTC 2020 Insights

At NVIDIA’s GTC China 2020, chief scientist Bill Dally highlighted “Huang’s Law,” the prediction that GPU‑driven AI performance doubles every two years, introduced projects such as MAGNet, optical interconnects, and the Legate programming model, and discussed the broader implications for AI ecosystem development and industry adoption.

AI Performance · GPU · Huang's Law
8 min read