Tagged articles
13 articles
Page 1 of 1
Old Zhang's AI Learning
Old Zhang's AI Learning
Apr 28, 2026 · Artificial Intelligence

vLLM 0.20 Arrives with DeepSeek V4 Support – What’s New?

The vLLM 0.20.0 release dramatically upgrades the inference engine with DeepSeek V4 support, default CUDA 13, PyTorch 2.11, Transformers v5 compatibility, FlashAttention 4 MLA prefill, TurboQuant 2‑bit KV cache, an online quantization front‑end, IR enhancements, Model Runner V2 features, and a slew of new models, while providing detailed installation and upgrade guidance.

CUDA 13DeepSeek-V4FlashAttention
0 likes · 10 min read
vLLM 0.20 Arrives with DeepSeek V4 Support – What’s New?
Data Party THU
Data Party THU
Mar 26, 2026 · Artificial Intelligence

How Mixture-of-Depths Attention Boosts Large Language Model Efficiency

This article examines the Mixture‑of‑Depths Attention (MoDA) mechanism, detailing its novel flash‑compatible KV layout, combined sequence‑depth attention, theoretical analysis, and extensive experiments that show significant reductions in validation loss and accuracy gains on downstream tasks compared to the OLMo2 baseline.

Attention MechanismDeep KVFlashAttention
0 likes · 9 min read
How Mixture-of-Depths Attention Boosts Large Language Model Efficiency
DeepHub IMBA
DeepHub IMBA
Mar 25, 2026 · Artificial Intelligence

TPU Architecture and Pallas Kernels: From Memory Hierarchy to FlashAttention

This article explains why TPU programming differs from GPU, describes the explicit HBM‑VMEM‑register data movement required on TPU, introduces the Pallas grid‑BlockSpec‑Ref model, and walks through four progressively more complex kernels—including element‑wise add, tiled dot product, fused RMSNorm with scratch memory, and a production‑grade FlashAttention implementation—showing how each kernel maps to the TPU memory hierarchy and leverages Pallas features such as input_output_aliases and PrefetchScalarGridSpec.

FlashAttentionJAXMemory Hierarchy
0 likes · 20 min read
TPU Architecture and Pallas Kernels: From Memory Hierarchy to FlashAttention
Old Zhang's AI Learning
Old Zhang's AI Learning
Mar 7, 2026 · Artificial Intelligence

vLLM 0.17.0 Release: Full Qwen 3.5 Support and Anthropic API Compatibility

The vLLM 0.17.0 release brings FlashAttention 4 integration, a mature Model Runner V2, complete Qwen 3.5 series support, a one‑click performance‑mode flag, Anthropic API compatibility, advanced weight‑offloading, broader hardware support beyond NVIDIA, ASR model integration, and detailed upgrade and installation guidance.

ASRAnthropic APIFlashAttention
0 likes · 12 min read
vLLM 0.17.0 Release: Full Qwen 3.5 Support and Anthropic API Compatibility
Code Mala Tang
Code Mala Tang
Mar 5, 2026 · Artificial Intelligence

Master YOLOv12: A Step‑by‑Step Guide to Build, Train, and Deploy Custom Models

This tutorial walks readers through the fundamentals of YOLOv12, covering model variants, dataset preparation with Roboflow, optional FlashAttention acceleration, installation, model selection, training commands, post‑training tasks such as tracking, validation, inference, exporting to ONNX, and benchmarking, all with concrete code snippets and practical tips.

Computer VisionFlashAttentionModel Training
0 likes · 8 min read
Master YOLOv12: A Step‑by‑Step Guide to Build, Train, and Deploy Custom Models
AI2ML AI to Machine Learning
AI2ML AI to Machine Learning
Dec 19, 2025 · Artificial Intelligence

The 9 Key Ideas Behind FlashAttention

FlashAttention accelerates transformer inference by combining nine techniques—including loss‑less attention, GPU memory‑pyramid optimization, SRAM‑reusing tiling, safe softmax scaling, online buffering, tile‑size constraints, parallel multiplication, reduced KV slicing, and integrated backward‑pass caching—to achieve efficient, high‑throughput computation on modern GPUs.

Attention MechanismFlashAttentionGPU Optimization
0 likes · 8 min read
The 9 Key Ideas Behind FlashAttention
Bilibili Tech
Bilibili Tech
Mar 4, 2025 · Artificial Intelligence

Engineering Practices and Optimizations for Text‑to‑Video Generation Models (OpenSora, CogVideoX) on Bilibili TTV Team

The Bilibili TTV team optimized OpenSora and CogVideoX text‑to‑video models by redesigning data storage with Alluxio, parallelizing VAE encoding, applying dynamic sequence‑parallel and DeepSpeed‑Ulysses attention, adapting GPU code for NPU execution, leveraging profiling‑driven kernel fusion, FlashAttention, and expandable memory to dramatically increase training efficiency and frame throughput, while outlining future pipeline‑parallel and ZeRO‑3 scaling plans.

Diffusion TransformerFlashAttentionModel Parallelism
0 likes · 26 min read
Engineering Practices and Optimizations for Text‑to‑Video Generation Models (OpenSora, CogVideoX) on Bilibili TTV Team
NewBeeNLP
NewBeeNLP
Nov 18, 2024 · Artificial Intelligence

How to Optimize Multi-Head Attention: From MQA to FlashAttention and Beyond

This article examines various techniques for compressing and accelerating the KV cache in transformer models—including MQA, GQA, MLA, sliding‑window and linear attention, flash attention, page and ring attention, as well as mixed‑precision training and ZeRO parallelism—providing code snippets, implementation details, and practical trade‑offs.

FlashAttentionKV cacheModel Parallelism
0 likes · 17 min read
How to Optimize Multi-Head Attention: From MQA to FlashAttention and Beyond
Baobao Algorithm Notes
Baobao Algorithm Notes
Nov 7, 2024 · Artificial Intelligence

Demystifying FlashAttention: A Minimalist Derivation of the Algorithm

This article presents a concise, step‑by‑step derivation of FlashAttention, explaining the prerequisite linear‑algebra concepts, the softmax simplifications, and the parallel computation workflow—including the LSE‑enhanced version—so readers can grasp the algorithm’s elegance without heavy mathematics.

Algorithm DerivationAttention MechanismFlashAttention
0 likes · 8 min read
Demystifying FlashAttention: A Minimalist Derivation of the Algorithm
Sohu Tech Products
Sohu Tech Products
Sep 11, 2024 · Artificial Intelligence

How RoPE and FlashAttention Empower GLM-4-Plus for Long-Text Mastery

This article explains the core mechanisms of Transformer models, details the Rotational Position Embedding (RoPE) and FlashAttention techniques for handling long sequences, introduces the GLM-4-Plus series, and presents an empirical evaluation on the THUCNews dataset showing its superior long-text performance.

FlashAttentionGLM-4-PlusLong Text
0 likes · 13 min read
How RoPE and FlashAttention Empower GLM-4-Plus for Long-Text Mastery
DeWu Technology
DeWu Technology
May 15, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference: Techniques and Framework Recommendations

Deploying a dedicated inference cluster and applying four key optimizations—FlashAttention‑based attention computation, PageAttention KV‑cache management, Mixture‑of‑Experts parameter reduction, and tensor parallelism—can accelerate large language model inference by up to 50% for models as large as 70 B parameters while cutting deployment costs.

FlashAttentionInference AccelerationMixture of Experts
0 likes · 17 min read
Accelerating Large Language Model Inference: Techniques and Framework Recommendations
Huawei Cloud Developer Alliance
Huawei Cloud Developer Alliance
Nov 16, 2023 · Artificial Intelligence

ChatGLM2 vs ChatGLM3: MQA, FlashAttention, and New Prompt Features

During the Saturday session, we reviewed ChatGLM2’s upgrades—Multi‑Query Attention and FlashAttention—demonstrated deployment on Ascend + ModelArts + MindSpore, and introduced ChatGLM3’s revamped prompt design, native tool‑calling and code‑interpreter capabilities, while previewing the next lecture on text‑generation decoding.

ChatGLM2ChatGLM3FlashAttention
0 likes · 6 min read
ChatGLM2 vs ChatGLM3: MQA, FlashAttention, and New Prompt Features