Tagged articles

FP8

30 articles

Jun 12, 2026 · Artificial Intelligence

MiniMax Open-Source MSA: High‑Performance Attention Kernels Optimized for NVIDIA SM100

MiniMax Sparse Attention (MSA) is an open‑source library that delivers high‑performance dense and block‑sparse attention operators for NVIDIA SM100 GPUs by combining a Jinja‑based csrc JIT stack with a Cutlass Python DSL (CuTe‑DSL), enabling low‑precision quantization, paging support, and seamless migration from dense code.

AI Kernels · CuTe-DSL · Cutlass

0 likes · 5 min read

Architects' Tech Alliance

Jun 7, 2026 · Industry Insights

2026 China GPU Chip Industry: Market Share, Technology Trends, and Future Outlook

The 2026 analysis shows China's GPU makers capturing a growing share of the $1.12 trillion global AI GPU market, with Huawei Ascend leading at 44%. Domestic firms are leveraging 7nm processes, Chiplet designs, and FP8 breakthroughs, while Nvidia and AMD face increasing competition from Chinese players expanding into inference, edge, and enterprise segments.

AI chips · China · Chiplet

0 likes · 5 min read

Architects' Tech Alliance

May 7, 2026 · Artificial Intelligence

Huawei Ascend AI Chip Detailed Specs Comparison (2025‑2028 Roadmap)

The article analyzes Huawei's Ascend AI chip evolution from the 910C baseline through the 950 series' low‑precision FP8/FP4 breakthrough to the 960/970 generation’s 8 PFLOPS performance, highlighting architectural innovations, memory and interconnect upgrades, scenario‑specific models, and a cost advantage over competing solutions.

AI chip · Ascend · FP8

0 likes · 6 min read

Architects' Tech Alliance

May 6, 2026 · Artificial Intelligence

How DeepSeek V4 and Huawei Ascend 950 Redefined China’s AI Chip Landscape

The article details how DeepSeek V4 became the first top‑level large model to run on Huawei's Ascend 950 PR chip, delivering up to 2.87× the performance of Nvidia H20, cutting inference cost by up to 90%, and spurring a booming domestic AI‑chip ecosystem and supply‑chain surge.

AI chip performance · AI inference · CANN Next

0 likes · 10 min read

Architects' Tech Alliance

May 4, 2026 · Artificial Intelligence

DeepSeek‑V4 Inference Cost Showdown: NVIDIA H100 vs Ascend 950PR vs 910C

DeepSeek‑V4, a 1.6‑trillion‑parameter MoE model with mixed‑precision attention, is benchmarked on three accelerators—NVIDIA H100, Huawei Ascend 910C, and Ascend 950PR—showing that the 950PR delivers the lowest per‑token cost in both Prefill and Decode phases, while the H100 offers the highest raw performance at a far greater price.

DeepSeek-V4 · FP8 · Huawei Ascend 950PR

0 likes · 8 min read

Old Zhang's AI Learning

Apr 24, 2026 · Artificial Intelligence

DeepSeek V4 Surge: Technical Specs, Quantization Details, Deployment Costs, and Market Impact

The article compiles key information on DeepSeek V4, covering Ollama's one‑click launch, the model's FP4/FP8 mixed‑precision quantization, size reductions, high local deployment costs, recent benchmark rankings, and the accompanying stock price movements in both China and the US.

AI benchmarks · DeepSeek-V4 · FP4

0 likes · 5 min read

Old Zhang's AI Learning

Apr 22, 2026 · Artificial Intelligence

Qwen3.6-27B Open‑Source: How a 27B Dense Model Outperforms the 397B Giant

The newly released Qwen3.6-27B dense multimodal model, at just 27B parameters, surpasses the 397B flagship on most coding benchmarks, offers a context window of up to 1M tokens, supports FP8 quantization, and can be deployed locally via vLLM, SGLang, or Transformers on modest hardware.

27B · Dense Model · FP8

0 likes · 12 min read

Old Zhang's AI Learning

Mar 9, 2026 · Artificial Intelligence

Deploying Qwen3.5 with vLLM: Full-Precision and Quantized Versions, Concurrency Benchmarks, and Scripts

The article walks through upgrading vLLM to 0.17.0, configuring Docker containers for 4090 GPUs, comparing FP8 and 4‑bit quantization of Qwen3.5 35B and 27B models, and presents detailed performance numbers and script parameters that reveal trade‑offs in memory usage and throughput.

4-bit quantization · Docker · FP8

0 likes · 7 min read

Fun with Large Models

Feb 17, 2026 · Artificial Intelligence

Inside Qwen3.5: The World’s Strongest Open‑Source Multimodal Model and Its Core Features

Qwen3.5‑397B‑A17B, the newly open‑sourced multimodal giant, combines a 400‑billion‑parameter sparse MoE architecture with FP8 pipelines and an asynchronous RL framework to deliver GPT‑5.2‑level capabilities, 60% lower memory usage, up to 19× higher throughput, and extensive image, video, and agent support, while outlining its deployment requirements and API pricing.

AI inference · FP8 · Qwen3.5

0 likes · 11 min read

Design Hub

Jan 9, 2026 · Artificial Intelligence

LTX‑2 Acceleration Secrets: Boost Speed, Stability, and Visual Quality

This article walks through practical steps to speed up LTX‑2 AI video generation—enabling the NVFP4 model, updating NVIDIA drivers and CUDA, using FP8 text encoders, and applying a custom prompt‑optimizing assistant—showing memory savings, sub‑minute rendering at 1280×720, and noticeable quality gains.

AI video generation · FP8 · LTX-2

0 likes · 11 min read

Architects' Tech Alliance

Dec 28, 2025 · Artificial Intelligence

Why AWS Trainium3 Could Redefine AI Compute: Specs, Performance, and Market Impact

AWS's new Trainium3 chip, built on a 3nm process with FP8 performance of up to 2.52 PFLOPS, promises massive compute gains, lower costs, and a new cloud‑centric AI ecosystem, challenging Nvidia's dominance and reshaping the AI hardware market.

3nm · AI hardware · AWS

0 likes · 12 min read

Architects' Tech Alliance

Oct 29, 2025 · Artificial Intelligence

Why China’s AI Chip Industry Is Poised for a Breakthrough – Trends, Challenges, and Future Outlook

This comprehensive analysis examines the strategic importance, technical challenges, innovation pathways, and market landscape of domestic AI chips in China, highlighting key players, regional clusters, core applications such as intelligent computing, autonomous driving, and robotics, and projecting future industry bottlenecks and opportunities.

AI chips · China semiconductor · FP8

0 likes · 18 min read

BirdNest Tech Talk

Oct 14, 2025 · Artificial Intelligence

How DeepSeek’s Lightning Indexer Enables Efficient Sparse Attention for Long Texts

The article explains how DeepSeek’s Lightning Indexer acts as a memory‑filtering expert that computes index scores, selects the top‑k relevant tokens, and maps a compact formula to FP8 kernel code, cutting the number of tokens each query attends to from 128K to 2048 on very long sequences.

DeepSeek · FP8 · Lightning Indexer

0 likes · 7 min read

AntTech

Oct 9, 2025 · Artificial Intelligence

Ling-1T: The Trillion‑Parameter AI Model Redefining Efficient Reasoning

Ling-1T, a trillion‑parameter flagship non‑thinking model, combines 50 billion active parameters per token, 128 K context, Evo‑CoT reasoning, and FP8 mixed‑precision training to achieve state‑of‑the‑art performance on complex reasoning, code generation, and multimodal tasks while outlining its architecture, benchmarks, limitations, and future roadmap.

AI · FP8 · LLM

0 likes · 11 min read

Baidu Intelligent Cloud Tech Hub

Sep 4, 2025 · Artificial Intelligence

Unlocking MoE Model Power: Baidu’s Baige 5.0 AI Platform’s FP8 and Distributed Innovations

Baidu’s Baige 5.0 AI Computing Platform introduces FP8 mixed‑precision training, MoE‑aware distributed strategies, adaptive parallelism, and a three‑tier KV‑Cache, delivering over 30% training speedup and 50% inference throughput gains while keeping token latency under half a second for large‑scale models.

AI · FP8 · MoE

0 likes · 16 min read

Java Tech Enthusiast

Aug 27, 2025 · Fundamentals

Why Fixed‑Point vs Floating‑Point Matters: Inside the New FP8 Formats

This article explains how integers and decimals are stored using fixed‑point and floating‑point representations, introduces the 8‑bit FP8 formats (E4M3 and E5M2) used in modern AI hardware, and traces the evolution from classic FP32/FP64 to the latest ultra‑compact numeric types.

AI hardware · FP8 · Fixed-Point

0 likes · 8 min read

Architects' Tech Alliance

Aug 26, 2025 · Artificial Intelligence

How DeepSeek‑V3.1’s New FP8 Precision Supercharges Domestic Chip Performance

DeepSeek‑V3.1 introduces the UE8M0 FP8 Scale precision, cutting memory usage by up to 75% and enabling next‑generation Chinese chips such as Ascend 910B to run 128K context models efficiently, while the ecosystem rapidly adopts FP8, yet challenges in IP autonomy and software maturity remain before global competitiveness is achieved.

AI hardware · DeepSeek · FP8

0 likes · 10 min read

IT Services Circle

Aug 26, 2025 · Fundamentals

What Is UE8M0? Unpacking FP8 and Fixed‑Point Numbers Behind DeepSeek V3.1

This article explains the meaning of UE8M0 by introducing fixed‑point (INT8) and floating‑point representations, showing how integers and decimals are stored in binary, describing the limitations of fixed‑point, the advantages of floating‑point scientific notation, and detailing the emerging FP8 formats such as E4M3 and E5M2 used in modern AI hardware.

AI hardware · FP8 · Fixed-Point

0 likes · 8 min read

IT Services Circle

Aug 24, 2025 · Artificial Intelligence

What Is UE8M0 FP8 and Why It’s Boosting China’s Next‑Gen AI Chips

The article explains the UE8M0 FP8 precision format, its MXFP8 origins, how it reduces bandwidth and power consumption, and why Chinese AI chip makers like Cambricon, HaiGuang and Moore Threads are rapidly adopting it, signaling a shift toward domestic AI hardware independence.

AI hardware · Chinese chips · DeepSeek

0 likes · 10 min read

Tech Freedom Circle

Jul 17, 2025 · Artificial Intelligence

DeepSeek V3 Architecture Deep Dive: MoE, MLA, DualPipe, FP8 Mixed Precision & Multi‑Token Prediction

This article provides a detailed technical analysis of DeepSeek‑V3, covering its MoE architecture, the novel Multi‑head Latent Attention (MLA) mechanism, the DualPipe pipeline‑parallel algorithm, mixed‑precision FP8 training, and the Multi‑Token Prediction (MTP) inference improvements that together boost performance and efficiency.

DeepSeek · DualPipe · FP8

0 likes · 44 min read

360 Zhihui Cloud Developer

Apr 1, 2025 · Artificial Intelligence

DeepGEMM vs Cutlass vs Triton: Which GPU GEMM Library Delivers the Best FP8 Performance?

This article presents a comprehensive benchmark of DeepGEMM, Cutlass, and Triton on NVIDIA H20 and H800 GPUs, analyzing TFLOPS, bandwidth, latency, and speedup across various matrix sizes, and concludes which library is optimal for different workload scenarios.

CUDA · DeepGEMM · FP8

0 likes · 15 min read

Baobao Algorithm Notes

Mar 10, 2025 · Artificial Intelligence

Why DeepSeek V3’s FP8 Training Beats Traditional Schemes: A Deep Dive

This article provides a detailed technical analysis of FP8 training, comparing Nvidia’s TransformerEngine approach with DeepSeek V3’s novel scheme, and examines how block‑wise scaling, high‑precision accumulation, and vector length and correlation affect quantization error and signal‑to‑noise ratio in large‑language‑model training.

DeepSeek · FP8 · LLM

0 likes · 20 min read

IT Architects Alliance

Feb 26, 2025 · Artificial Intelligence

DeepSeek Large Model: Core Architecture, Key Technologies, and Training Strategies

The article provides an in‑depth overview of DeepSeek’s large language model, detailing its mixture‑of‑experts and Transformer foundations, novel attention mechanisms, load‑balancing, multi‑token prediction, FP8 mixed‑precision training, and various training regimes such as knowledge distillation and reinforcement learning.

DeepSeek · FP8 · Large Language Model

0 likes · 18 min read

Tencent Technical Engineering

Feb 26, 2025 · Artificial Intelligence

Engineers' Perspectives on DeepSeek: Technical Innovations and Implications

Thirteen engineers praise DeepSeek’s open‑source, reinforcement‑learning‑driven architecture—using FP8 storage and SFT‑free training—to deliver GPT‑4‑level reasoning at one‑twentieth the cost, enabling single‑GPU deployment, lowering barriers for academia and startups, and prompting notable market reactions that could democratize advanced AI.

AI cost reduction · DeepSeek · FP8

0 likes · 9 min read

DataFunTalk

Feb 26, 2025 · Artificial Intelligence

DeepGEMM: An Open‑Source FP8 GEMM Library for Efficient AI Model Training and Inference

DeepGEMM is an open‑source FP8‑precision GEMM library that delivers up to 1350 TFLOPS on NVIDIA Hopper GPUs, offering JIT‑compiled, lightweight code (~300 lines) for dense and MoE matrix multiplication, with easy deployment, configurable environment variables, and performance advantages over CUTLASS for large AI models.

AI acceleration · DeepGEMM · FP8

0 likes · 7 min read

DataFunSummit

Jan 24, 2025 · Artificial Intelligence

Challenges and Debugging Strategies for FP8 Training of Large Models

The article explains the performance benefits of using FP8 for large‑model training, outlines three main categories of FP8‑related issues such as loss spikes, divergence, and downstream metric gaps, and introduces a dedicated FP8 debug tool with metrics like MSE, cosine similarity, underflow, and overflow to help diagnose and resolve these problems.

AI · FP8 · NVIDIA

0 likes · 9 min read

Baobao Algorithm Notes

Jan 3, 2025 · Artificial Intelligence

How DeepSeek-V3 Achieves Massive Scale with FP8, MoE, and System Optimizations

The article examines DeepSeek‑V3’s architecture and training pipeline, highlighting its use of MLA and a highly granular MoE design, pioneering FP8 mixed‑precision training, fine‑grained per‑tile quantization, advanced parallelism strategies, and inference optimizations such as PD separation and NanoFlow to achieve unprecedented efficiency on limited GPU resources.

DeepSeek-V3 · FP8 · Inference Optimization

0 likes · 10 min read

Alibaba Cloud Infrastructure

Sep 13, 2023 · Artificial Intelligence

Pai‑Megatron‑Patch: Design Principles, Key Features, and End‑to‑End Usage for Large Language Model Training

This article introduces the open‑source Pai‑Megatron‑Patch tool from Alibaba Cloud, explains its non‑intrusive patch architecture, enumerates supported models and features such as weight conversion, Flash‑Attention 2.0, FP8 training with Transformer Engine, and provides detailed command‑line examples for model conversion, pre‑training, supervised fine‑tuning, inference, and RLHF reinforcement learning pipelines.

FP8 · LLM · Megatron

0 likes · 19 min read

Alibaba Cloud Big Data AI Platform

Sep 13, 2023 · Artificial Intelligence

How Pai‑Megatron‑Patch Accelerates Large Language Model Training on Alibaba Cloud

This article introduces Pai‑Megatron‑Patch, an open‑source tool from Alibaba Cloud that streamlines large language model (LLM) training, weight conversion, FP8 mixed‑precision acceleration, and reinforcement‑learning workflows, providing detailed architecture, key features, code examples, and step‑by‑step usage instructions.

FP8 · LLM training · Megatron

0 likes · 19 min read

Architects' Tech Alliance

Jul 4, 2022 · Industry Insights

Inside NVIDIA Hopper H100: Architecture, Performance, and AI Breakthroughs

The article provides a detailed technical analysis of NVIDIA's Hopper‑based H100 GPU, covering its 4 nm process, 80 billion transistors, GPC/TPC hierarchy, new FP8 Tensor Cores, Transformer Engine, Tensor Memory Accelerator, and the resulting six‑fold performance jump over the previous A100 generation.

AI acceleration · FP8 · GPU architecture

0 likes · 8 min read
