Tagged articles

FP8 quantization

6 articles · Page 1 of 1

Jun 11, 2026 · Artificial Intelligence

Google’s 26B DiffusionGemma Model Delivers 1000+ Tokens/s – Runs on a 4090

DiffusionGemma, Google DeepMind’s 26B MoE model that generates 256‑token blocks via diffusion, achieves over 1000 tokens per second on H100/H200 GPUs, offers FP8 and NVFP4 quantized versions with near‑lossless accuracy, and can be deployed locally with vLLM Docker images, though it incurs higher first‑token latency and limited concurrency.

26B modelDiffusionGemmaFP8 quantization

0 likes · 10 min read

Google’s 26B DiffusionGemma Model Delivers 1000+ Tokens/s – Runs on a 4090

Old Zhang's AI Learning

Apr 25, 2026 · Artificial Intelligence

Deploying DeepSeek‑V4‑Flash Locally on 2 × NVIDIA H20 (96 GB) – Quick Performance Test

This article walks through deploying DeepSeek‑V4‑Flash on a server with two NVIDIA H20 GPUs (96 GB each), detailing model download, Docker image preparation, launch script tweaks, memory compression via FP8 and expert parallelism, and reports observed concurrency limits and token‑per‑second speeds, including a test that disables the model's thinking mode.

DeepSeek-V4DockerFP8 quantization

0 likes · 6 min read

Deploying DeepSeek‑V4‑Flash Locally on 2 × NVIDIA H20 (96 GB) – Quick Performance Test

Old Zhang's AI Learning

Apr 23, 2026 · Artificial Intelligence

DeepSeek Quietly Open‑Sources TileKernels to Push GPU Performance to Its Limits

DeepSeek has released TileKernels, a GPU kernel library written in the TileLang DSL, that targets H100/H200/B200 GPUs and claims to approach hardware limits in compute intensity and memory bandwidth, offering MoE routing, FP8/FP4 quantization, and dual‑language PyTorch references for deep‑learning engineers.

FP8 quantizationGPU OptimizationLLM training

0 likes · 9 min read

DeepSeek Quietly Open‑Sources TileKernels to Push GPU Performance to Its Limits

BirdNest Tech Talk

Oct 15, 2025 · Artificial Intelligence

How DeepSeek‑V3.2‑Exp Achieves Fast Distributed LLM Inference with FP8 and MoE

This article walks through the DeepSeek‑V3.2‑Exp inference codebase, detailing its MoE architecture, Multi‑Head Latent Attention, FP8 quantization, custom CUDA kernels, and 8‑GPU NCCL‑based distributed execution from initialization through prefill and decode stages.

CUDADistributed InferenceFP8 quantization

0 likes · 9 min read

How DeepSeek‑V3.2‑Exp Achieves Fast Distributed LLM Inference with FP8 and MoE

Architects' Tech Alliance

May 2, 2025 · Artificial Intelligence

DeepSeek‑Prover‑V2‑671B: A Massive AI Model for Formal Mathematical Theorem Proving

DeepSeek‑Prover‑V2‑671B, a 671 billion‑parameter AI model released on Hugging Face, dramatically advances formal mathematical theorem proving with MoE architecture, FP8 quantization, 163 k token context, superior performance over GPT‑4 Turbo and other models, and broad implications for research and industry.

AIDeepSeekFP8 quantization

0 likes · 11 min read

DeepSeek‑Prover‑V2‑671B: A Massive AI Model for Formal Mathematical Theorem Proving

Alibaba Cloud Infrastructure

Feb 12, 2025 · Artificial Intelligence

Deploying DeepSeek‑R1 Distilled Qwen‑32B‑FP8 Model on Alibaba Cloud GPU Instances with Docker and OpenWebUI

This guide explains how to prepare an Alibaba Cloud GPU instance, install Docker and NVIDIA tools, pull or build a container image, and run the FP8‑quantized DeepSeek‑R1‑Distill‑Qwen‑32B model using vLLM and OpenWebUI for both offline and online inference.

DeepSeekFP8 quantizationGPU

0 likes · 18 min read

Deploying DeepSeek‑R1 Distilled Qwen‑32B‑FP8 Model on Alibaba Cloud GPU Instances with Docker and OpenWebUI