Tagged articles

FP8 training

9 articles

Feb 16, 2026 · Artificial Intelligence

Qwen3.5 Deep Dive: Multimodal Architecture, Benchmarks, and Deployment Guide

This article analyzes Qwen3.5 in detail, covering its multimodal MoE design, inference speedups, benchmark results against GPT‑5.2, Claude Opus 4.5, and Gemini 3 Pro, RL scaling strategies, training infrastructure innovations, and practical usage via API and local deployment.

FP8 training · Large Language Model · Multimodal AI

13 min read

AntTech

Sep 11, 2025 · Artificial Intelligence

Ling-mini-2.0: How a 16B MoE Model Achieves Dense-Level Performance with Only 1.4B Active Parameters

Ling-mini-2.0, an open-source 16B MoE language model that activates only 1.4B parameters, achieves dense-level performance with 7× efficiency, generates over 300 tokens/s, and introduces the first FP8 mixed-precision training suite, offering multiple pre-training checkpoints for the AI community.

Efficient Inference · FP8 training · MoE

6 min read

Java Web Project

Jun 4, 2025 · Artificial Intelligence

Why DeepSeek V3 Stands Out: Architecture, Performance, and Open‑Source Edge

The article analyzes DeepSeek's rapid adoption, detailing its seven core models, the third‑generation MoE architecture, FP8 mixed‑precision training, 128K context window, benchmark superiority on MMLU/HumanEval/CMMLU, low training cost, and fully open‑source release, while also introducing a companion guide for developers.

AI Architecture · DeepSeek · FP8 training

9 min read

Software Engineering 3.0 Era

May 15, 2025 · Artificial Intelligence

DeepSeek‑V3 Paper Reveals Breakthrough Hardware‑Software Co‑Design for AI Efficiency

DeepSeek‑V3 demonstrates that a tightly coupled hardware‑software design—featuring a memory‑saving MLA cache, a compute‑efficient DeepSeekMoE, a multi‑token prediction module, FP8 training, LogFMT compression, and an optimized eight‑plane fat‑tree network—can train a competitive LLM with only 2,048 H800 GPUs, cutting compute by up to 80% and boosting generation speed by 1.8×.

DeepSeek-V3 · FP8 training · LLM

12 min read

AI Algorithm Path

Apr 6, 2025 · Artificial Intelligence

Meta’s Open-Source Llama 4: 2‑Trillion‑Parameter Behemoth Redefines AI

Meta’s newly released Llama 4 models, Maverick with 402 billion total parameters and Scout with 109 billion, feature MoE designs with up to 128 experts, context windows of up to 10 million tokens, native multimodal fusion, and FP8 training, delivering benchmark‑leading performance that outpaces GPT‑4o, Gemini 2.0 Flash and DeepSeek v3, while being openly available on Hugging Face and GitHub.

FP8 training · Llama 4 · Meta AI

8 min read

Tencent Cloud Developer

Mar 5, 2025 · Artificial Intelligence

DeepSeek Series Overview: Core Technologies, Model Innovations, and Product Highlights

The article delivers a PPT‑style deep dive into the DeepSeek series—from the original LLM through DeepSeek‑MoE, Math, V2, V3 and R1—highlighting core innovations such as Multi‑Head Latent Attention, fine‑grained MoE, GRPO reinforcement learning, Multi‑Token Prediction, DualPipe parallelism and FP8 training that together achieve high performance at a fraction of traditional costs, and notes their integration into Tencent’s OlaChat intelligent assistant.

AI · DeepSeek · FP8 training

21 min read

Architect

Feb 16, 2025 · Artificial Intelligence

DeepSeek-V3, DeepSeek-R1, and Janus‑Pro: Architecture, Training Techniques, and Performance Insights

This article provides an in‑depth technical overview of DeepSeek‑V3, DeepSeek‑R1 and Janus‑Pro models, covering their Mixture‑of‑Experts architecture, novel MLA attention, auxiliary‑loss‑free load balancing, multi‑token prediction, FP8 mixed‑precision training, efficient cross‑node communication, reinforcement‑learning pipelines, multimodal modeling strategies, performance comparisons, cost statistics, and current limitations.

AI Architecture · DeepSeek-V3 · FP8 training

18 min read

IT Architects Alliance

Feb 15, 2025 · Artificial Intelligence

DeepSeek: Architecture, Core Technologies, Training Strategies, and Comparative Analysis

The article provides an in‑depth overview of DeepSeek's transformer‑based foundation, Mixture‑of‑Experts architecture, novel attention mechanisms, multi‑token prediction, FP8 mixed‑precision training, knowledge distillation, reinforcement‑learning approaches, and compares its performance and cost advantages against leading models such as GPT and Gemini.

AI model architecture · DeepSeek · FP8 training

29 min read

Alibaba Cloud Developer

Feb 7, 2025 · Artificial Intelligence

Why DeepSeek V3 Achieves Low Training Costs: Inside Its AI Innovations

This article provides a comprehensive analysis of DeepSeek's large‑language‑model technology, covering the company's background, model capabilities, remarkably low training and inference costs, and the core architectural and algorithmic innovations such as MoE, MLA attention, FP8 mixed‑precision, and the DualPipe pipeline that enable efficient large‑scale AI deployment.

AI Architecture · DeepSeek · FP8 training

19 min read
