Tag

Mixture of Experts

Articles collected around this technical thread.

Xiaohongshu Tech REDtech
Jun 6, 2025 · Artificial Intelligence

How dots.llm1 Sets New Benchmarks for Open‑Source MoE Language Models

dots.llm1, an open‑source 142‑billion‑parameter Mixture‑of‑Experts language model from hi lab, achieves Qwen2.5‑72B‑level performance after training on 11.2T high‑quality tokens, and the release includes full models, intermediate checkpoints, and detailed training pipelines for the research community.

AI research · Mixture of Experts · Training Efficiency
10 min read
IT Services Circle
May 25, 2025 · Artificial Intelligence

DeepSeek Core Technologies and Model Innovations: DeepSeek‑V3 and DeepSeek‑R1 Technical Overview

The article provides a detailed technical overview of DeepSeek's flagship large language models, DeepSeek‑V3 and DeepSeek‑R1, describing their MoE architecture, training frameworks, reinforcement‑learning‑based fine‑tuning, inference optimizations, and the broader impact of these innovations on the AI landscape, while also promoting related books and resources.

AI · DeepSeek · Mixture of Experts
10 min read
Architects' Tech Alliance
May 2, 2025 · Artificial Intelligence

DeepSeek‑Prover‑V2‑671B: A Massive AI Model for Formal Mathematical Theorem Proving

DeepSeek‑Prover‑V2‑671B, a 671‑billion‑parameter AI model released on Hugging Face, dramatically advances formal mathematical theorem proving, combining an MoE architecture, FP8 quantization, and a 163K‑token context, outperforming GPT‑4 Turbo and other models, with broad implications for research and industry.

AI · DeepSeek · FP8 Quantization
11 min read
DataFunTalk
Apr 6, 2025 · Artificial Intelligence

Meta Unveils Llama 4: New Multimodal AI Models with Mixture‑of‑Experts Architecture and 10 Million‑Token Context

Meta announced the Llama 4 series—Scout, Maverick, and Behemoth—featuring multimodal capabilities, a Mixture‑of‑Experts design, up to 10‑million‑token context windows, and state‑of‑the‑art performance on STEM, multilingual, and image benchmarks, with models now downloadable from llama.com and Hugging Face.

Llama 4 · Mixture of Experts · Model Training
14 min read
Baidu Geek Talk
Apr 2, 2025 · Artificial Intelligence

DeepSeek-VL2 Multimodal Model: Architecture, Training, and Code Walkthrough

DeepSeek‑VL2 is a state‑of‑the‑art multimodal model built on a Mixture‑of‑Experts architecture that combines a SigLIP‑L vision encoder with dynamic tiling, a two‑layer VL adaptor, and a DeepSeek‑MoE language model using Multi‑head Latent Attention, trained in three stages on diverse visual‑language and text data, and achieving strong results on benchmarks such as DocVQA and TextVQA, with full implementation and inference code available in PaddleMIX.
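
Where the summary mentions the two‑layer VL adaptor: in models of this kind, the adaptor is essentially a small MLP that projects vision‑encoder patch features into the language model's embedding space. A minimal sketch (dimensions, activation, and the class name are illustrative assumptions, not DeepSeek‑VL2's actual configuration):

```python
import torch
import torch.nn as nn

class VLAdaptor(nn.Module):
    """Two-layer MLP that projects vision-encoder patch features into
    the language model's embedding space (sizes are illustrative)."""
    def __init__(self, vision_dim=1152, lm_dim=2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_feats):          # (batch, patches, vision_dim)
        return self.proj(vision_feats)        # (batch, patches, lm_dim)

# Project SigLIP-style patch features into LM token embeddings.
tokens = VLAdaptor()(torch.randn(1, 576, 1152))   # -> (1, 576, 2048)
```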

Code · DeepSeek-VL2 · Inference
36 min read
AntTech
Mar 18, 2025 · Artificial Intelligence

MoLE: Decoding by Mixture of Layer Experts Alleviates Hallucination in Large Vision-Language Models

Researchers from Ant Insurance and Zhejiang University propose MoLE, a Mixture of Layer Experts decoding method that reduces hallucinations in large vision‑language models, demonstrating state‑of‑the‑art performance on LVLM benchmarks and enabling reliable end‑to‑end medical‑record‑to‑claim automation.

AI · Hallucination Mitigation · Insurance Automation
7 min read
Tencent Cloud Developer
Mar 5, 2025 · Artificial Intelligence

DeepSeek Series Overview: Core Technologies, Model Innovations, and Product Highlights

The article delivers a presentation‑style deep dive into the DeepSeek series—from the original LLM through DeepSeek‑MoE, Math, V2, V3, and R1—highlighting core innovations such as Multi‑Head Latent Attention, fine‑grained MoE, GRPO reinforcement learning, Multi‑Token Prediction, DualPipe parallelism, and FP8 training that together achieve high performance at a fraction of traditional costs, and notes their integration into Tencent's OlaChat intelligent assistant.

AI · DeepSeek · FP8 Training
21 min read
DataFunTalk
Feb 28, 2025 · Artificial Intelligence

DeepSeek LLM Series (V1‑V3) and R1: Architecture, Training Strategies, Evaluation, and Distillation

An in‑depth overview of the DeepSeek LLM series (V1‑V3) and the R1 models, covering their architectures, scaling‑law experiments, data pipelines, and training strategies (including MoE, MLA, FP8, multi‑step learning‑rate scheduling, and reinforcement learning), along with extensive evaluation results and knowledge‑distillation techniques.
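
Among the training strategies listed, multi‑step learning‑rate scheduling is easy to picture in code: a warmup followed by a flat LR with step decays at fixed fractions of training. A rough sketch (the milestone fractions and decay factors below are approximations of the kind reported for DeepSeek V1, not exact values):

```python
def multistep_lr(step, total_steps, max_lr=4.2e-4, warmup=2000,
                 milestones=(0.8, 0.9), factors=(0.316, 0.10)):
    """Warmup, then constant LR with step decays at fixed fractions of
    training (milestone/factor values here are approximate)."""
    if step < warmup:
        return max_lr * step / warmup
    lr = max_lr
    for m, f in zip(milestones, factors):
        if step / total_steps >= m:
            lr = max_lr * f
    return lr

schedule = [multistep_lr(s, 100_000) for s in range(0, 100_000, 10_000)]
```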

AI research · Mixture of Experts · Model Distillation
36 min read
Tencent Cloud Developer
Feb 27, 2025 · Artificial Intelligence

DeepSeek LLM Series (V1‑V3, R1) Technical Overview and Analysis

The DeepSeek technical overview details the evolution from the dense 67B V1 model through the 236B MoE‑based V2 and the 671B V3 with FP8 training, to the R1 series trained with reinforcement learning to reason without supervised fine‑tuning, highlighting innovations such as Grouped‑Query Attention, Multi‑Head Latent Attention, auxiliary‑loss‑free MoE load balancing, Multi‑Token Prediction, and knowledge distillation, and reporting state‑of‑the‑art benchmark results and open‑source reproduction projects.
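
Grouped‑Query Attention, mentioned above, cuts KV‑cache size by letting several query heads share one key/value head. A minimal single‑function sketch (shapes and the `gqa` name are illustrative; causal masking omitted):

```python
import torch
import torch.nn.functional as F

def gqa(q, k, v):
    """q: (B, Hq, T, D); k, v: (B, Hkv, T, D), Hq a multiple of Hkv.
    Each group of Hq // Hkv query heads shares one KV head."""
    group = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group, dim=1)    # expand KV heads per group
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

# 8 query heads share 2 KV heads -> 4x smaller KV cache.
out = gqa(torch.randn(1, 8, 16, 64),
          torch.randn(1, 2, 16, 64),
          torch.randn(1, 2, 16, 64))
```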

AI research · DeepSeek · Mixture of Experts
37 min read
IT Architects Alliance
Feb 26, 2025 · Artificial Intelligence

DeepSeek Large Model: Core Architecture, Key Technologies, and Training Strategies

The article provides an in‑depth overview of DeepSeek’s large language model, detailing its mixture‑of‑experts and Transformer foundations, novel attention mechanisms, load‑balancing, multi‑token prediction, FP8 mixed‑precision training, and various training regimes such as knowledge distillation and reinforcement learning.
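
The load‑balancing piece of an MoE router usually takes the form of an auxiliary loss. A sketch of the standard Switch‑Transformer‑style formulation (a generic illustration, not DeepSeek's exact loss):

```python
import torch

def load_balance_loss(router_probs, top1_expert, num_experts):
    """Switch-Transformer-style auxiliary loss: product of the fraction
    of tokens dispatched to each expert and the router's mean probability
    for that expert, summed and scaled by num_experts."""
    frac = torch.bincount(top1_expert, minlength=num_experts).float()
    frac = frac / top1_expert.numel()            # tokens per expert (fraction)
    prob = router_probs.mean(dim=0)              # mean router prob per expert
    return num_experts * torch.sum(frac * prob)

probs = torch.softmax(torch.randn(32, 8), dim=-1)
loss = load_balance_loss(probs, probs.argmax(dim=-1), num_experts=8)
```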

DeepSeek · FP8 · MLA
18 min read
Architecture Digest
Feb 24, 2025 · Artificial Intelligence

MoBA: Mixture of Block Attention for Long‑Context Large Language Models

The article introduces MoBA, a Mixture‑of‑Block‑Attention mechanism that applies Mixture‑of‑Experts principles to transformer attention, enabling efficient long‑context processing for large language models while maintaining performance comparable to full attention through sparse, trainable block selection and seamless switching between full and sparse attention modes.
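
The core idea is compact enough to sketch: keys are partitioned into blocks, each query scores blocks via their mean‑pooled keys, and attention runs only over the top‑k selected blocks. A toy single‑head version (causality and batching omitted; all names are illustrative):

```python
import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block_size=4, top_k=2):
    """Toy single-head MoBA-style attention: split keys into blocks,
    score blocks by mean-pooled keys, attend over top-k blocks only."""
    T, D = q.shape
    n_blocks = T // block_size
    kb = k[:n_blocks * block_size].view(n_blocks, block_size, D)
    vb = v[:n_blocks * block_size].view(n_blocks, block_size, D)
    gate = q @ kb.mean(dim=1).T                 # (T, n_blocks) block scores
    sel = gate.topk(top_k, dim=-1).indices      # chosen blocks per query
    out = torch.zeros_like(q)
    for t in range(T):
        ks = kb[sel[t]].reshape(-1, D)          # keys of selected blocks
        vs = vb[sel[t]].reshape(-1, D)
        attn = F.softmax(q[t] @ ks.T / D ** 0.5, dim=-1)
        out[t] = attn @ vs
    return out

out = moba_attention(torch.randn(16, 32), torch.randn(16, 32), torch.randn(16, 32))
```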

LLM · Mixture of Experts · MoBA
12 min read
Architect
Feb 21, 2025 · Artificial Intelligence

DeepSeek Model Innovations: Architecture, Training Methods, and Performance Evaluation

This article reviews DeepSeek's recent breakthroughs, including the MLA attention redesign, GRPO alignment algorithm, MoE enhancements, multi‑stage training pipelines (SFT, RL, preference tuning, distillation), and comparative performance against GPT‑4o‑Mini and Llama 3.1, highlighting both strengths and remaining challenges.
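
The GRPO alignment algorithm mentioned here replaces a learned critic with group‑relative advantages: each response's reward is normalized against the mean and standard deviation of its sampling group. A minimal sketch of that advantage computation:

```python
import torch

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each response's reward by the
    mean/std of its own group of G samples, so no value network is needed."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, G = 4 sampled responses with scalar rewards.
adv = grpo_advantages(torch.tensor([[1.0, 0.0, 0.5, 1.0]]))
```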

DeepSeek · Mixture of Experts · architecture
16 min read
IT Architects Alliance
Feb 15, 2025 · Artificial Intelligence

DeepSeek: Architecture, Core Technologies, Training Strategies, and Comparative Analysis

The article provides an in‑depth overview of DeepSeek's transformer‑based foundation, Mixture‑of‑Experts architecture, novel attention mechanisms, multi‑token prediction, FP8 mixed‑precision training, knowledge distillation, reinforcement‑learning approaches, and compares its performance and cost advantages against leading models such as GPT and Gemini.

AI Model Architecture · DeepSeek · FP8 Training
29 min read
Tencent Technical Engineering
Feb 14, 2025 · Artificial Intelligence

Technical Overview of DeepSeek Series Models and Innovations

The DeepSeek series introduces a refined Mixture‑of‑Experts architecture with fine‑grained expert partitioning, shared experts, and learnable load‑balancing, alongside innovations such as Group Relative Policy Optimization, Multi‑Head Latent Attention, Multi‑Token Prediction, mixed‑precision FP8 training, and the R1/R1‑Zero models that use Long‑CoT reasoning, reinforcement‑learning pipelines, and distillation to achieve OpenAI‑comparable performance at lower cost.
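
Fine‑grained partitioning plus shared experts can be sketched as a layer where every token passes through the always‑on shared expert(s) and additionally through its top‑k routed experts. A naive, illustrative implementation (sizes, k, and the per‑token dispatch loop are for clarity, not efficiency):

```python
import torch
import torch.nn as nn

class MoEWithSharedExperts(nn.Module):
    """MoE layer sketch: every token goes through the shared expert(s)
    plus its top-k routed fine-grained experts (all sizes illustrative)."""
    def __init__(self, dim=64, n_routed=8, n_shared=1, k=2):
        super().__init__()
        make_ffn = lambda: nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.routed = nn.ModuleList(make_ffn() for _ in range(n_routed))
        self.shared = nn.ModuleList(make_ffn() for _ in range(n_shared))
        self.gate = nn.Linear(dim, n_routed, bias=False)
        self.k = k

    def forward(self, x):                        # x: (tokens, dim)
        out = sum(e(x) for e in self.shared)     # shared experts: all tokens
        weights = torch.softmax(self.gate(x), dim=-1)
        top_w, top_i = weights.topk(self.k, dim=-1)
        for t in range(x.shape[0]):              # naive per-token dispatch
            for w, i in zip(top_w[t], top_i[t]):
                out[t] = out[t] + w * self.routed[int(i)](x[t])
        return out

y = MoEWithSharedExperts()(torch.randn(4, 64))   # -> (4, 64)
```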

AI · DeepSeek · Mixture of Experts
25 min read
Architect
Feb 10, 2025 · Artificial Intelligence

Evolution of DeepSeek Mixture‑of‑Experts (MoE) Architecture from V1 to V3

This article reviews the development of DeepSeek's Mixture-of-Experts (MoE) models, tracing their evolution from the original DeepSeekMoE V1 through V2 to V3, detailing architectural innovations such as fine‑grained expert segmentation, shared‑expert isolation, load‑balancing losses, device‑limited routing, and the shift from softmax to sigmoid gating.
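
The softmax‑to‑sigmoid gating shift is small in code but changes the semantics: softmax makes experts compete for probability mass, while sigmoid scores each expert independently and normalizes only over the selected top‑k. A hedged sketch of the two variants side by side:

```python
import torch

def gate_scores(logits, k=2, use_sigmoid=True):
    """Router scoring sketch: softmax-style (earlier versions) vs.
    sigmoid-style (V3), where sigmoid affinities are normalized over
    only the selected top-k experts."""
    if use_sigmoid:
        s = torch.sigmoid(logits)                    # independent affinities
        top_v, top_i = s.topk(k, dim=-1)
        weights = top_v / top_v.sum(dim=-1, keepdim=True)
    else:
        p = torch.softmax(logits, dim=-1)            # competing probabilities
        weights, top_i = p.topk(k, dim=-1)
    return weights, top_i

w, idx = gate_scores(torch.randn(3, 8))              # 3 tokens, 8 experts
```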

DeepSeek · LLM · Mixture of Experts
21 min read
Tencent Cloud Developer
Feb 6, 2025 · Artificial Intelligence

DeepSeek V Series: Technical Overview of Scaling Laws, Grouped Query Attention, and Mixture‑of‑Experts

The article reviews DeepSeek’s V‑series papers, explaining how scaling‑law insights, Grouped Query Attention, a depth‑first design, loss‑free load balancing, multi‑token prediction and Multi‑Head Latent Attention together enable economical mixture‑of‑experts LLMs that rival closed‑source models while cutting compute and hardware costs.
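
Loss‑free load balancing, as described for DeepSeek‑V3, steers routing with a per‑expert bias that affects only top‑k selection: after each step the bias is nudged down for overloaded experts and up for underloaded ones. A sketch of that update rule (the step size `gamma` is an illustrative assumption):

```python
import torch

def update_balance_bias(bias, expert_load, gamma=1e-3):
    """Auxiliary-loss-free balancing sketch: nudge each expert's routing
    bias down if it received more than the average load this step, up if
    it received less. The bias is added to scores only for top-k
    selection, never to the final gating weights."""
    mean_load = expert_load.float().mean()
    return bias - gamma * torch.sign(expert_load.float() - mean_load)

bias = torch.zeros(8)
load = torch.tensor([10, 2, 7, 5, 9, 1, 4, 10])      # tokens per expert
bias = update_balance_bias(bias, load)
```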

DeepSeek · Grouped Query Attention · Mixture of Experts
13 min read
Tencent Cloud Developer
Nov 6, 2024 · Artificial Intelligence

Overview of Tencent Hunyuan Large and 3D Generation Model Open‑Source Release

Tencent has open‑sourced its 389‑billion‑parameter Hunyuan Large Mixture‑of‑Experts model—featuring 52B active parameters, a 256K‑token context, novel routing, KV‑cache compression, and advanced training optimizations that beat leading open‑source models—and its first text‑to‑3D/image‑to‑3D Hunyuan 3D Generation model, both downloadable via GitHub, Hugging Face, and Tencent Cloud.

3D Generation · AI research · Mixture of Experts
9 min read
360 Smart Cloud
Jul 4, 2024 · Artificial Intelligence

Optimizing Mixture-of-Experts (MoE) Training with the QLM Framework

This article introduces the background and challenges of large language model training, explains the Mixture-of-Experts (MoE) architecture, and details several optimization techniques implemented in the QLM framework (fine-grained and shared experts, top‑k gating, token distribution, expert parallelism, and grouped GEMM) to improve training efficiency and performance.
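
Token distribution and grouped GEMM fit together: if tokens are sorted by their assigned expert, every expert's inputs become contiguous, so all expert FFNs can run as one batched/grouped matmul instead of many small ones. A minimal dispatch sketch (function name and return layout are illustrative):

```python
import torch

def sort_tokens_by_expert(x, expert_ids, num_experts):
    """Dispatch sketch: group tokens by assigned expert so each expert's
    inputs are contiguous and all expert FFNs can run as one grouped
    GEMM rather than many small matmuls."""
    order = torch.argsort(expert_ids)                # token permutation
    counts = torch.bincount(expert_ids, minlength=num_experts)
    return x[order], counts, order                   # 'order' un-permutes later

x = torch.randn(6, 16)
ids = torch.tensor([2, 0, 1, 2, 0, 1])
xs, counts, order = sort_tokens_by_expert(x, ids, num_experts=4)
```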

AI · Distributed Training · Mixture of Experts
10 min read
DeWu Technology
May 15, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference: Techniques and Framework Recommendations

Deploying a dedicated inference cluster and applying four key optimizations—FlashAttention‑based attention computation, PagedAttention KV‑cache management, Mixture‑of‑Experts parameter reduction, and tensor parallelism—can accelerate large language model inference by up to 50% for models as large as 70B parameters while cutting deployment costs.
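
PagedAttention's KV‑cache management is worth a sketch: the cache lives in fixed‑size physical blocks, and each sequence keeps a block table from logical positions to physical blocks, so memory is allocated on demand rather than reserved for the maximum context. A toy illustration of the bookkeeping (all names and sizes are assumptions):

```python
BLOCK_SIZE = 16                        # KV entries per physical block
free_blocks = list(range(1024))        # pool of physical block ids
block_tables = {}                      # seq_id -> list of physical blocks

def kv_slot(seq_id, pos):
    """Return (physical_block, offset) for token `pos` of a sequence,
    allocating a new physical block only at block boundaries."""
    table = block_tables.setdefault(seq_id, [])
    if pos % BLOCK_SIZE == 0 and pos // BLOCK_SIZE == len(table):
        table.append(free_blocks.pop())
    return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

for pos in range(17):                  # append 17 tokens to one sequence
    blk, off = kv_slot("seq-A", pos)   # second block allocated at pos 16
```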

FlashAttention · Inference Acceleration · Mixture of Experts
17 min read
Rare Earth Juejin Tech Community
Mar 20, 2024 · Artificial Intelligence

Elon Musk’s xAI Open‑Sources Grok‑1: A 314‑Billion‑Parameter MoE Large Language Model

Elon Musk’s xAI has open‑sourced Grok‑1, a 314‑billion‑parameter mixture‑of‑experts language model built with Rust and JAX, released under an Apache‑2.0 license, and the announcement includes detailed architecture specs, hardware requirements, and the broader context of Musk’s rivalry with OpenAI.

AI · Grok-1 · Mixture of Experts
6 min read