Tagged articles
126 articles
Machine Heart
May 20, 2026 · Artificial Intelligence

Can Tabular Anomaly Detection Move Beyond One‑for‑One? OFA‑TAD Introduces a One‑for‑All Paradigm

Tabular anomaly detection traditionally requires training a separate model for each dataset (one‑for‑one), but the new OFA‑TAD framework trains once on multiple source tables and directly transfers to unseen target tables without fine‑tuning, leveraging multi‑view distance encoding, MoE fusion, and synthetic pseudo‑anomalies to achieve state‑of‑the‑art performance across 34 datasets in 14 domains.

Mixture of Experts · OFA-TAD · multi-view distance
0 likes · 10 min read
Data Party THU
May 17, 2026 · Artificial Intelligence

How DeepSeek Leverages MoE Parallelism: GPU Compute and Communication Optimizations

The article dissects DeepSeek's MoE model‑parallel strategy, explaining how GPU compute and communication are overlapped through expert, pipeline, and ZeRO‑1 parallelism, and introduces DualPipe and Waved‑EP kernels that enable efficient training on large‑scale hardware.

DeepSeek · GPU Communication Overlap · Mixture of Experts
0 likes · 18 min read
PaperAgent
May 13, 2026 · Artificial Intelligence

One-for-All Multi-Agent Collaboration: Adaptive Cross-Task Topology Design

The paper introduces OFA-MAS, a one‑for‑all multi‑agent system that learns a universal topology designer using task‑aware graph encoding and a Mixture‑of‑Experts generator, achieving superior performance, OOD generalization, robustness, and efficiency across six major benchmarks.

LLM · Mixture of Experts · Task-Aware Graph Encoder
0 likes · 14 min read
Lao Guo's Learning Space
May 12, 2026 · Artificial Intelligence

Demystifying the Core Technologies Behind ChatGPT, GPT‑4, and DeepSeek

This article breaks down the key algorithms that power large‑language models—Transformer, Mixture‑of‑Experts, Flash Attention, KV‑Cache, Multi‑Token Prediction, quantization, Chain‑of‑Thought and Retrieval‑Augmented Generation—explaining how each contributes to the performance of ChatGPT, GPT‑4 and DeepSeek.

Flash Attention · KV cache · Mixture of Experts
0 likes · 10 min read
Old Zhang's AI Learning
May 11, 2026 · Artificial Intelligence

Open‑Source Qwen3.6‑35B‑A3B Runs at 162 tok/s on a Single RTX 5090

The article introduces the open‑source Qwen3.6‑35B‑A3B model, explains its MoE architecture and three‑stage LoRA fine‑tuning, reports benchmark results in which it reaches 161.9 tok/s on an RTX 5090 (2.6× faster than a dense 27B counterpart), and discusses deployment tips, the quantized GGUF release, and known compatibility pitfalls.

GGUF quantization · LoRA fine-tuning · Mixture of Experts
0 likes · 7 min read
Old Zhang's AI Learning
May 7, 2026 · Artificial Intelligence

How Unsloth and NVIDIA Boost Consumer‑GPU LLM Training by ~25% with Three Simple Optimizations

Unsloth and NVIDIA identified three low‑level bottlenecks in LLM fine‑tuning on consumer GPUs—repeated packed‑sequence metadata construction, serialized copy‑and‑compute during gradient checkpointing, and per‑expert routing overhead in MoE—and applied targeted patches that together deliver roughly a 25% speedup without changing hardware, code, or frameworks.

GPU Optimization · LLM training · Mixture of Experts
0 likes · 12 min read
AI Engineer Programming
May 7, 2026 · Artificial Intelligence

How Cursor Turned Its Coding Agent from Demo to Production

The article examines Cursor's journey of shipping its Composer coding agent, detailing the agentic AI model, system architecture, and the three major production challenges—diff handling, latency accumulation, and sandbox scaling—along with the engineering solutions that enabled reliable, fast, and adoptable AI‑driven code generation.

Agentic AI · Coding Agent · Cursor
0 likes · 16 min read
Machine Heart
May 4, 2026 · Artificial Intelligence

Mega MoE vs SonicMoE: Which Will Lead the Next AI Speed Race?

SonicMoE, a new ultra‑fast Mixture‑of‑Experts model from Tri Dao and Ion Stoica’s team, achieves peak throughput on Nvidia Blackwell GPUs, outperforms DeepSeek’s DeepGEMM, and introduces algorithmic redesigns that decouple activation memory from expert granularity while fusing I/O‑aware kernels for up to double the speed of existing MoE frameworks.

AI Performance · Blackwell · GPU Acceleration
0 likes · 12 min read
Machine Learning Algorithms & Natural Language Processing
May 3, 2026 · Artificial Intelligence

Running a 400B Mixture‑of‑Experts LLM on iPhone 17 Pro: Inside Flash‑MoE

The article details how the open‑source Flash‑MoE engine streams a 400‑billion‑parameter Mixture‑of‑Experts language model on an iPhone 17 Pro, achieving interactive‑level token throughput by eliminating Python dependencies, crafting a custom Metal pipeline, and streaming weights directly from SSD.

Apple Silicon · Flash-MoE · GCD
0 likes · 7 min read
Data Party THU
May 2, 2026 · Artificial Intelligence

Training an 11.5 B‑parameter Universal Interatomic Potential in Hours on Exascale Supercomputers

A Chinese Academy of Sciences team introduced the MatRIS‑MoE model and the Janus training framework, enabling an 11.5‑billion‑parameter universal machine‑learning interatomic potential to be trained on two exascale systems at 1.2 EFLOPS, compressing weeks‑long training into a few hours.

AI for Science · Exascale training · High‑performance computing
0 likes · 8 min read
Machine Heart
May 1, 2026 · Artificial Intelligence

How a 400B Mixture‑of‑Experts Model Runs on the iPhone 17 Pro

The article details the Flash‑MoE project that streams the 400 billion‑parameter Qwen3.5‑397B‑A17B mixture‑of‑experts model on an iPhone 17 Pro, achieving up to 0.6 tokens per second with a custom Metal‑GPU pipeline, zero‑Python code, and SSD‑backed weight streaming that keeps only 5.5 GB in RAM.

Flash-MoE · LLM · Metal
0 likes · 7 min read
Machine Heart
Apr 30, 2026 · Artificial Intelligence

Beyond DeepSeek V4: A Trillion‑Parameter LLM Trained End‑to‑End on Domestic Chips

The article analyzes how DeepSeek V4 and Meituan's LongCat‑2.0‑P preview, each with trillion‑scale parameters and a 1 M‑token context, were trained and run for inference entirely on Chinese‑made accelerators, detailing the memory optimizations, deterministic operators, MoE redesigns, and massive multi‑card clusters that show domestic compute can meet top‑tier AI workloads.

Deterministic Ops · Domestic AI Chip · LongCat
0 likes · 13 min read
CodeTrend
Apr 26, 2026 · Artificial Intelligence

DeepSeek V4 Architecture: High‑Efficiency Long‑Context Model Design

DeepSeek V4, released in April 2026, introduces two versions—Pro and Flash—with up to 1.6 trillion parameters and a million‑token context window, leveraging hybrid attention, compressed KV cache, and specialized training techniques to dramatically cut hardware dependence and inference cost.

DeepSeek · FP4 · Mixture of Experts
0 likes · 5 min read
AI2ML AI to Machine Learning
Apr 25, 2026 · Artificial Intelligence

How DeepSeek V4 Advances Structured Optimization in the Large‑Model Era

The article analyses DeepSeek V4’s architectural innovations—including Compressed Sparse Attention, Heavily Compressed Attention, a cross‑layer MoE design, and an Agent‑RL framework with Generative Reward Models and multi‑teacher distillation—while comparing its long‑context capabilities and efficiency to rival LLMs such as GLM, Kimi, Claude, GPT and Gemini.

Agent Reinforcement Learning · Compressed Sparse Attention · DeepSeek-V4
0 likes · 7 min read
Architect's Tech Stack
Apr 25, 2026 · Artificial Intelligence

DeepSeek‑V4 Launch: 1.6 T Parameters, 1 M‑Token Context, Programming Skills Lead Open‑Source Rankings

DeepSeek released the V4 series—V4‑Pro (1.6 T total, 49 B active) and V4‑Flash (284 B total, 13 B active)—featuring three architectural upgrades, three inference modes, mixed‑precision FP4/FP8 weights, and benchmark results that place its programming ability at the top of open‑source models while supporting a million‑token context window.

AI Architecture · Benchmark · DeepSeek
0 likes · 5 min read
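To put the parameter counts quoted in this summary in perspective, the short sketch below computes what share of each model's weights is active per token. The figures are the ones given in the summary; the helper name `active_fraction` and the dictionary layout are illustrative assumptions, not part of any DeepSeek release.

```python
# Total and active parameter counts as quoted in the summary above.
MODELS = {
    "V4-Pro": {"total": 1.6e12, "active": 49e9},
    "V4-Flash": {"total": 284e9, "active": 13e9},
}


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active / total


fractions = {
    name: active_fraction(spec["total"], spec["active"])
    for name, spec in MODELS.items()
}

for name, frac in fractions.items():
    # Roughly 3.1% for V4-Pro and 4.6% for V4-Flash.
    print(f"{name}: {frac:.1%} of parameters active per token")
```

The small active share is what lets a sparse model carry a very large total parameter count while keeping per-token compute closer to that of a much smaller dense model.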
ArcThink
Apr 25, 2026 · Artificial Intelligence

DeepSeek V4’s Silent Launch: 1.6 T Parameters, Triple Innovation, and Redefined Accessibility

DeepSeek V4 quietly debuted with a 1.6‑trillion‑parameter MoE model, introducing CSA+HCA compressed attention, mHC manifold‑constrained hyperconnections, and the Muon optimizer, achieving 1M‑token context at a quarter of V3’s cost, top Codeforces and LiveCodeBench scores, pricing at one‑seventh of Opus, MIT open‑source licensing, and dual‑stack Ascend NPU/NVIDIA GPU support.

Benchmark · DeepSeek-V4 · Manifold-constrained Hyperconnection
0 likes · 17 min read
SuanNi
Apr 21, 2026 · Artificial Intelligence

How Qwen3.6‑35B‑A3B Matches Dense Models with Only 3 B Active Parameters

The article analyzes Qwen3.6‑35B‑A3B’s MoE architecture, showing how its 3 B active parameters outperform larger dense models across programming, agent, and multimodal benchmarks, and examines the flagship Qwen3.6‑Max‑Preview’s substantial gains in world knowledge, instruction following, and third‑party rankings.

AI Evaluation · Benchmark · Mixture of Experts
0 likes · 5 min read
Machine Learning Algorithms & Natural Language Processing
Apr 21, 2026 · Artificial Intelligence

How a 22‑Year‑Old Reverse‑Engineered Mythos into OpenMythos Using MoE and DeepSeek‑Inspired Attention

OpenMythos re‑creates the Claude Mythos architecture as a Recurrent‑Depth Transformer with MoE routing, achieving comparable performance to larger Transformers while using roughly half the parameters, and demonstrates systematic generalization and depth extrapolation through looped inference in latent space.

AI Architecture · Looped Language Models · Mixture of Experts
0 likes · 6 min read
AI Large-Model Wave and Transformation Guide
Apr 18, 2026 · Artificial Intelligence

Does Qwen3.6‑35B‑A3B Really Outclass All AI Coding Models? Inside the Benchmark Breakdown

Qwen3.6‑35B‑A3B, a mixture‑of‑experts model that activates only 3 B parameters, outperforms leading AI systems across SWE‑bench, Terminal‑Bench, NL2Repo and several agentic coding benchmarks, while also achieving top scores in GPQA, HMMT and RealWorldQA, prompting a reassessment of domestic LLM capabilities.

AI Coding · Agentic Coding · Benchmark
0 likes · 7 min read
Machine Heart
Apr 17, 2026 · Artificial Intelligence

DeepSeek Introduces Mega MoE and FP4 Indexer – Inside the New GPU Fusion Kernel

DeepSeek's latest DeepGEMM update adds Mega MoE, a fused GPU kernel that collapses the entire Mixture‑of‑Experts pipeline and overlaps computation with NVLink communication, while also unveiling an FP4 indexer and FP8×FP4 precision experiments, signaling a push toward highly efficient large‑scale AI training.

DeepGEMM · DeepSeek · FP4 Indexer
0 likes · 5 min read
Machine Heart
Mar 31, 2026 · Artificial Intelligence

ProMoE: Explicit Routing Breaks the Scaling Bottleneck of Diffusion‑Transformer MoE (ICLR 2026)

ProMoE introduces a two‑step routing MoE framework with explicit semantic guidance that tackles the high spatial redundancy and functional heterogeneity of visual tokens, enabling diffusion transformers to scale efficiently and outperform dense models and prior MoE approaches across generation, convergence, and scaling benchmarks.

Diffusion Transformer · Explicit Routing · Mixture of Experts
0 likes · 9 min read
AIWalker
Mar 23, 2026 · Artificial Intelligence

Dynamic Dense Computing and Minimal End‑to‑End Design: YOLO-Master & YOLO26

By introducing a dynamic mixture‑of‑experts routing scheme and an end‑to‑end architecture that eliminates NMS and DFL, YOLO‑Master and YOLO26 dramatically cut compute waste and latency on edge devices, achieving up to 43% faster CPU inference while keeping model accuracy, with all code openly released.

Computer Vision · Mixture of Experts · Model Optimization
0 likes · 7 min read
AIWalker
Mar 7, 2026 · Artificial Intelligence

YOLO-Master v2026.02 Unveils Four Innovations for SOTA Object Detection

Tencent’s YOLO-Master v2026.02 adds a Mixture‑of‑Experts architecture, zero‑overhead LoRA fine‑tuning, Sparse SAHI inference for large images, and Cluster‑Weighted NMS, delivering 3‑5× faster inference, up to 70% reduced training resources, and markedly higher detection accuracy across diverse benchmarks.

Computer Vision · LoRA · Mixture of Experts
0 likes · 15 min read
Machine Learning Algorithms & Natural Language Processing
Feb 23, 2026 · Artificial Intelligence

System Engineering Behind Billions of Parameters: Insider Training Details from Seven Top AI Labs

This article systematically dissects the engineering decisions behind frontier large‑language‑model training—covering architecture choices, attention variants, optimizer evolution, data‑curation strategies, scaling‑law insights, and post‑training SFT/RL pipelines—based on open‑source reports from seven leading AI laboratories.

Mixture of Experts · Model Training · large language models
0 likes · 26 min read
AI Engineering
Feb 12, 2026 · Artificial Intelligence

MiniMax M2.5: 230B‑Parameter Model Activates 10B, Near Claude Opus for One‑Tenth the Cost

MiniMax’s new open‑source M2.5 model, built on a 230 billion‑parameter mixture‑of‑experts architecture that activates only 10 billion parameters per inference, delivers performance comparable to Claude Opus 4.6 across benchmarks, while costing roughly one‑tenth as much, and is already handling a large share of the company’s internal tasks.

AI Agents · Claude Opus · MiniMax M2.5
0 likes · 6 min read
Machine Learning Algorithms & Natural Language Processing
Feb 10, 2026 · Artificial Intelligence

Inside GLM-5: 745B Parameters, DeepSeek‑style Sparse Attention, and a 60% Stock Surge

The GLM-5 architecture, uncovered from a GitHub PR, doubles the previous model to 745 B parameters, adopts DeepSeek‑V3 sparse attention and multi‑token prediction, features a 78‑layer MoE with 256 experts, supports a 202K‑token context window, and its rumored test model "Pony Alpha" sparked a 60% rise in Zhipu AI's stock amid a crowded AI release season.

AI Stock Impact · DeepSeek · GLM-5
0 likes · 6 min read
AI2ML AI to Machine Learning
Feb 7, 2026 · Artificial Intelligence

Why the ‘Skills’ Approach Is the Third Major Compromise Shaping Enterprise AI in 2026

The article argues that embracing the Skills paradigm, a lightweight, low‑cost alternative to large‑scale model training, represents the third major compromise in the large‑model era, balancing reduced emergence and planning hallucinations against increased stability and engineering efficiency for enterprise AI deployments.

Agentic AI · Enterprise AI · Mixture of Experts
0 likes · 8 min read
PaperAgent
Jan 22, 2026 · Artificial Intelligence

How STEM Replaces MoE Routing with Simple Table Lookup for Faster Transformers

The article presents STEM, a method that transforms dense and MoE transformer architectures by converting the expert routing step into a static table‑lookup operation, achieving higher parameter efficiency, lower communication overhead, and improved interpretability while maintaining or boosting downstream task performance.

Embedding Lookup · Interpretability · Mixture of Experts
0 likes · 6 min read
Programmer's Advance
Jan 21, 2026 · Artificial Intelligence

Why GLM‑4.7‑Flash Delivers 70B‑Level Performance with Only 30B Parameters

GLM‑4.7‑Flash, released by Zhipu AI on Jan 20, 2026, uses a Mixture‑of‑Experts (MoE) backbone and a Multi‑head Latent Attention (MLA) mechanism to achieve near‑70B model quality with just 30 B total and 3 B active parameters, running on a single 24 GB GPU or even a Mac, while remaining fully open‑source and free to use.

AI model benchmark · GLM-4.7-Flash · Mixture of Experts
0 likes · 15 min read
AI Insight Log
Jan 20, 2026 · Artificial Intelligence

Is GLM-4.7-Flash the New 30B‑Level LLM King? Open‑Source and Ollama‑Ready

GLM‑4.7‑Flash, a 30B‑parameter MoE LLM released as fully open‑source and free, delivers 30B‑class performance across six benchmarks, runs locally with a single Ollama command, and offers a faster cloud‑hosted version with modest token‑based pricing, though hardware costs still apply.

Anthropic API · Benchmark · GLM-4.7-Flash
0 likes · 7 min read
JD Tech
Jan 13, 2026 · Artificial Intelligence

Mastering Large Language Models: Transformers, Scaling Laws, and MoE Explained

This extensive guide walks readers through the fundamentals of large language models, covering transformer architecture, pre‑training and fine‑tuning techniques, scaling laws, emergent abilities, mixture‑of‑experts designs, and practical comparisons, providing clear explanations, code snippets, and visual illustrations for deep learning practitioners.

Fine-tuning · Mixture of Experts · emergent abilities
0 likes · 47 min read
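As a rough illustration of the mixture-of-experts idea this guide covers, here is a minimal top-k routing sketch in plain Python. It is not taken from the article: the scalar toy experts, the router logits, and the function names `softmax` and `moe_forward` are all hypothetical, and production routers operate on vectors and usually add load balancing.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, experts, k=2):
    """Route x to the k highest-scoring experts and mix their outputs.

    Only the selected experts are evaluated, which is why the active
    parameter count is far smaller than the total parameter count.
    """
    probs = softmax(router_logits)
    # Indices of the k largest gate probabilities (ties keep index order).
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    out = sum((probs[i] / norm) * experts[i](x) for i in top)
    return out, top


# Four toy "experts" acting on a scalar input (hypothetical stand-ins for FFNs).
experts = [
    lambda x: x + 1.0,
    lambda x: 2.0 * x,
    lambda x: -x,
    lambda x: x * x,
]

y, chosen = moe_forward(3.0, [0.1, 2.0, -1.0, 2.0], experts, k=2)
print(chosen, y)
```

With these logits, experts 1 and 3 tie for the highest score, so each contributes half of the output and the other two experts are never called.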
Baobao Algorithm Notes
Dec 25, 2025 · Artificial Intelligence

TeleChat3-105B: China’s First 100B‑Scale MoE Model and Its Technical Breakthroughs

The article analyzes TeleChat3-105B-A4.7-Thinking, the first domestically built 100‑billion‑parameter Mixture‑of‑Experts model, detailing its multi‑dimensional evaluation, three‑stage training pipeline, hardware‑level optimizations, fine‑grained architecture, and its significance for the evolving AI competition landscape.

AI training · Chinese AI · Mixture of Experts
0 likes · 6 min read
HyperAI Super Neural
Dec 19, 2025 · Artificial Intelligence

Weekly AI Paper Digest: Open-Source LLMs, Agent Systems, and Long-Context Reasoning

This week’s AI paper roundup reviews six recent research works—including RecGPT‑V2, Nemotron 3 Nano, FrontierScience benchmark, AutoGLM, Deeper‑GXX, and QwenLong‑L1.5—highlighting advances in large‑language‑model‑driven recommendation, Mixture‑of‑Experts models, expert‑level scientific reasoning, GUI‑based foundation agents, graph neural network deepening, and ultra‑long‑context inference.

AI research · Agent Systems · Benchmark
0 likes · 6 min read
AI Frontier Lectures
Dec 9, 2025 · Artificial Intelligence

Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive

This article analyzes why optimizing sequence‑level rewards for LLMs with token‑level surrogate objectives can improve reinforcement‑learning stability, explains the theoretical conditions required, introduces Routing Replay for MoE models, and presents extensive experiments validating the approach.

Importance Sampling · Mixture of Experts · large language models
0 likes · 12 min read
PaperAgent
Dec 4, 2025 · Artificial Intelligence

Mistral 3 Unveiled: How Its New Open‑Source Models Redefine Performance and Cost

Mistral AI’s latest open‑source release, Mistral 3, introduces three compact dense models and the powerful Mistral Large 3 MoE model, outperforming domestic rivals in benchmarks, offering strong multilingual and multimodal capabilities, and delivering the lowest cost‑performance ratio among open‑source LLMs.

Mistral 3 · Mixture of Experts · Model Benchmark
0 likes · 4 min read
AI2ML AI to Machine Learning
Dec 3, 2025 · Artificial Intelligence

2026 Forecast: How Large‑Model AI Will Evolve After 2025 Breakthroughs

The article reviews the major 2025 breakthroughs in multimodal, open‑source, and deployment technologies for large models and outlines four 2026 trends—including ToC vs. ToB service split, dual‑hand data generation, MoE routing advances, and AI4Science breakthroughs—that will shape the next wave of AI development.

AI deployment · AI4Science · Mixture of Experts
0 likes · 6 min read
AntTech
Nov 11, 2025 · Artificial Intelligence

Breaking the Efficiency Wall: Ant Group’s Bailing Model Paves the Way to AGI

At CNCC 2025, Ant Group’s Vice President Zhou Jun outlined the Bailing large‑model’s five‑layer architecture, hybrid linear attention, Ling Scaling Law, and novel training algorithms that dramatically cut costs and latency, achieving state‑of‑the‑art performance on math and code benchmarks while promoting open‑source collaboration toward AGI.

AGI · Mixture of Experts · large language models
0 likes · 8 min read
Tencent Technical Engineering
Nov 10, 2025 · Artificial Intelligence

How Large Language Models Evolved in 2025: From DeepSeek to Kimi‑K2 and Beyond

This article maps the rapid evolution of open‑source large language models in 2025, explains the underlying architectural breakthroughs such as MLA, MoE, and NSA, compares dozens of models—including DeepSeek‑V3, OLMo2, Gemma3, Llama4, Qwen3, and Kimi‑K2—and highlights the emergence of powerful AI assistants like Dola, providing developers with a concise technical roadmap.

AI Assistant · LLM efficiency · Mixture of Experts
0 likes · 44 min read
DataFunTalk
Nov 10, 2025 · Artificial Intelligence

How Open-Source AI Models Are Outperforming Closed Giants on Cost and Performance

The article examines how open‑source models like DeepSeek‑R1 and Kimi K2 Thinking are challenging the traditional closed‑source, high‑capital AI paradigm by achieving comparable or superior benchmark results at a fraction of the training cost, reshaping market expectations, investment strategies, and the economics of AI development.

AI market dynamics · Mixture of Experts · benchmark performance
0 likes · 11 min read
Radish, Keep Going!
Nov 4, 2025 · Artificial Intelligence

What You Need to Know: Backpropagation, FreeBSD, AI MoE, and More Tech Insights

This roundup covers essential insights on backpropagation fundamentals, FreeBSD self‑hosting benefits, an open‑source 30B MoE AI model, misuse of cybercrime laws, historic moving sidewalks, party‑planning hacks, deceptive signal‑strength tricks, a 1000‑hp micro motor, Nextcloud performance fixes, and Google Cloud account suspensions, offering a blend of technical depth and practical advice.

AI · Backpropagation · Deep Learning
0 likes · 11 min read
Fighter's World
Oct 25, 2025 · Artificial Intelligence

Rationally Understanding AI Capability Limits: Jason Wei’s Framework from Stanford

Jason Wei’s Stanford AI Club talk outlines three analytical ideas—Intelligence as a Commodity, Verifier's Law, and the Jagged Edge of Intelligence—to help businesses rationally assess AI’s economic shape, verification dynamics, and uneven performance across tasks.

Adaptive Computation · Human-in-the-Loop · Intelligence as a Commodity
0 likes · 23 min read
Meituan Technology Team
Sep 11, 2025 · Artificial Intelligence

How LongCat-Flash Achieves Ultra-Fast, Low-Cost AI Agent Inference with SGLang

LongCat-Flash, an open‑source Mixture‑of‑Experts model released by Meituan, leverages model‑system co‑design, PD‑disaggregation, SBO scheduling and large‑scale expert parallelism within the SGLang framework to deliver dramatically lower latency, higher throughput and cost‑effective inference for AI agents, with detailed deployment instructions provided.

LongCat-Flash · Low latency · Mixture of Experts
0 likes · 15 min read
Data Party THU
Sep 10, 2025 · Industry Insights

MoE vs MoR: Deep Dive into Expert and Recursive Mixture Architectures for LLMs

This article provides a comprehensive technical comparison between Mixture of Experts (MoE) and the newly proposed Mixture of Recursion (MoR) architectures, covering design principles, parameter efficiency, inference latency, training stability, routing mechanisms, hardware deployment considerations, and suitable application scenarios.

Hardware Deployment · Mixture of Experts · Mixture of Recursion
0 likes · 13 min read
Data Party THU
Sep 4, 2025 · Artificial Intelligence

How MXFP4 Quantization Lets a 120‑Billion‑Parameter LLM Run on a Single 80GB GPU

This article analyzes the memory bottleneck of massive language models, explains the mathematical modeling of memory requirements, evaluates traditional sharding limits, and details how GPT‑OSS’s MXFP4 quantization combined with Mixture‑of‑Experts reduces memory, bandwidth, and compute demands enough to fit a 120‑billion‑parameter model onto an 80 GB GPU with minimal accuracy loss.

FP4 · LLM · MXFP4
0 likes · 11 min read
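The memory claim in this summary can be sanity-checked with back-of-the-envelope arithmetic. The sketch below takes gpt-oss-120b's nominal 120 billion parameters and assumes, for simplicity, that every weight is stored in MXFP4 (4-bit elements plus one shared 8-bit scale per block of 32, so 4.25 bits per parameter). Real deployments keep some tensors in higher precision and also need room for the KV cache and activations, so treat the result as a lower-bound estimate. The function name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Weight-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: nominal parameter count, all weights quantized uniformly.
N_PARAMS = 120e9

# MXFP4: 4-bit elements plus one 8-bit shared scale per 32-element block.
MXFP4_BITS = 4 + 8 / 32  # 4.25 bits per parameter

bf16_gb = weight_memory_gb(N_PARAMS, 16)
mxfp4_gb = weight_memory_gb(N_PARAMS, MXFP4_BITS)

print(f"BF16 weights:  {bf16_gb:.2f} GB")   # 240.00 GB, too large for one 80 GB GPU
print(f"MXFP4 weights: {mxfp4_gb:.2f} GB")  # 63.75 GB, leaves headroom on 80 GB
```

Under these assumptions the weights shrink by a factor of about 3.8, which is the difference between needing several GPUs and fitting on one.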
Data Party THU
Sep 3, 2025 · Artificial Intelligence

Unlocking Large Model Secrets: Transformers, MoE, Fine‑Tuning, RAG & KV Caching

This article provides a comprehensive technical overview of today’s large‑model ecosystem, covering the Transformer architecture, Mixture‑of‑Experts extensions, five fine‑tuning methods, the evolution from traditional RAG to agentic RAG, classic agent design patterns, diverse text‑chunking strategies, and the KV‑cache optimization that accelerates inference.

Agentic AI · Fine‑tuning · KV cache
0 likes · 13 min read
Baobao Algorithm Notes
Sep 2, 2025 · Artificial Intelligence

How LongCat‑Flash Achieves Record Speed and Efficiency for a 560B MoE Model

LongCat‑Flash is a 560‑billion‑parameter Mixture‑of‑Experts LLM that combines a dynamic zero‑computation expert design, shortcut‑connected MoE communication, variance‑aligned scaling, and a three‑stage agent‑centric pre‑training pipeline, delivering over 100 TPS on H800 GPUs at a cost of $0.70 per million tokens.

Inference Optimization · LongCat-Flash · Mixture of Experts
0 likes · 23 min read
Bighead's Algorithm Notes
Sep 1, 2025 · Artificial Intelligence

How MERA’s Retrieval‑Augmented MoE Boosts Stock Selection Performance by 11%

The article introduces MERA, a Retrieval‑Augmented Mixture‑of‑Experts module that addresses the inability of single‑branch deep‑learning models to capture diverse stock market patterns, describes its self‑supervised pretraining, gating and expert mechanisms, and shows that it improves stock‑selection metrics by up to 11% on major Chinese indices.

MERA · Mixture of Experts · Retrieval Augmented Representation
0 likes · 14 min read
AI Info Trend
Aug 12, 2025 · Artificial Intelligence

OpenAI’s First Open‑Source Weights: Inside gpt‑oss‑120B & 20B Models

OpenAI has unveiled its first open‑source weight models in over five years—gpt‑oss‑120B and gpt‑oss‑20B—detailing their MoE architecture, quantization techniques, benchmark performance, licensing, and the industry’s mixed reactions, while hinting at future open‑source AI developments.

AI benchmarks · GPT-OSS · Industry analysis
0 likes · 6 min read
Programmer DD
Aug 6, 2025 · Artificial Intelligence

What Is GPT-OSS? Inside OpenAI’s New Open‑Source Large Language Models

OpenAI has unveiled GPT‑OSS, an open‑source large language model series featuring a 120‑billion‑parameter version for high‑throughput production and a 20‑billion‑parameter version for low‑latency consumer hardware, both using Mixture‑of‑Experts architecture, 4‑bit quantization, and released under the permissive Apache 2.0 license.

4-bit quantization · Apache 2.0 license · GPT-OSS
0 likes · 3 min read
AI Frontier Lectures
Jul 31, 2025 · Artificial Intelligence

What’s Driving the Latest LLM Architecture Trends? DeepSeek, OLMo, Gemma, and More Explained

This article examines the evolution of large language model architectures over the past seven years, comparing key design choices such as Multi‑Head Latent Attention, Grouped‑Query Attention, Mixture‑of‑Experts, sliding‑window attention, normalization placement, and optimizer variants across models like DeepSeek V3, OLMo 2, Gemma 3, Llama 4, Qwen 3, SmolLM 3, and Kimi K2.

AI research · LLM comparison · Mixture of Experts
0 likes · 30 min read
ITFLY8 Architecture Home
Jun 24, 2025 · Artificial Intelligence

How Transformers and Mixture-of-Experts Power Large Language Models

This article explores the role of Transformers and Mixture‑of‑Experts in large models, outlines five fine‑tuning methods, compares traditional and agentic RAG, presents classic agent design patterns, text‑chunking strategies, levels of intelligent agent systems, and explains KV‑caching techniques.

Fine-tuning · Mixture of Experts · RAG
0 likes · 2 min read
AntTech
Jun 18, 2025 · Artificial Intelligence

How Ant Group’s Baoling Models Push Toward AGI with MoE and Multimodal Innovations

In a detailed AICon talk, Ant Group’s Baoling team leader Zhou Jun outlines their latest large‑model training techniques, MoE architecture optimizations, multimodal breakthroughs, open‑source releases, and the strategic roadmap needed to turn AI into a ubiquitous, “scan‑code‑level” everyday assistant.

AI Infrastructure · Mixture of Experts · large language models
0 likes · 25 min read
Xiaohongshu Tech REDtech
Jun 6, 2025 · Artificial Intelligence

How dots.llm1 Sets New Benchmarks for Open‑Source MoE Language Models

dots.llm1, an open‑source 142‑billion‑parameter Mixture‑of‑Experts language model from hi lab, achieves Qwen2.5‑72B‑level performance after training on 11.2 T high‑quality tokens, and the release includes full models, intermediate checkpoints, and detailed training pipelines for the research community.

AI research · Mixture of Experts · Training efficiency
0 likes · 10 min read
Java Web Project
Jun 4, 2025 · Artificial Intelligence

Why DeepSeek V3 Stands Out: Architecture, Performance, and Open‑Source Edge

The article analyzes DeepSeek's rapid adoption, detailing its seven core models, the third‑generation MoE architecture, FP8 mixed‑precision training, 128K context window, benchmark superiority on MMLU/HumanEval/CMMLU, low training cost, and fully open‑source release, while also introducing a companion guide for developers.

AI Architecture · DeepSeek · FP8 training
0 likes · 9 min read
IT Services Circle
May 25, 2025 · Artificial Intelligence

DeepSeek Core Technologies and Model Innovations: DeepSeek‑V3 and DeepSeek‑R1 Technical Overview

The article provides a detailed technical overview of DeepSeek's flagship large language models, DeepSeek‑V3 and DeepSeek‑R1, describing their MoE architecture, training frameworks, reinforcement‑learning based fine‑tuning, inference optimizations, and the broader impact of these innovations on the AI landscape while also promoting related books and resources.

AI · DeepSeek · Mixture of Experts
0 likes · 10 min read
AI Algorithm Path
May 9, 2025 · Artificial Intelligence

A Visual Guide to Mixture of Experts (MoE) Architecture in Large Language Models

This article explains the Mixture of Experts (MoE) technique used in modern LLMs, detailing its core components—experts and router—comparing dense and sparse layers, describing load‑balancing, expert capacity, and routing strategies, and showcasing real‑world examples such as Switch Transformer, Vision‑MoE, and Mixtral 8x7B.

Expert Capacity · LLM · Mixture of Experts
0 likes · 15 min read
Architects' Tech Alliance
May 2, 2025 · Artificial Intelligence

DeepSeek‑Prover‑V2‑671B: A Massive AI Model for Formal Mathematical Theorem Proving

DeepSeek‑Prover‑V2‑671B, a 671‑billion‑parameter model released on Hugging Face, advances formal mathematical theorem proving with an MoE architecture, FP8 quantization, and a 163K‑token context. The article reports superior performance over GPT‑4 Turbo and other models and discusses the broad implications for research and industry.

AI · DeepSeek · FP8 quantization
0 likes · 11 min read
AI Algorithm Path
May 2, 2025 · Artificial Intelligence

Qwen3 Launch: Open-Source Models Redefine General AI

The Qwen3 series introduces eight open‑source large language models ranging from 0.6B to 235B parameters, combines dense and Mixture‑of‑Experts architectures, supports multimodal input, offers mixed inference modes, and demonstrates benchmark superiority over leading models such as OpenAI o1 and Gemini 2.5 Pro.

AI Agents · Benchmark · Mixture of Experts
0 likes · 10 min read
AI Frontier Lectures
Apr 12, 2025 · Artificial Intelligence

How ByteDance Scales Attn/MoE: Cost Models, Mesh Communication, and Network Hacks

The article analyzes ByteDance's MegaScale‑Infer paper, detailing micro‑batching, M:N Attn‑MoE ratios, cost‑driven constraint search, communication redesign with Mesh All‑2‑All, network latency challenges, and innovative NIC and routing solutions for large‑scale mixture‑of‑experts inference.

AI inference · ByteDance · Cost Optimization
0 likes · 7 min read
21CTO
Apr 7, 2025 · Artificial Intelligence

Llama 4 Unveiled: Breakthrough Multimodal Models Redefine AI Capabilities

Meta's Llama 4 series introduces the Scout, Maverick, and Behemoth models—featuring Mixture‑of‑Experts architectures, unprecedented 10‑million‑token context windows, and state‑of‑the‑art performance across vision, language, and multimodal benchmarks—while emphasizing efficient training, open‑source availability, and robust safety safeguards.

AI Safety · Llama 4 · Mixture of Experts
0 likes · 14 min read
Data Thinking Notes
Apr 6, 2025 · Artificial Intelligence

Why Mixture of Experts (MoE) is Revolutionizing Large AI Models

Mixture of Experts (MoE) leverages dynamic conditional computation and specialized expert networks to overcome the parameter explosion and inefficiency of dense models, offering scalable capacity, multi‑task adaptability, and improved efficiency, while addressing challenges such as training stability, communication overhead, and load balancing.

Deep Learning · Mixture of Experts · Model Scaling
0 likes · 7 min read
AI Algorithm Path
Apr 6, 2025 · Artificial Intelligence

Meta’s Open-Source Llama 4: 2‑Trillion‑Parameter Behemoth Redefines AI

Meta’s newly released Llama 4 models, Maverick with 402 billion total parameters and Scout with 109 billion, feature a 128‑expert MoE, 10 million‑token context, native multimodal fusion, and FP8 training, delivering benchmark‑leading performance that outpaces GPT‑4o, Gemini 2.0 Flash and DeepSeek V3, while being openly available on Hugging Face and GitHub.

Benchmark · FP8 training · Llama 4
0 likes · 8 min read
DataFunTalk
Apr 6, 2025 · Artificial Intelligence

Meta Unveils Llama 4: New Multimodal AI Models with Mixture‑of‑Experts Architecture and 10 Million‑Token Context

Meta announced the Llama 4 series—Scout, Maverick and Behemoth—featuring multimodal capabilities, Mixture‑of‑Experts design, up to 10 million‑token context windows, and state‑of‑the‑art performance on STEM, multilingual and image benchmarks, with models now downloadable from llama.com and Hugging Face.

Llama 4 · Mixture of Experts · Model Training
0 likes · 14 min read
AI Frontier Lectures
Apr 3, 2025 · Artificial Intelligence

How ChartMoE Uses Sparse MoE to Master Chart Understanding and Preserve General Knowledge

ChartMoE, an oral paper at ICLR 2025, introduces a multi‑stage alignment training pipeline and a diversified MoE Connector that dramatically improves chart comprehension while maintaining performance on general multimodal tasks, backed by extensive data construction, training recipes, and thorough evaluations.

Chart Understanding · ChartMoE · Mixture of Experts
0 likes · 10 min read
Baidu Geek Talk
Apr 2, 2025 · Artificial Intelligence

DeepSeek-VL2 Multimodal Model: Architecture, Training, and Code Walkthrough

DeepSeek‑VL2 is a state‑of‑the‑art multimodal model built on a Mixture‑of‑Experts architecture that combines a SigLIP‑L vision encoder with dynamic tiling, a two‑layer VL adaptor, and a DeepSeek‑MoE language model using Multi‑head Latent Attention, trained in three stages on diverse visual‑language and text data, and achieving strong results on benchmarks such as DocVQA and TextVQA, with full implementation and inference code available in PaddleMIX.

DeepSeek-VL2 · Inference · Mixture of Experts
0 likes · 36 min read
AI Algorithm Path
Mar 26, 2025 · Artificial Intelligence

DeepSeek V3-0324 Upgrade Delivers Smarter Coding and Higher Code Quality

The DeepSeek V3-0324 model, released on March 24, 2025 with 685 billion parameters and a Mixture‑of‑Experts architecture, is fully open‑source on Hugging Face and brings notable upgrades in coding ability, structured responses, stability, generation length, and speed, while offering performance comparable to leading closed‑source models such as Claude 3.7.

AI code generation · Coding AI · DeepSeek
0 likes · 10 min read
AntTech
Mar 18, 2025 · Artificial Intelligence

MoLE: Decoding by Mixture of Layer Experts Alleviates Hallucination in Large Vision-Language Models

Researchers from Ant Insurance and Zhejiang University propose MoLE, a Mixture of Layer Experts decoding method that reduces hallucinations in large vision‑language models, demonstrating state‑of‑the‑art performance on LVLM benchmarks and enabling reliable end‑to‑end medical‑record‑to‑claim automation.

AI · Mixture of Experts · Vision-Language Models
0 likes · 7 min read
NewBeeNLP
Mar 11, 2025 · Artificial Intelligence

How DeepSeek’s New Architecture Redefines LLM Efficiency and Performance

This article analyzes DeepSeek’s recent breakthroughs—including the Multi‑Head Latent Attention (MLA), Group Relative Policy Optimization (GRPO), and a refined Mixture‑of‑Experts design—along with its three‑stage training pipeline, RL‑only R1‑Zero variant, and benchmark comparisons against GPT‑4o‑Mini and Llama 3.1, highlighting both gains and remaining challenges.

DeepSeek · LLM · Mixture of Experts
0 likes · 18 min read
Tencent Cloud Developer
Mar 5, 2025 · Artificial Intelligence

DeepSeek Series Overview: Core Technologies, Model Innovations, and Product Highlights

The article delivers a PPT‑style deep dive into the DeepSeek series—from the original LLM through DeepSeek‑MoE, Math, V2, V3 and R1—highlighting core innovations such as Multi‑Head Latent Attention, fine‑grained MoE, GRPO reinforcement learning, Multi‑Token Prediction, DualPipe parallelism and FP8 training that together achieve high performance at a fraction of traditional costs, and notes their integration into Tencent’s OlaChat intelligent assistant.

AI · DeepSeek · FP8 training
0 likes · 21 min read
Architect
Mar 2, 2025 · Artificial Intelligence

Demystifying Mixture of Experts: How MoE Boosts LLMs and Vision Models

This article explains the Mixture of Experts (MoE) architecture, detailing experts, routers, dense vs. sparse layers, load‑balancing strategies such as KeepTopK, auxiliary loss, capacity constraints, the Switch Transformer simplification, and how MoE is applied to both language and vision models, illustrated with concrete examples and parameter counts.

Mixture of Experts · MoE · Sparse Models
0 likes · 17 min read
DataFunTalk
Feb 28, 2025 · Artificial Intelligence

DeepSeek LLM Series (V1‑V3) and R1: Architecture, Training Strategies, Evaluation, and Distillation

An in‑depth overview of the DeepSeek LLM series (V1‑V3) and the R1 models, covering their architectures, scaling‑law experiments, data pipelines, training strategies (including MoE, MLA, FP8, multi‑step learning‑rate scheduling, and reinforcement learning), extensive evaluation results, and knowledge‑distillation techniques.

Mixture of Experts · scaling laws
0 likes · 36 min read
Tencent Cloud Developer
Feb 27, 2025 · Artificial Intelligence

DeepSeek LLM Series (V1‑V3, R1) Technical Overview and Analysis

The DeepSeek technical overview details the evolution from the dense 67 B V1 model through the 236 B MoE‑based V2 and 671 B V3 with FP8 training, to the R1 series, whose R1‑Zero variant learns reasoning through reinforcement learning alone, without supervised fine‑tuning. It highlights innovations such as Grouped‑Query Attention, Multi‑Head Latent Attention, auxiliary‑loss‑free load balancing for MoE, Multi‑Token Prediction, and knowledge distillation, and reports state‑of‑the‑art benchmark results and open‑source reproduction projects.

AI research · DeepSeek · Mixture of Experts
0 likes · 37 min read
IT Architects Alliance
Feb 26, 2025 · Artificial Intelligence

DeepSeek Large Model: Core Architecture, Key Technologies, and Training Strategies

The article provides an in‑depth overview of DeepSeek’s large language model, detailing its mixture‑of‑experts and Transformer foundations, novel attention mechanisms, load‑balancing, multi‑token prediction, FP8 mixed‑precision training, and various training regimes such as knowledge distillation and reinforcement learning.

DeepSeek · FP8 · MLA
0 likes · 18 min read
Architect
Feb 24, 2025 · Artificial Intelligence

Inside MoBA: A Sparse Attention Framework for 10‑Million‑Token Contexts

The article details the development, architectural evolution, and practical challenges of MoBA—a sparse attention framework inspired by Mixture‑of‑Experts that scales LLM context length to 10 M tokens, supports seamless switching between full and sparse attention, and is now released as a minimal open‑source solution.

AI Architecture · Context Parallel · LLM training
0 likes · 13 min read
Architecture Digest
Feb 24, 2025 · Artificial Intelligence

MoBA: Mixture of Block Attention for Long‑Context Large Language Models

The article introduces MoBA, a Mixture‑of‑Block‑Attention mechanism that applies Mixture‑of‑Experts principles to transformer attention, enabling efficient long‑context processing for large language models while maintaining performance comparable to full attention through sparse, trainable block selection and seamless switching.

Attention Mechanism · LLM · Mixture of Experts
0 likes · 12 min read
Architect
Feb 21, 2025 · Artificial Intelligence

DeepSeek Model Innovations: Architecture, Training Methods, and Performance Evaluation

This article reviews DeepSeek's recent breakthroughs, including the MLA attention redesign, GRPO alignment algorithm, MoE enhancements, multi‑stage training pipelines (SFT, RL, preference tuning, distillation), and comparative performance against GPT‑4o‑Mini and Llama 3.1, highlighting both strengths and remaining challenges.

DeepSeek · Mixture of Experts · Model Evaluation
0 likes · 16 min read
Architect's Alchemy Furnace
Feb 19, 2025 · Artificial Intelligence

How DeepSeek Beats GPT-4 with 10× Less Compute: Inside the AI Efficiency Revolution

This article examines DeepSeek's breakthrough AI techniques—including a revamped MoE architecture, aggressive data distillation, ultra‑low‑energy training, novel multi‑stage training strategies, and custom AI chips—that enable a 7B model to rival GPT‑4 while consuming a fraction of the resources.

AI efficiency · Data distillation · DeepSeek
0 likes · 9 min read
Architect's Alchemy Furnace
Feb 19, 2025 · Artificial Intelligence

DeepSeek’s Self‑Correction: Transforming AI Reliability and Safety

The article explores DeepSeek’s innovative self‑correction system—combining a Mixture‑of‑Experts architecture with reinforcement‑learning feedback—to achieve real‑time error detection, dynamic knowledge‑graph updates, and enhanced safety in high‑risk fields like autonomous driving and medical diagnostics.

AI Safety · DeepSeek · Mixture of Experts
0 likes · 9 min read
IT Architects Alliance
Feb 15, 2025 · Artificial Intelligence

DeepSeek: Architecture, Core Technologies, Training Strategies, and Comparative Analysis

The article provides an in‑depth overview of DeepSeek's transformer‑based foundation, Mixture‑of‑Experts architecture, novel attention mechanisms, multi‑token prediction, FP8 mixed‑precision training, knowledge distillation, reinforcement‑learning approaches, and compares its performance and cost advantages against leading models such as GPT and Gemini.

AI model architecture · DeepSeek · FP8 training
0 likes · 29 min read
Lao Guo's Learning Space
Feb 15, 2025 · Artificial Intelligence

What Is deepseek-MoE? Understanding the Mixture‑of‑Experts Architecture

The article explains deepseek-MoE (Mixture of Experts), describing its full English name, Chinese translation, how a gating network selects and weights multiple expert models for each input, and uses an analogy to illustrate load‑balancing and the divide‑and‑conquer design in large AI models.

AI Architecture · Mixture of Experts · deepseek-MoE
0 likes · 2 min read
Tencent Technical Engineering
Feb 14, 2025 · Artificial Intelligence

Technical Overview of DeepSeek Series Models and Innovations

The DeepSeek series introduces a refined Mixture‑of‑Experts architecture with fine‑grained expert partitioning, shared experts, and learnable load‑balancing, alongside innovations such as Group Relative Policy Optimization, Multi‑Head Latent Attention, Multi‑Token Prediction, mixed‑precision FP8 training, and the R1/R1‑Zero models that use Long‑CoT reasoning, reinforcement‑learning pipelines, and distillation to achieve OpenAI‑comparable performance at lower cost.

AI · DeepSeek · Mixture of Experts
0 likes · 25 min read
AI Algorithm Path
Feb 12, 2025 · Artificial Intelligence

Essential DeepSeek‑R1 Reading List: Papers Behind the 2025 Hottest LLM

This article compiles a curated reading list of foundational and recent research papers—from the original Transformer to chain‑of‑thought, mixture‑of‑experts, and reinforcement‑learning studies—that together explain the breakthroughs behind DeepSeek‑R1 and guide readers through the technical evolution of modern large language models.

DeepSeek · Mixture of Experts · Research Papers
0 likes · 15 min read
Data Thinking Notes
Feb 11, 2025 · Artificial Intelligence

Why DeepSeek V3 and R1 Are Redefining LLM Efficiency and Power

This article analyzes DeepSeek's V3 and R1 large language models, detailing their low‑cost Mixture‑of‑Experts architecture, Multi‑Head Latent Attention redesign, distributed training optimizations, and reasoning‑focused innovations that together challenge traditional GPU/NPU compute demands.

AI inference · DeepSeek · MLA
0 likes · 15 min read
Architect
Feb 10, 2025 · Artificial Intelligence

Evolution of DeepSeek Mixture‑of‑Experts (MoE) Architecture from V1 to V3

This article reviews the development of DeepSeek's Mixture-of-Experts (MoE) models, tracing their evolution from the original DeepSeekMoE V1 through V2 to V3, detailing architectural innovations such as fine‑grained expert segmentation, shared‑expert isolation, load‑balancing losses, device‑limited routing, and the shift from softmax to sigmoid gating.

DeepSeek · LLM · Mixture of Experts
0 likes · 21 min read
Huawei Cloud Developer Alliance
Feb 8, 2025 · Artificial Intelligence

Why DeepSeek V3 and R1 Are Redefining Low‑Cost AI: Architecture, Training Tricks, and Industry Impact

This article analyses DeepSeek's V3 and R1 models, explaining how their innovative MoE architecture, Multi‑Head Latent Attention, low‑cost training strategies, and distributed‑training optimizations deliver high‑performance large language models while reducing GPU/NPU demand and sparking industry excitement.

AI inference · DeepSeek · Mixture of Experts
0 likes · 16 min read
Alibaba Cloud Developer
Feb 7, 2025 · Artificial Intelligence

Why DeepSeek V3 Achieves Low Training Costs: Inside Its AI Innovations

This article provides a comprehensive analysis of DeepSeek's large‑language‑model technology, covering the company's background, model capabilities, remarkably low training and inference costs, and the core architectural and algorithmic innovations such as MoE, MLA attention, FP8 mixed‑precision, and the DualPipe pipeline that enable efficient large‑scale AI deployment.

AI Architecture · DeepSeek · FP8 training
0 likes · 19 min read
Tencent Cloud Developer
Feb 6, 2025 · Artificial Intelligence

DeepSeek V Series: Technical Overview of Scaling Laws, Grouped Query Attention, and Mixture‑of‑Experts

The article reviews DeepSeek’s V‑series papers, explaining how scaling‑law insights, Grouped Query Attention, a depth‑first design, auxiliary‑loss‑free load balancing, multi‑token prediction and Multi‑Head Latent Attention together enable economical mixture‑of‑experts LLMs that rival closed‑source models while cutting compute and hardware costs.

DeepSeek · Grouped Query Attention · Mixture of Experts
0 likes · 13 min read
AI2ML AI to Machine Learning
Feb 5, 2025 · Artificial Intelligence

What Optimizations Power DeepSeek’s High‑Efficiency LLMs?

The article enumerates DeepSeek’s extensive technical optimizations—including Grouped Query Attention, Multi‑head Latent Attention, Mixture‑of‑Experts, 4D parallelism, quantization, and multi‑token prediction—that together enable cheap, high‑performance large language models.

4D parallelism · DeepSeek · Grouped Query Attention
0 likes · 8 min read