Tagged articles

MoE

62 articles · Page 1 of 1

Jun 11, 2026 · Artificial Intelligence

Keye-VL-2.0 Brings DeepSeek Sparse Attention to Multimodal AI – Report Released

Keye‑VL‑2.0, an open‑source MoE multimodal foundation model, targets hour‑level video understanding and agentic intelligence. It embeds DeepSeek Sparse Attention into a GQA‑based architecture for a near‑lossless 256 K‑token context, combines four‑stage pre‑training with diverse RL distillation techniques, and reports state‑of‑the‑art results on long‑video benchmarks; the weights are publicly released.

MoE · Multimodal · RL distillation

0 likes · 8 min read

Old Zhang's AI Learning

Jun 11, 2026 · Artificial Intelligence

Google’s 26B DiffusionGemma Model Delivers 1000+ Tokens/s – Runs on a 4090

DiffusionGemma, Google DeepMind’s 26B MoE model that generates 256‑token blocks via diffusion, achieves over 1000 tokens per second on H100/H200 GPUs, offers FP8 and NVFP4 quantized versions with near‑lossless accuracy, and can be deployed locally with vLLM Docker images, though it incurs higher first‑token latency and limited concurrency.

26B model · DiffusionGemma · FP8 quantization

0 likes · 10 min read

Su San Talks Tech

Jun 9, 2026 · Artificial Intelligence

Zero‑Cost Unlimited‑Token Access to Qwen 3.6: A Step‑by‑Step Guide

This article explains how developers can bypass token‑cost barriers by using iFlytek’s MaaS platform to obtain free, unlimited‑token access to the Qwen 3.6‑35B‑A3B model, details the model’s MoE architecture and benchmark performance, and provides a complete Java integration tutorial with code samples and practical use‑case suggestions.

AI · API · Java

0 likes · 16 min read

Old Zhang's AI Learning

Jun 1, 2026 · Artificial Intelligence

NVIDIA Unveils Nemotron 3 Ultra: The Largest US Open‑Source LLM Boosting Agent Capabilities

NVIDIA released Nemotron 3 Ultra, a 550 B‑parameter open‑source LLM with 55 B active MoE parameters, a hybrid Mamba‑Transformer architecture, a 1 M‑token context, and three core innovations that deliver superior MMLU, code, and math scores and up to 5× throughput versus rivals, though the weights are not yet public.

Large Language Model · Mamba · MoE

0 likes · 8 min read

Old Zhang's AI Learning

May 31, 2026 · Artificial Intelligence

Qwen3.6-35B-A3B NVFP4: A Stable, Highly Compressed Quantized Model

NVIDIA's NVFP4 quantization shrinks Qwen3.6-35B-A3B's memory footprint roughly threefold with almost no accuracy loss, offers plug‑and‑play deployment via vLLM, and outperforms other 4‑bit formats on Hopper/Blackwell GPUs, making it a practical choice for production AI workloads.

MoE · NVFP4 · Quantization

0 likes · 13 min read

Xiaomi Tech

May 30, 2026 · Artificial Intelligence

How Xiaomi’s MiMo V2.5 Achieves 99% API Price Cut with Full‑Stack Inference Optimizations

The MiMo‑V2.5 series combines Hybrid Sliding‑Window Attention, Mixture‑of‑Experts and multimodal support with a complete redesign of KVCache management, tiered caching, prefix‑tree logic and scheduling, compressing KVCache to about one‑seventh of full‑attention models and delivering up to 40% faster Prefill, 30% lower TTFT and dramatically reduced inference costs that enable a 99% API price reduction.

Hybrid SWA · Inference Optimization · KVCache

0 likes · 12 min read

Machine Heart

May 28, 2026 · Artificial Intelligence

How Orbit Enables Single-Node RL Fine-Tuning of Trillion-Parameter Models like DeepSeek‑V4

Orbit’s adapter‑first design freezes a low‑precision base model and updates only a small adapter, allowing trillion‑parameter MoE models such as DeepSeek‑V4 to be RL‑fine‑tuned on a single 8×B200 node while keeping training and rollout precision aligned and memory usage within budget.

DeepSeek · MoE · Orbit framework

0 likes · 9 min read

Tencent Technical Engineering

May 24, 2026 · Artificial Intelligence

How Tsinghua & Tencent Mixed‑X Won the MLSys 2026 MoE Inference Challenge with a 4.1× Speedup

The Tsinghua‑Tencent Mixed‑X team captured the MLSys 2026 MoE inference optimization championship by analyzing NPU bottlenecks, redesigning data movement, applying expert‑level sharding, continuous DMA, PSUM batching, and an Agent‑based optimizer, achieving a 4.1× end‑to‑end speedup while preserving bit‑level output fidelity.

Agent optimizer · Inference Optimization · MLSys 2026

0 likes · 14 min read

Machine Learning Algorithms & Natural Language Processing

May 16, 2026 · Artificial Intelligence

Token Superposition Training Accelerates LLM Pre‑training 2.5× Without Changing Architecture

Token Superposition Training (TST) speeds up large‑language‑model pre‑training by up to 2.5× without altering model architecture or compute budget, using a superposition phase that averages token embeddings into bags and predicts groups of tokens, followed by a standard recovery phase, as demonstrated on 10B‑parameter MoE and smaller models.

LLM pretraining · MCE loss · MoE

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

May 14, 2026 · Artificial Intelligence

Boosting LLM Pre‑training 2.5× Without Architecture Changes or Extra Compute

Nous Research introduces Token Superposition Training, which groups tokens into bags, averages their embeddings, and predicts token groups without altering model architecture or adding compute, achieving up to 2.5× faster pre‑training while maintaining standard inference.

LLM pretraining · MCE loss · MoE

0 likes · 10 min read

SuanNi

May 12, 2026 · Artificial Intelligence

AntAngelMed: 6.1B‑Activated MoE Model Tops Three Medical Benchmarks

AntAngelMed, a 100‑billion‑parameter medical MoE LLM that activates only 6.1 billion parameters, achieves performance comparable to a 40 billion‑parameter dense model, exceeds 200 tokens/s inference speed, and ranks first on HealthBench, MedAIBench and MedBench, backed by a three‑stage training pipeline and extensive efficiency optimizations.

HealthBench · Large Language Model · MedAIBench

0 likes · 6 min read

Machine Learning Algorithms & Natural Language Processing

May 6, 2026 · Artificial Intelligence

Why DeepSeek‑V4’s MFU Drops: Parallel Strategies and Compute‑Communication Overlap

The article dissects DeepSeek‑V4’s shift from dense to MoE models, explains why MFU plummets despite sufficient expert dimensions, and details how a carefully designed GPU parallel strategy—combining DP, ZeRO‑1, PP, EP and the new Waved‑EP kernel—overlaps communication and computation to reclaim throughput on 8‑card NVLink nodes linked by InfiniBand.

DeepSeek-V4 · Expert Parallel · GPU Distributed Training

0 likes · 19 min read

Architects' Tech Alliance

May 4, 2026 · Artificial Intelligence

DeepSeek‑V4 Inference Cost Showdown: NVIDIA H100 vs Ascend 950PR vs 910C

DeepSeek‑V4, a 1.6‑trillion‑parameter MoE model with mixed‑precision attention, is benchmarked on three accelerators—NVIDIA H100, Huawei Ascend 910C, and Ascend 950PR—showing that the 950PR delivers the lowest per‑token cost in both Prefill and Decode phases, while the H100 offers the highest raw performance at a far greater price.

DeepSeek-V4 · FP8 · Huawei Ascend 950PR

0 likes · 8 min read

Architects' Tech Alliance

May 2, 2026 · Artificial Intelligence

Eight Chinese AI Chips Achieve Day‑Zero DeepSeek‑V4 Compatibility

The article explains how eight domestic AI chip makers—Huawei Ascend, Cambricon, HaiGuang, Moore Threads, Kunlun, Pingtouge, Muxi, and Tianshu—simultaneously completed full‑link compatibility, performance tuning, and stability verification for DeepSeek‑V4 on release day, detailing each vendor’s technical path, shared ecosystem breakthroughs, and the broader impact on the AI industry.

AI chips · Day0 adaptation · DeepSeek-V4

0 likes · 11 min read

Architects' Tech Alliance

May 1, 2026 · Artificial Intelligence

How DeepSeek V4 Triggers a Global AI Price War with OpenAI

DeepSeek V4’s open‑source 1 M‑token MoE model delivers benchmark scores of MMLU 88.7, C‑Eval 92.1 and HumanEval 69.5, while its 4‑bit AWQ quantization, PagedAttention memory management and FlashAttention acceleration cut inference costs and latency, prompting rivals such as Anthropic, OpenAI, Baidu and Huawei to slash prices and boost efficiency in a fierce market battle.

AI efficiency · DeepSeek-V4 · Large Language Model

0 likes · 9 min read

DataFunTalk

Apr 28, 2026 · Artificial Intelligence

Manifold AI’s WorldScape 0.2 Tops WorldArena: How MoE Drives Superior Physics and 3D Understanding

Manifold AI’s WorldScape 0.2 achieved the highest overall score on the embodied world‑model benchmark WorldArena, outperforming giants like Google and Nvidia by excelling in comprehensive perception, physics compliance, and 3D accuracy while using only about 10 % of the parameters of competing models, thanks to a newly introduced MoE architecture.

Embodied AI · MoE · Scaling Law

0 likes · 9 min read

DeepHub IMBA

Apr 27, 2026 · Artificial Intelligence

DeepSeek‑V4 Deep Dive: Engineering Million‑Token Context Efficiency

The article provides a thorough technical analysis of DeepSeek‑V4, detailing how mixed sparse attention (CSA + HCA), manifold‑constrained hyper‑connections, the Muon optimizer, FP4 quantization, and a suite of infrastructure tricks enable stable training and inference with up to one‑million token contexts while achieving state‑of‑the‑art benchmark results.

CSA · DeepSeek-V4 · FP4 quantization

0 likes · 22 min read

Machine Learning Algorithms & Natural Language Processing

Apr 25, 2026 · Artificial Intelligence

Why DeepSeek‑V4 Took Twice as Long: Inside the Training‑Stability Challenges and Engineering Hacks

The DeepSeek‑V4 technical report reveals that the model’s doubled training time stems from massive token and parameter scaling, severe training‑stability issues in MoE layers, and a suite of engineering solutions—including Anticipatory Routing, SwiGLU Clamping, specialist expert training, and a custom sandbox cluster—while also exposing high hallucination rates despite impressive benchmark performance.

DeepSeek-V4 · Generative Reward Model · LLM

0 likes · 12 min read

Architects' Tech Alliance

Apr 24, 2026 · Artificial Intelligence

DeepSeek V4 Launches with 1M‑Token Context, Dual Versions and Native Chinese Chip Support

On April 24, 2026 DeepSeek released the V4 preview featuring two models—V4‑Pro with a 1.6 T‑parameter MoE architecture and V4‑Flash with 284 B parameters—both offering 1 million token context, up to 384 K output tokens, new step‑wise reasoning modes, and full native compatibility with Huawei Ascend and Cambricon chips, while delivering major efficiency gains and benchmark‑leading performance.

1M token context · Cambricon · DeepSeek

0 likes · 7 min read

Old Zhang's AI Learning

Apr 23, 2026 · Artificial Intelligence

DeepSeek Quietly Open‑Sources TileKernels to Push GPU Performance to Its Limits

DeepSeek has released TileKernels, a GPU kernel library written in the TileLang DSL, that targets H100/H200/B200 GPUs and claims to approach hardware limits in compute intensity and memory bandwidth, offering MoE routing, FP8/FP4 quantization, and dual‑language PyTorch references for deep‑learning engineers.

FP8 quantization · GPU Optimization · LLM training

0 likes · 9 min read

PaperAgent

Apr 22, 2026 · Artificial Intelligence

Alibaba Unveils Four New Open‑Source Qwen3.6 Models: 27B Dense and 35B‑A3B MoE

Alibaba has released four new open‑weight models in its Qwen3.6 series, featuring the 27‑billion‑parameter dense multimodal model Qwen3.6‑27B and the 35‑billion‑parameter sparse MoE model Qwen3.6‑35B‑A3B, both designed for stable, real‑world coding tasks and outperforming their Qwen3.5 predecessors.

AI agents · Alibaba · Dense Model

0 likes · 4 min read

PaperAgent

Apr 21, 2026 · Artificial Intelligence

OpenMythos: Rebuilding Claude Mythos with Recursive Transformers and MoE

OpenMythos is an open‑source PyTorch reimplementation of Anthropic's Claude Mythos that uses a Mixture‑of‑Experts‑routed recurrent Transformer, introduces Recursive Depth Transformers, Multi‑Latent Attention, and several stability mechanisms, and demonstrates parameter‑efficient scaling backed by empirical studies.

AI Architecture · Claude Mythos · MoE

0 likes · 6 min read

HyperAI Super Neural

Apr 21, 2026 · Artificial Intelligence

Qwen3.6-35B-A3B Boosts Agent Programming: 3B Activation Beats Gemma4-31B

Qwen3.6-35B-A3B, the first open‑source Qwen3.6 model, achieves markedly better scores than Qwen3.5‑35B‑A3B and Gemma4‑31B on Terminal‑Bench2.0, NL2Repo, and QwenClawBench, adds a thought‑process retention option, and is accessible via HyperAI’s ready‑to‑run notebook with free compute credits.

Agent Programming · HyperAI · Large Language Model

0 likes · 4 min read

Lao Guo's Learning Space

Apr 8, 2026 · Artificial Intelligence

2026 Qwen Model Comparison: Choose the Right Qwen for Your Mac Studio

An in‑depth 2026 comparative review of Alibaba’s Qwen series (Qwen2.5, Qwen3, Qwen3.5) evaluates architecture, performance, speed and VRAM usage on Mac Studio, ranks each variant, and provides concrete model‑selection guidance for different memory configurations, highlighting the MoE‑based Qwen3.5 as the optimal choice.

AI performance · Large Language Model · Mac Studio

0 likes · 9 min read

ShiZhen AI

Mar 17, 2026 · Artificial Intelligence

Kimi’s Attention Residuals Swap a Decade-Old Residual Trick for 1.25× Faster 48B MoE

The Kimi team introduces Attention Residuals, a softmax‑based replacement for the uniform residual connections used in Transformers for a decade, enabling selective aggregation of layer histories, reducing hidden‑state growth, and achieving a 1.25× compute‑efficiency gain on a 48‑billion‑parameter MoE model with less than 2% inference latency increase.

Attention Residuals · MoE · Residual Connection

0 likes · 10 min read

Old Zhang's AI Learning

Mar 13, 2026 · Artificial Intelligence

Nvidia’s New OpenClaw‑Optimized Model Cracks Top‑5 on PinchBench – Free to Use

Nvidia’s open‑source Nemotron‑3‑Super model achieves an 85.6% success rate on the PinchBench OpenClaw benchmark, ranking in the top five (the only open‑source entry), and the article explains its architecture, quantization, training pipeline, performance numbers, usage options, and practical limitations.

AI coding agent · MoE · NVFP4

0 likes · 10 min read

Old Zhang's AI Learning

Feb 26, 2026 · Artificial Intelligence

Ultimate Guide to Local Deployment of Qwen3.5 Models (27B‑397B)

This guide reviews the Qwen3.5 model lineup, explains mixed‑inference and MoE architecture, presents benchmark comparisons with GPT‑5.2, Claude 4.5 and Gemini‑3 Pro, evaluates 4‑bit and 3‑bit quantization loss, outlines hardware requirements, and provides step‑by‑step deployment options using llama.cpp or llama‑server.

Large Language Model · MoE · Quantization

0 likes · 14 min read

Baobao Algorithm Notes

Feb 25, 2026 · Artificial Intelligence

Exploring Qwen 3.5: Small‑Scale MoE Models, Architecture, and Deployment Guides

This article reviews the three open‑source Qwen 3.5 models—including a 35B MoE, a 122B MoE, and a 27B dense version—detailing their parameter layouts, core attention designs, context length, inference performance, hardware requirements, and provides step‑by‑step code examples for loading them with Hugging Face Transformers and vLLM.

AI · Large Language Model · MoE

0 likes · 10 min read

Old Zhang's AI Learning

Feb 19, 2026 · Artificial Intelligence

Inside GLM-5: Training Techniques, Architecture Innovations, and Benchmark Performance

The article dissects GLM-5’s 744B‑parameter MoE design, 28.5 T token training corpus, novel Muon Split and MLA‑256 optimizations, DSA sparse attention, a fully asynchronous RL pipeline, extensive domestic chip adaptation, and benchmark results that place it on par with Claude Opus 4.5 and ahead of Gemini 3 Pro.

AI Architecture · Agentic RL · DSA

0 likes · 13 min read

Node.js Tech Stack

Feb 16, 2026 · Artificial Intelligence

Qwen 3.5 Launch: 17B Active Parameters Take on GPT‑5.2

Qwen 3.5, an open‑source 397B‑parameter model that activates only 17B parameters, uses a hybrid MoE‑Gated Delta architecture, offers native multimodal support and a default chain‑of‑thought mode, and achieves benchmark scores comparable to GPT‑5.2, Claude 4.5 Opus and Gemini 3 Pro across code, math, agent and vision tasks.

AI model · Gated Delta Networks · MoE

0 likes · 9 min read

Old Zhang's AI Learning

Feb 9, 2026 · Artificial Intelligence

GLM-5 Emerges First, Built on DeepSeek Tech, Triggering a 40% Stock Surge

An anonymous OpenRouter model dubbed "Pony Alpha" was verified as the new 745B‑parameter GLM-5, which reuses DeepSeek‑V3 architecture, supports sparse attention and multi‑token prediction, and has already caused a near‑40% jump in Zhipu AI’s stock while hinting at upcoming integration into the Transformers library.

DeepSeek · GLM-5 · Large Language Model

0 likes · 3 min read

Old Zhang's AI Learning

Feb 9, 2026 · Artificial Intelligence

Qwen 3.5 Emerges; ByteDance and DeepSeek Set to Release Flagship LLMs for Spring Festival

The LMSYS Chatbot Arena now shows Qwen 3.5 (codenamed Karp-001/002) alongside ByteDance's Pisces‑llm models and DeepSeek‑V4, with new Transformers configs and hints of an Active‑3B MoE architecture, suggesting a fresh wave of flagship large language models arriving for the Spring Festival.

ByteDance · DeepSeek · MoE

0 likes · 4 min read

Old Zhang's AI Learning

Feb 3, 2026 · Artificial Intelligence

Step‑3.5‑Flash: Lightning‑Fast Inference with 196B Params, Only 11B Active (vLLM)

Step‑3.5‑Flash, a 196‑billion‑parameter open‑source LLM that activates only 11 B parameters per token via a Mixture‑of‑Experts design, delivers more than 3× faster inference, matches top‑tier closed‑source models on SWE‑bench and other benchmarks, supports a 256 K context, runs on consumer‑grade hardware, and is already integrated into vLLM, SGLang, and Claude Code, though it has known token‑efficiency and domain‑stability limitations.

LLM Benchmark · MoE · Step-3.5-Flash

0 likes · 11 min read

Alibaba Cloud Developer

Jan 26, 2026 · Artificial Intelligence

How We Scaled a 3.5B MoE LLM for Real‑Time Search Relevance

This article details the engineering challenges and solutions for deploying a 3.5 billion‑parameter MoE LLM in Taobao's search relevance pipeline, covering large‑batch scheduling, dynamic load balancing, intra‑batch KV‑Cache reuse, and MoE kernel tuning to meet sub‑second latency requirements.

Inference Optimization · KV cache · LLM

0 likes · 15 min read

Data Party THU

Jan 13, 2026 · Artificial Intelligence

How Engram’s ‘Lookup‑Compute Separation’ Boosts LLM Performance

DeepSeek’s newly open‑sourced Engram module introduces a scalable lookup‑based memory that separates knowledge retrieval from computation, enabling O(1) deterministic access and significantly improving large language model performance on knowledge‑heavy, reasoning, code, and math tasks without extra FLOPs.

@Lookup · LLM · Memory Architecture

0 likes · 10 min read

AI Insight Log

Dec 18, 2025 · Artificial Intelligence

Xiaomi’s New MiMo‑V2‑Flash LLM Rivals DeepSeek‑V3.2 and Near‑GPT‑5 High

Xiaomi’s MiMo‑V2‑Flash, a 309B‑parameter MoE LLM with only 15B active parameters, uses Hybrid SWA, Multi‑Token Prediction and Multi‑Teacher On‑Policy Distillation to shrink the KV‑cache roughly sixfold, boost inference speed 2.6×, and reach performance comparable to DeepSeek‑V3.2 and Kimi‑K2 and close to GPT‑5 High, including a 73.4% SWE‑Bench code‑agent score.

Hybrid SWA · Large Language Model · MOPD

0 likes · 7 min read

Architect

Dec 15, 2025 · Artificial Intelligence

Demystifying LLM Architecture: From Transformers to Modern MoE Designs

This comprehensive guide explains the fundamentals of large language model (LLM) architectures, covering the original Transformer, tokenization, embeddings, positional encoding, attention mechanisms, feed‑forward networks, layer stacking, a step‑by‑step translation example, and the latest open‑source and hybrid LLM designs shaping the field.

Embedding · LLM · MoE

0 likes · 41 min read

Architects' Tech Alliance

Oct 24, 2025 · Artificial Intelligence

How xPU Scale‑Up Networks Are Redefining AI Training Efficiency

As AI models grow to massive scales, the demand for ultra‑high‑performance, low‑latency networking in xPU clusters intensifies, prompting a shift from dense to MoE architectures and driving the evolution of Scale‑up networks, where Alibaba Cloud’s UPN design tackles bandwidth, cost, and reliability challenges.

AI · MoE · Network

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Sep 25, 2025 · Artificial Intelligence

Unlocking Trillion‑Parameter MoE Models: Expert Parallelism and Alibaba Cloud PAI‑EAS Deployment Guide

This article explains the opportunities and challenges of Mixture of Experts (MoE) models, introduces expert parallelism as a solution to scaling and deployment bottlenecks, and provides a step‑by‑step guide for deploying MoE models with Alibaba Cloud PAI‑EAS, including configuration tips and code examples.

AI model deployment · Expert Parallelism · Large Language Model

0 likes · 11 min read

AntTech

Sep 14, 2025 · Artificial Intelligence

Ring-mini-2.0: How a 16B MoE Model Delivers 128K Context and 500+ Tokens/s

Ring-mini-2.0 is a high‑performance inference MoE model that activates only 1.4 B of its 16 B total parameters, matching the quality of dense models under 10 B while supporting a 128 K context length and generation speeds of over 300 tokens/s.

AI · Inference Optimization · MoE

0 likes · 4 min read

AntTech

Sep 13, 2025 · Artificial Intelligence

LLaDA‑MoE: The First Native MoE Diffusion Language Model Shattering Autoregressive Limits

Ant Group and Renmin University unveiled LLaDA‑MoE, the industry’s first native MoE‑based diffusion language model trained on 20 TB of data, achieving performance comparable to Qwen2.5 while delivering several‑fold faster inference, and the model will be fully open‑sourced to accelerate global AI research.

AI research · Diffusion language model · LLaDA-MoE

0 likes · 6 min read

AntTech

Sep 11, 2025 · Artificial Intelligence

Ling-mini-2.0: How a 16B MoE Model Achieves Dense-Level Performance with Only 1.4B Active Parameters

Ling-mini-2.0, an open-source 16 B MoE language model that activates only 1.4 B parameters, achieves dense-level performance with 7× efficiency, generates over 300 tokens/s, and introduces the first FP8 mixed-precision training suite, offering multiple pre-training checkpoints for the AI community.

Efficient Inference · FP8 training · MoE

0 likes · 6 min read

Baidu Intelligent Cloud Tech Hub

Sep 4, 2025 · Artificial Intelligence

Unlocking MoE Model Power: Baidu’s Baige 5.0 AI Platform’s FP8 and Distributed Innovations

Baidu’s Baige 5.0 AI Computing Platform introduces FP8 mixed‑precision training, MoE‑aware distributed strategies, adaptive parallelism, and a three‑tier KV‑Cache, delivering over 30% training speedup and 50% inference throughput gains while keeping token latency under half a second for large‑scale models.

AI · FP8 · MoE

0 likes · 16 min read

Data Party THU

Aug 11, 2025 · Artificial Intelligence

What Sets the Latest LLMs Apart? A Deep Dive into V3, OLMo, Gemma, Mistral, Llama 4 and More

This article systematically compares the architectures of recent large language models—including DeepSeek V3/R1, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen 3, SmolLM 3 and Kimi 2—highlighting innovations such as MLA, MoE, post‑norm, sliding‑window attention, NoPE and optimizer choices, with diagrams and code examples to illustrate their impact on efficiency and performance.

Comparison · LLM · MLA

0 likes · 12 min read

AI Algorithm Path

Jul 29, 2025 · Artificial Intelligence

Why GLM‑4.5 Sets a New Benchmark for Open‑Source Large Language Models

GLM‑4.5 and its lightweight Air variant, featuring a deep‑layered MoE design, grouped‑query attention, and dual inference modes, achieve third‑place overall on 12 hard‑core benchmarks, excel in web‑browsing and tool‑calling with a 90.6 % success rate, and introduce novel training tricks such as the Muon optimizer and Slime RL framework.

AI · GLM-4.5 · Large Language Model

0 likes · 8 min read

Tech Freedom Circle

Jul 17, 2025 · Artificial Intelligence

DeepSeek V3 Architecture Deep Dive: MoE, MLA, DualPipe, FP8 Mixed Precision & Multi‑Token Prediction

This article provides a detailed technical analysis of DeepSeek‑V3, covering its MoE architecture, the novel Multi‑head Latent Attention (MLA) mechanism, the DualPipe pipeline‑parallel algorithm, mixed‑precision FP8 training, and the Multi‑Token Prediction (MTP) inference improvements that together boost performance and efficiency.

DeepSeek · DualPipe · FP8

0 likes · 44 min read

21CTO

Jul 1, 2025 · Artificial Intelligence

OpenAI CEO Warns: Don’t Blindly Trust AI – Insights from New Open‑Source Models

Sam Altman cautions against over‑reliance on ChatGPT, while Germany blocks DeepSeek for GDPR violations, Tencent unveils its MoE‑based Hunyuan‑A13B model, and Google releases a Python client for Data Commons, highlighting both AI risks and rapid open‑source advancements.

AI safety · Data Commons · MoE

0 likes · 9 min read

DataFunTalk

Jun 30, 2025 · Artificial Intelligence

Wenxin 4.5 Series: Open‑Source Multimodal MoE Models and FastDeploy Guide

The Wenxin 4.5 series introduces ten open‑source models—including large‑scale MoE and dense variants—featuring a novel multimodal heterogeneous architecture, high training efficiency, SOTA benchmark performance, and comprehensive toolkits (ERNIEKit, FastDeploy) for fine‑tuning and multi‑hardware deployment.

ERNIEKit · FastDeploy · MoE

0 likes · 8 min read

Smart Era Software Development

May 30, 2025 · Artificial Intelligence

How Tencent’s TRMT Boosted DeepSeek’s Communication: A Chinese Open‑Source Success

Tencent’s Star‑Network team partnered with DeepSeek to open‑source the DeepEP communication library, then used its self‑developed TRMT stack to overcome RoCE limitations, achieving up to 100% speedup on RoCEv2 and 30% on InfiniBand, cutting training costs and inference latency for large MoE models.

AI training · DeepEP · DeepSeek

0 likes · 8 min read

AI Algorithm Path

May 9, 2025 · Artificial Intelligence

A Visual Guide to Mixture of Experts (MoE) Architecture in Large Language Models

This article explains the Mixture of Experts (MoE) technique used in modern LLMs, detailing its core components—experts and router—comparing dense and sparse layers, describing load‑balancing, expert capacity, and routing strategies, and showcasing real‑world examples such as Switch Transformer, Vision‑MoE, and Mixtral 8x7B.

Expert Capacity · LLM · Mixture of Experts

0 likes · 15 min read

AI2ML AI to Machine Learning

Apr 17, 2025 · Artificial Intelligence

Inside Qwen: A Deep Dive into the Large Model’s Source Code

The article provides a comprehensive technical walkthrough of Qwen’s large‑model series, covering data preparation, tokenization, model tweaks, training settings, RLHF pipeline, Code‑Qwen specifics, Qwen2 and Qwen3 architectural changes, scaling‑law experiments, and detailed source‑code analysis with illustrative diagrams.

Large Language Model · MoE · Qwen

0 likes · 7 min read

Architect

Mar 10, 2025 · Artificial Intelligence

What Makes DeepSeek’s New Architecture a Game‑Changer? Inside MLA, GRPO, and MoE Innovations

This article analyzes DeepSeek’s latest large‑model breakthroughs, covering the MLA attention compression, GRPO alignment algorithm, MoE load‑balancing redesign, multi‑stage training pipelines, reinforcement‑learning tricks, and performance comparisons with GPT‑4o‑Mini and Llama 3.1, highlighting both strengths and remaining challenges.

AI training · DeepSeek · GRPO

0 likes · 19 min read

Architect

Mar 2, 2025 · Artificial Intelligence

Demystifying Mixture of Experts: How MoE Boosts LLMs and Vision Models

This article explains the Mixture of Experts (MoE) architecture, detailing experts, routers, dense vs. sparse layers, load‑balancing strategies such as KeepTopK, auxiliary loss, capacity constraints, the Switch Transformer simplification, and how MoE is applied to both language and vision models, illustrated with concrete examples and parameter counts.

Mixture of Experts · MoE · Switch Transformer

0 likes · 17 min read

Architects' Tech Alliance

Feb 27, 2025 · Artificial Intelligence

How Inspur Metabrain R1 Server Enables 1000+ Concurrent Users for DeepSeek 671B via SGLang Optimization

The Inspur Metabrain R1 inference server, equipped with FP8 acceleration and a 1128 GB HBM3e memory pool, has been tightly integrated with SGLang 0.4.3 to run the 671‑billion‑parameter DeepSeek R1 model, delivering over 1,000 concurrent user sessions and up to 3,976 tokens/s throughput.

AI server · DeepSeek · Inference Optimization

0 likes · 5 min read

DataFunTalk

Feb 26, 2025 · Artificial Intelligence

DeepGEMM: An Open‑Source FP8 GEMM Library for Efficient AI Model Training and Inference

DeepGEMM is an open‑source FP8‑precision GEMM library that delivers up to 1350 TFLOPS on NVIDIA Hopper GPUs, offering JIT‑compiled, lightweight code (~300 lines) for dense and MoE matrix multiplication, with easy deployment, configurable environment variables, and performance advantages over CUTLASS for large AI models.

AI acceleration · DeepGEMM · FP8

0 likes · 7 min read

Architect

Feb 16, 2025 · Artificial Intelligence

DeepSeek-V3, DeepSeek-R1, and Janus‑Pro: Architecture, Training Techniques, and Performance Insights

This article provides an in‑depth technical overview of DeepSeek‑V3, DeepSeek‑R1 and Janus‑Pro models, covering their Mixture‑of‑Experts architecture, novel MLA attention, auxiliary‑loss‑free load balancing, multi‑token prediction, FP8 mixed‑precision training, efficient cross‑node communication, reinforcement‑learning pipelines, multimodal modeling strategies, performance comparisons, cost statistics, and current limitations.

AI Architecture · DeepSeek-V3 · FP8 training

0 likes · 18 min read

Java Captain

Feb 7, 2025 · Artificial Intelligence

DeepSeek: Disruptive Innovations in Large Language Model Architecture, Efficiency, and Ecosystem

DeepSeek reshapes the AI landscape by replacing brute‑force compute scaling with algorithmic breakthroughs such as a novel MoE architecture, memory compression, active‑learning data pipelines, and open‑source tooling, delivering dramatically lower training and inference costs while enabling edge deployment and a vibrant developer ecosystem.

Algorithmic Efficiency · DeepSeek · Edge deployment

0 likes · 11 min read

CSS Magic

May 13, 2024 · Artificial Intelligence

DeepSeek: China’s New LLM Dark Horse – First Impressions and Shockingly Low Prices

The article evaluates DeepSeek‑V2, a 236‑billion‑parameter MoE model (21 billion parameters activated per token), highlighting its near‑GPT‑4 benchmark performance, OpenAI‑compatible API, 32k‑token context, exceptionally low pricing, a custom token‑utilization metric, and the practical drawbacks observed during hands‑on testing.

API compatibility · DeepSeek · Large Language Model

0 likes · 9 min read

Baobao Algorithm Notes

Mar 28, 2024 · Artificial Intelligence

How Qwen1.5‑MoE‑A2.7B Matches 7B LLM Performance with Only 2.7B Activated Parameters

Qwen1.5‑MoE‑A2.7B is a Mixture‑of‑Experts model with only 2.7 billion activated parameters that delivers performance comparable to leading 7 billion‑parameter LLMs while cutting training cost by 75% and boosting inference speed by 1.74×; the article details its architecture, benchmarks, efficiency analysis, and deployment steps.

Large Language Model · MoE · Model Benchmark

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Mar 26, 2024 · Artificial Intelligence

MoE LLMs: How Alibaba Cloud & NVIDIA Megatron-Core Accelerate Training

This article reviews the evolution of Mixture-of-Experts (MoE) models, details Alibaba Cloud’s collaboration with NVIDIA’s Megatron-Core to build a high-performance MoE framework, and presents extensive training optimizations, benchmark results, conversion tools, and best-practice guidelines for large-scale LLM development and deployment.

Alibaba Cloud · Megatron-Core · MoE

0 likes · 18 min read

Java Tech Enthusiast

Feb 16, 2024 · Artificial Intelligence

Google's Gemini 1.5: Breakthrough in Long-Context Understanding and Multimodal Capabilities

Google’s Gemini 1.5, a new multimodal Mixture‑of‑Experts model, supports up to a million‑token context (10 million internally), can understand text, video, audio and code, learns a new language from a single prompt, and is already being used by Samsung, Jasper and Quora, positioning it as a direct challenger to OpenAI’s flagship models.

Gemini 1.5 · Google AI · LLM

0 likes · 7 min read

DataFunTalk

Aug 12, 2022 · Artificial Intelligence

Multi‑Task Learning for Sample Selection Bias in Financial Risk Control

This article presents a comprehensive study on addressing sample selection bias in credit risk modeling by applying multi‑task learning techniques, including MoE/MMoE, ESMM, hierarchical attention, and semi‑supervised loss, and demonstrates their effectiveness through two real‑world application cases and experimental results.

MoE · financial AI · risk control

0 likes · 14 min read
