Tagged articles
53 articles
Page 1 of 1
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
May 16, 2026 · Artificial Intelligence

Token Superposition Training Accelerates LLM Pre‑training 2.5× Without Changing Architecture

Token Superposition Training (TST) speeds up large‑language‑model pre‑training by up to 2.5× without altering model architecture or compute budget, using a superposition phase that averages token embeddings into bags and predicts groups of tokens, followed by a standard recovery phase, as demonstrated on 10B‑parameter MoE and smaller models.

LLM PretrainingMCE LossMoE
0 likes · 10 min read
Token Superposition Training Accelerates LLM Pre‑training 2.5× Without Changing Architecture
SuanNi
SuanNi
May 12, 2026 · Artificial Intelligence

AntAngelMed: 6.1B‑Activated MoE Model Tops Three Medical Benchmarks

AntAngelMed, a 100‑billion‑parameter medical LLM using a 6.1 billion‑parameter MoE architecture, achieves performance comparable to a 40 billion‑parameter dense model, exceeds 200 tokens/s inference speed, and ranks first on HealthBench, MedAIBench and MedBench, with a three‑stage training pipeline and extensive efficiency optimizations.

HealthBenchMedAIBenchMedBench
0 likes · 6 min read
AntAngelMed: 6.1B‑Activated MoE Model Tops Three Medical Benchmarks
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
May 6, 2026 · Artificial Intelligence

Why DeepSeek‑V4’s MFU Drops: Parallel Strategies and Compute‑Communication Overlap

The article dissects DeepSeek‑V4’s shift from dense to MoE models, explains why MFU plummets despite sufficient expert dimensions, and details how a carefully designed GPU parallel strategy—combining DP, ZeRO‑1, PP, EP and the new Waved‑EP kernel—overlaps communication and computation to reclaim throughput on 8‑card NVLink nodes linked by InfiniBand.

DeepSeek-V4Expert ParallelGPU Distributed Training
0 likes · 19 min read
Why DeepSeek‑V4’s MFU Drops: Parallel Strategies and Compute‑Communication Overlap
Architects' Tech Alliance
Architects' Tech Alliance
May 4, 2026 · Artificial Intelligence

DeepSeek‑V4 Inference Cost Showdown: NVIDIA H100 vs Ascend 950PR vs 910C

DeepSeek‑V4, a 1.6‑trillion‑parameter MoE model with mixed‑precision attention, is benchmarked on three accelerators—NVIDIA H100, Huawei Ascend 910C, and Ascend 950PR—showing that the 950PR delivers the lowest per‑token cost in both Prefill and Decode phases, while the H100 offers the highest raw performance at a far greater price.

DeepSeek-V4FP8Huawei Ascend 950PR
0 likes · 8 min read
DeepSeek‑V4 Inference Cost Showdown: NVIDIA H100 vs Ascend 950PR vs 910C
Architects' Tech Alliance
Architects' Tech Alliance
May 2, 2026 · Artificial Intelligence

Eight Chinese AI Chips Achieve Day‑Zero DeepSeek‑V4 Compatibility

The article explains how eight domestic AI chip makers—Huawei Ascend, Cambricon, HaiGuang, Moore Threads, Kunlun, Pingtouge, Muxi, and Tianshu—simultaneously completed full‑link compatibility, performance tuning, and stability verification for DeepSeek‑V4 on release day, detailing each vendor’s technical path, shared ecosystem breakthroughs, and the broader impact on the AI industry.

AI chipsDay0 adaptationDeepSeek-V4
0 likes · 11 min read
Eight Chinese AI Chips Achieve Day‑Zero DeepSeek‑V4 Compatibility
Architects' Tech Alliance
Architects' Tech Alliance
May 1, 2026 · Artificial Intelligence

How DeepSeek V4 Triggers a Global AI Price War with OpenAI

DeepSeek V4’s open‑source 1 M‑token MoE model delivers benchmark scores of MMLU 88.7, C‑Eval 92.1 and HumanEval 69.5, while its 4‑bit AWQ quantization, PagedAttention memory management and FlashAttention acceleration cut inference costs and latency, prompting rivals such as Anthropic, OpenAI, Baidu and Huawei to slash prices and boost efficiency in a fierce market battle.

AI efficiencyDeepSeek-V4MoE
0 likes · 9 min read
How DeepSeek V4 Triggers a Global AI Price War with OpenAI
DataFunTalk
DataFunTalk
Apr 28, 2026 · Artificial Intelligence

Manifold AI’s WorldScape 0.2 Tops WorldArena: How MoE Drives Superior Physics and 3D Understanding

Manifold AI’s WorldScape 0.2 achieved the highest overall score on the embodied world‑model benchmark WorldArena, outperforming giants like Google and Nvidia by excelling in comprehensive perception, physics compliance, and 3D accuracy while using only about 10 % of the parameters of competing models, thanks to a newly introduced MoE architecture.

BenchmarkEmbodied AIMoE
0 likes · 9 min read
Manifold AI’s WorldScape 0.2 Tops WorldArena: How MoE Drives Superior Physics and 3D Understanding
DeepHub IMBA
DeepHub IMBA
Apr 27, 2026 · Artificial Intelligence

DeepSeek‑V4 Deep Dive: Engineering Million‑Token Context Efficiency

The article provides a thorough technical analysis of DeepSeek‑V4, detailing how mixed sparse attention (CSA + HCA), manifold‑constrained hyper‑connections, the Muon optimizer, FP4 quantization, and a suite of infrastructure tricks enable stable training and inference with up to one‑million token contexts while achieving state‑of‑the‑art benchmark results.

CSADeepSeek-V4FP4 quantization
0 likes · 22 min read
DeepSeek‑V4 Deep Dive: Engineering Million‑Token Context Efficiency
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Apr 25, 2026 · Artificial Intelligence

Why DeepSeek‑V4 Took Twice as Long: Inside the Training‑Stability Challenges and Engineering Hacks

The DeepSeek‑V4 technical report reveals that the model’s doubled training time stems from massive token and parameter scaling, severe training‑stability issues in MoE layers, and a suite of engineering solutions—including Anticipatory Routing, SwiGLU Clamping, specialist expert training, and a custom sandbox cluster—while also exposing high hallucination rates despite impressive benchmark performance.

BenchmarkDeepSeek-V4Generative Reward Model
0 likes · 12 min read
Why DeepSeek‑V4 Took Twice as Long: Inside the Training‑Stability Challenges and Engineering Hacks
Architects' Tech Alliance
Architects' Tech Alliance
Apr 24, 2026 · Artificial Intelligence

DeepSeek V4 Launches with 1M‑Token Context, Dual Versions and Native Chinese Chip Support

On April 24, 2026 DeepSeek released the V4 preview featuring two models—V4‑Pro with a 1.6 T‑parameter MoE architecture and V4‑Flash with 284 B parameters—both offering 1 million token context, up to 384 K output tokens, new step‑wise reasoning modes, and full native compatibility with Huawei Ascend and Cambricon chips, while delivering major efficiency gains and benchmark‑leading performance.

1M token contextCambriconDeepSeek
0 likes · 7 min read
DeepSeek V4 Launches with 1M‑Token Context, Dual Versions and Native Chinese Chip Support
Old Zhang's AI Learning
Old Zhang's AI Learning
Apr 23, 2026 · Artificial Intelligence

DeepSeek Quietly Open‑Sources TileKernels to Push GPU Performance to Its Limits

DeepSeek has released TileKernels, a GPU kernel library written in the TileLang DSL, that targets H100/H200/B200 GPUs and claims to approach hardware limits in compute intensity and memory bandwidth, offering MoE routing, FP8/FP4 quantization, and dual‑language PyTorch references for deep‑learning engineers.

FP8 quantizationGPU OptimizationLLM training
0 likes · 9 min read
DeepSeek Quietly Open‑Sources TileKernels to Push GPU Performance to Its Limits
PaperAgent
PaperAgent
Apr 22, 2026 · Artificial Intelligence

Alibaba Unveils Four New Open‑Source Qwen3.6 Models: 27B Dense and 35B‑A3B MoE

Alibaba has added four new open‑source weight versions to its Qwen3.6 series, featuring the 27‑billion‑parameter dense multimodal model Qwen3.6‑27B and the 35‑billion‑parameter sparse expert model Qwen3.6‑35B‑A3B, both designed for stable, real‑world coding tasks and outperforming their Qwen3.5 predecessors.

AI agentsAlibabaDense Model
0 likes · 4 min read
Alibaba Unveils Four New Open‑Source Qwen3.6 Models: 27B Dense and 35B‑A3B MoE
PaperAgent
PaperAgent
Apr 21, 2026 · Artificial Intelligence

OpenMythos: Rebuilding Claude Mythos with Recursive Transformers and MoE

OpenMythos is an open‑source PyTorch reimplementation of Anthropic's Claude Mythos that uses a mixed‑expert routed recurrent Transformer, introduces Recursive Depth Transformers, Multi‑Latent Attention, and several stability mechanisms, and demonstrates parameter‑efficient scaling backed by empirical studies.

AI ArchitectureClaude MythosMoE
0 likes · 6 min read
OpenMythos: Rebuilding Claude Mythos with Recursive Transformers and MoE
HyperAI Super Neural
HyperAI Super Neural
Apr 21, 2026 · Artificial Intelligence

Qwen3.6-35B-A3B Boosts Agent Programming: 3B Activation Beats Gemma4-31B

Qwen3.6-35B-A3B, the first open‑source Qwen3.6 model, achieves markedly better scores than Qwen3.5‑35B‑A3B and Gemma4‑31B on Terminal‑Bench2.0, NL2Repo, and QwenClawBench, adds a thought‑process retention option, and is accessible via HyperAI’s ready‑to‑run notebook with free compute credits.

Agent ProgrammingBenchmarkHyperAI
0 likes · 4 min read
Qwen3.6-35B-A3B Boosts Agent Programming: 3B Activation Beats Gemma4-31B
Lao Guo's Learning Space
Lao Guo's Learning Space
Apr 8, 2026 · Artificial Intelligence

2026 Qwen Model Comparison: Choose the Right Qwen for Your Mac Studio

An in‑depth 2026 comparative review of Alibaba’s Qwen series (Qwen2.5, Qwen3, Qwen3.5) evaluates architecture, performance, speed and VRAM usage on Mac Studio, ranks each variant, and provides concrete model‑selection guidance for different memory configurations, highlighting the MoE‑based Qwen3.5 as the optimal choice.

AI PerformanceMac StudioMoE
0 likes · 9 min read
2026 Qwen Model Comparison: Choose the Right Qwen for Your Mac Studio
ShiZhen AI
ShiZhen AI
Mar 17, 2026 · Artificial Intelligence

Kimi’s Attention Residuals Swap a Decade-Old Residual Trick for 1.25× Faster 48B MoE

The Kimi team introduces Attention Residuals, a softmax‑based replacement for the uniform residual connections used in Transformers for a decade, enabling selective aggregation of layer histories, reducing hidden‑state growth, and achieving a 1.25× compute‑efficiency gain on a 48‑billion‑parameter MoE model with less than 2% inference latency increase.

Attention ResidualsCompute EfficiencyDeep Learning
0 likes · 10 min read
Kimi’s Attention Residuals Swap a Decade-Old Residual Trick for 1.25× Faster 48B MoE
Old Zhang's AI Learning
Old Zhang's AI Learning
Mar 13, 2026 · Artificial Intelligence

Nvidia’s New OpenClaw‑Optimized Model Cracks Top‑5 on PinchBench – Free to Use

Nvidia’s open‑source Nemotron‑3‑Super model achieves an 85.6% success rate on the PinchBench OpenClaw benchmark, ranking in the top five (the only open‑source entry), and the article explains its architecture, quantization, training pipeline, performance numbers, usage options, and practical limitations.

AI coding agentMoENVFP4
0 likes · 10 min read
Nvidia’s New OpenClaw‑Optimized Model Cracks Top‑5 on PinchBench – Free to Use
Old Zhang's AI Learning
Old Zhang's AI Learning
Feb 26, 2026 · Artificial Intelligence

Ultimate Guide to Local Deployment of Qwen3.5 Models (27B‑397B)

This guide reviews the Qwen3.5 model lineup, explains mixed‑inference and MoE architecture, presents benchmark comparisons with GPT‑5.2, Claude 4.5 and Gemini‑3 Pro, evaluates 4‑bit and 3‑bit quantization loss, outlines hardware requirements, and provides step‑by‑step deployment options using llama.cpp or llama‑server.

InferenceMoElarge language model
0 likes · 14 min read
Ultimate Guide to Local Deployment of Qwen3.5 Models (27B‑397B)
Baobao Algorithm Notes
Baobao Algorithm Notes
Feb 25, 2026 · Artificial Intelligence

Exploring Qwen 3.5: Small‑Scale MoE Models, Architecture, and Deployment Guides

This article reviews the three open‑source Qwen 3.5 models—including a 35B MoE, a 122B MoE, and a 27B dense version—detailing their parameter layouts, core attention designs, context length, inference performance, hardware requirements, and provides step‑by‑step code examples for loading them with Hugging Face Transformers and vLLM.

AIMoEModel Deployment
0 likes · 10 min read
Exploring Qwen 3.5: Small‑Scale MoE Models, Architecture, and Deployment Guides
Old Zhang's AI Learning
Old Zhang's AI Learning
Feb 19, 2026 · Artificial Intelligence

Inside GLM-5: Training Techniques, Architecture Innovations, and Benchmark Performance

The article dissects GLM-5’s 744B‑parameter MoE design, 28.5 T token training corpus, novel Muon Split and MLA‑256 optimizations, DSA sparse attention, a fully asynchronous RL pipeline, extensive domestic chip adaptation, and benchmark results that place it on par with Claude Opus 4.5 and ahead of Gemini 3 Pro.

AI ArchitectureBenchmarkDSA
0 likes · 13 min read
Inside GLM-5: Training Techniques, Architecture Innovations, and Benchmark Performance
Node.js Tech Stack
Node.js Tech Stack
Feb 16, 2026 · Artificial Intelligence

Qwen 3.5 Launch: 17B Active Parameters Take on GPT‑5.2

Qwen 3.5, an open‑source 397B‑parameter model that activates only 17B parameters, uses a hybrid MoE‑Gated Delta architecture, offers native multimodal support and a default chain‑of‑thought mode, and achieves benchmark scores comparable to GPT‑5.2, Claude 4.5 Opus and Gemini 3 Pro across code, math, agent and vision tasks.

AI modelBenchmarkGated Delta Networks
0 likes · 9 min read
Qwen 3.5 Launch: 17B Active Parameters Take on GPT‑5.2
Old Zhang's AI Learning
Old Zhang's AI Learning
Feb 9, 2026 · Artificial Intelligence

GLM-5 Emerges First, Built on DeepSeek Tech, Triggering a 40% Stock Surge

An anonymous OpenRouter model dubbed "Pony Alpha" was verified as the new 745B‑parameter GLM-5, which reuses DeepSeek‑V3 architecture, supports sparse attention and multi‑token prediction, and has already caused a near‑40% jump in Zhipu AI’s stock while hinting at upcoming integration into the Transformers library.

DeepSeekGLM-5MoE
0 likes · 3 min read
GLM-5 Emerges First, Built on DeepSeek Tech, Triggering a 40% Stock Surge
Old Zhang's AI Learning
Old Zhang's AI Learning
Feb 3, 2026 · Artificial Intelligence

Step‑3.5‑Flash: Lightning‑Fast Inference with 196B Params, Only 11B Active (vLLM)

Step‑3.5‑Flash, a 196‑billion‑parameter open‑source LLM that activates only 11 B per token via a Mixture‑of‑Experts design, delivers 3‑plus‑times faster inference, matches top‑tier closed‑source models on SWE‑bench and other benchmarks, supports 256 K context, runs on consumer‑grade hardware, and is already integrated into vLLM, SGLang, and Claude Code, though it has known token‑efficiency and domain‑stability limitations.

LLM BenchmarkMoEMulti-token Prediction
0 likes · 11 min read
Step‑3.5‑Flash: Lightning‑Fast Inference with 196B Params, Only 11B Active (vLLM)
Alibaba Cloud Developer
Alibaba Cloud Developer
Jan 26, 2026 · Artificial Intelligence

How We Scaled a 3.5B MoE LLM for Real‑Time Search Relevance

This article details the engineering challenges and solutions for deploying a 3.5 billion‑parameter MoE LLM in Taobao's search relevance pipeline, covering large‑batch scheduling, dynamic load balancing, intra‑batch KV‑Cache reuse, and MoE kernel tuning to meet sub‑second latency requirements.

Inference OptimizationKV cacheLLM
0 likes · 15 min read
How We Scaled a 3.5B MoE LLM for Real‑Time Search Relevance
Data Party THU
Data Party THU
Jan 13, 2026 · Artificial Intelligence

How Engram’s ‘Lookup‑Compute Separation’ Boosts LLM Performance

DeepSeek’s newly open‑sourced Engram module introduces a scalable lookup‑based memory that separates knowledge retrieval from computation, enabling O(1) deterministic access and significantly improving large language model performance on knowledge‑heavy, reasoning, code, and math tasks without extra FLOPs.

LLMLookupMemory Architecture
0 likes · 10 min read
How Engram’s ‘Lookup‑Compute Separation’ Boosts LLM Performance
AI Insight Log
AI Insight Log
Dec 18, 2025 · Artificial Intelligence

Xiaomi’s New MiMo‑V2‑Flash LLM Rivals DeepSeek‑V3.2 and Near‑GPT‑5 High

Xiaomi’s MiMo‑V2‑Flash, a 309B‑parameter MoE LLM with only 15B active weights, uses Hybrid SWA, Multi‑Token Prediction and Multi‑Teacher On‑Policy Distillation to cut KV‑cache by six times, boost inference speed 2.6×, and achieve performance comparable to DeepSeek‑V3.2, Kimi‑K2 and near‑GPT‑5 High, including a 73.4% SWE‑Bench code‑agent score.

Hybrid SWAMOPDMTP
0 likes · 7 min read
Xiaomi’s New MiMo‑V2‑Flash LLM Rivals DeepSeek‑V3.2 and Near‑GPT‑5 High
Architect
Architect
Dec 15, 2025 · Artificial Intelligence

Demystifying LLM Architecture: From Transformers to Modern MoE Designs

This comprehensive guide explains the fundamentals of large language model (LLM) architectures, covering the original Transformer, tokenization, embeddings, positional encoding, attention mechanisms, feed‑forward networks, layer stacking, a step‑by‑step translation example, and the latest open‑source and hybrid LLM designs shaping the field.

EmbeddingLLMMoE
0 likes · 41 min read
Demystifying LLM Architecture: From Transformers to Modern MoE Designs
Architects' Tech Alliance
Architects' Tech Alliance
Oct 24, 2025 · Artificial Intelligence

How xPU Scale‑Up Networks Are Redefining AI Training Efficiency

As AI models grow to massive scales, the demand for ultra‑high‑performance, low‑latency networking in xPU clusters intensifies, prompting a shift from dense to MoE architectures and driving the evolution of Scale‑up networks, where Alibaba Cloud’s UPN design tackles bandwidth, cost, and reliability challenges.

AIMoEScale‑Up
0 likes · 13 min read
How xPU Scale‑Up Networks Are Redefining AI Training Efficiency
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Sep 25, 2025 · Artificial Intelligence

Unlocking Trillion‑Parameter MoE Models: Expert Parallelism and Alibaba Cloud PAI‑EAS Deployment Guide

This article explains the opportunities and challenges of Mixture of Experts (MoE) models, introduces expert parallelism as a solution to scaling and deployment bottlenecks, and provides a step‑by‑step guide for deploying MoE models with Alibaba Cloud PAI‑EAS, including configuration tips and code examples.

AI Model DeploymentExpert ParallelismMoE
0 likes · 11 min read
Unlocking Trillion‑Parameter MoE Models: Expert Parallelism and Alibaba Cloud PAI‑EAS Deployment Guide
AntTech
AntTech
Sep 14, 2025 · Artificial Intelligence

Ring-mini-2.0: How a 16B MoE Model Delivers 128K Context and 500+ Tokens/s

Ring-mini-2.0 is a high‑performance inference MoE model that activates only 1.4 B parameters out of 16 B total, achieving dense‑model quality below 10 B while supporting 128 K context length and ultra‑fast generation speeds of over 300 tokens/s.

AIInference OptimizationMoE
0 likes · 4 min read
Ring-mini-2.0: How a 16B MoE Model Delivers 128K Context and 500+ Tokens/s
AntTech
AntTech
Sep 13, 2025 · Artificial Intelligence

LLaDA‑MoE: The First Native MoE Diffusion Language Model Shattering Autoregressive Limits

Ant Group and Renmin University unveiled LLaDA‑MoE, the industry’s first native MoE‑based diffusion language model trained on 20 TB of data, achieving performance comparable to Qwen2.5 while delivering several‑fold faster inference, and the model will be fully open‑sourced to accelerate global AI research.

AI researchDiffusion Language ModelLLaDA-MoE
0 likes · 6 min read
LLaDA‑MoE: The First Native MoE Diffusion Language Model Shattering Autoregressive Limits
AntTech
AntTech
Sep 11, 2025 · Artificial Intelligence

Ling-mini-2.0: How a 16B MoE Model Achieves Dense-Level Performance with Only 1.4B Active Parameters

Ling-mini-2.0, an open-source 16 B MoE language model that activates only 1.4 B parameters, achieves dense-level performance with 7× efficiency, generates over 300 tokens / s, and introduces the first FP8 mixed-precision training suite, offering multiple pre-training checkpoints for the AI community.

FP8 trainingMoEefficient inference
0 likes · 6 min read
Ling-mini-2.0: How a 16B MoE Model Achieves Dense-Level Performance with Only 1.4B Active Parameters
Baidu Intelligent Cloud Tech Hub
Baidu Intelligent Cloud Tech Hub
Sep 4, 2025 · Artificial Intelligence

Unlocking MoE Model Power: Baidu’s Baige 5.0 AI Platform’s FP8 and Distributed Innovations

Baidu’s Baige 5.0 AI Computing Platform introduces FP8 mixed‑precision training, MoE‑aware distributed strategies, adaptive parallelism, and a three‑tier KV‑Cache, delivering over 30% training speedup and 50% inference throughput gains while keeping token latency under half a second for large‑scale models.

AIFP8Inference
0 likes · 16 min read
Unlocking MoE Model Power: Baidu’s Baige 5.0 AI Platform’s FP8 and Distributed Innovations
Data Party THU
Data Party THU
Aug 11, 2025 · Artificial Intelligence

What Sets the Latest LLMs Apart? A Deep Dive into V3, OLMo, Gemma, Mistral, Llama 4 and More

This article systematically compares the architectures of recent large language models—including DeepSeek V3/R1, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen 3, SmolLM 3 and Kimi 2—highlighting innovations such as MLA, MoE, post‑norm, sliding‑window attention, NoPE and optimizer choices, with diagrams and code examples to illustrate their impact on efficiency and performance.

ComparisonLLMMLA
0 likes · 12 min read
What Sets the Latest LLMs Apart? A Deep Dive into V3, OLMo, Gemma, Mistral, Llama 4 and More
AI Algorithm Path
AI Algorithm Path
Jul 29, 2025 · Artificial Intelligence

Why GLM‑4.5 Sets a New Benchmark for Open‑Source Large Language Models

GLM‑4.5 and its lightweight Air variant, featuring a deep‑layered MoE design, grouped‑query attention, and dual inference modes, achieve third‑place overall on 12 hard‑core benchmarks, excel in web‑browsing and tool‑calling with a 90.6 % success rate, and introduce novel training tricks such as the Muon optimizer and Slime RL framework.

AIBenchmarkGLM-4.5
0 likes · 8 min read
Why GLM‑4.5 Sets a New Benchmark for Open‑Source Large Language Models
Tech Freedom Circle
Tech Freedom Circle
Jul 17, 2025 · Artificial Intelligence

DeepSeek V3 Architecture Deep Dive: MoE, MLA, DualPipe, FP8 Mixed Precision & Multi‑Token Prediction

This article provides a detailed technical analysis of DeepSeek‑V3, covering its MOE architecture, the novel Multi‑head Latent Attention (MLA) mechanism, the DualPipe pipeline‑parallel algorithm, mixed‑precision FP8 training, and the Multi‑Token Prediction (MTP) inference improvements that together boost performance and efficiency.

DeepSeekDistributed TrainingDualPipe
0 likes · 44 min read
DeepSeek V3 Architecture Deep Dive: MoE, MLA, DualPipe, FP8 Mixed Precision & Multi‑Token Prediction
21CTO
21CTO
Jul 1, 2025 · Artificial Intelligence

OpenAI CEO Warns: Don’t Blindly Trust AI – Insights from New Open‑Source Models

Sam Altman cautions against over‑reliance on ChatGPT, while Germany blocks DeepSeek for GDPR violations, Tencent unveils its MoE‑based Hunyuan‑A13B model, and Google releases a Python client for Data Commons, highlighting both AI risks and rapid open‑source advancements.

AI SafetyData CommonsMoE
0 likes · 9 min read
OpenAI CEO Warns: Don’t Blindly Trust AI – Insights from New Open‑Source Models
DataFunTalk
DataFunTalk
Jun 30, 2025 · Artificial Intelligence

Wenxin 4.5 Series: Open‑Source Multimodal MoE Models and FastDeploy Guide

The Wenxin 4.5 series introduces ten open‑source models—including large‑scale MoE and dense variants—featuring a novel multimodal heterogeneous architecture, high training efficiency, SOTA benchmark performance, and comprehensive toolkits (ERNIEKit, FastDeploy) for fine‑tuning and multi‑hardware deployment.

ERNIEKitFastDeployMoE
0 likes · 8 min read
Wenxin 4.5 Series: Open‑Source Multimodal MoE Models and FastDeploy Guide
AI Algorithm Path
AI Algorithm Path
May 9, 2025 · Artificial Intelligence

A Visual Guide to Mixture of Experts (MoE) Architecture in Large Language Models

This article explains the Mixture of Experts (MoE) technique used in modern LLMs, detailing its core components—experts and router—comparing dense and sparse layers, describing load‑balancing, expert capacity, and routing strategies, and showcasing real‑world examples such as Switch Transformer, Vision‑MoE, and Mixtral 8x7B.

Expert CapacityLLMMixture of Experts
0 likes · 15 min read
A Visual Guide to Mixture of Experts (MoE) Architecture in Large Language Models
AI2ML AI to Machine Learning
AI2ML AI to Machine Learning
Apr 17, 2025 · Artificial Intelligence

Inside Qwen: A Deep Dive into the Large Model’s Source Code

The article provides a comprehensive technical walkthrough of Qwen’s large‑model series, covering data preparation, tokenization, model tweaks, training settings, RLHF pipeline, Code‑Qwen specifics, Qwen2 and Qwen3 architectural changes, scaling‑law experiments, and detailed source‑code analysis with illustrative diagrams.

MoEModel architectureQwen
0 likes · 7 min read
Inside Qwen: A Deep Dive into the Large Model’s Source Code
Architect
Architect
Mar 10, 2025 · Artificial Intelligence

What Makes DeepSeek’s New Architecture a Game‑Changer? Inside MLA, GRPO, and MoE Innovations

This article analyzes DeepSeek’s latest large‑model breakthroughs, covering the MLA attention compression, GRPO alignment algorithm, MoE load‑balancing redesign, multi‑stage training pipelines, reinforcement‑learning tricks, and performance comparisons with GPT‑4o‑Mini and Llama 3.1, highlighting both strengths and remaining challenges.

AI trainingDeepSeekGRPO
0 likes · 19 min read
What Makes DeepSeek’s New Architecture a Game‑Changer? Inside MLA, GRPO, and MoE Innovations
Architect
Architect
Mar 2, 2025 · Artificial Intelligence

Demystifying Mixture of Experts: How MoE Boosts LLMs and Vision Models

This article explains the Mixture of Experts (MoE) architecture, detailing experts, routers, dense vs. sparse layers, load‑balancing strategies such as KeepTopK, auxiliary loss, capacity constraints, the Switch Transformer simplification, and how MoE is applied to both language and vision models, illustrated with concrete examples and parameter counts.

Mixture of ExpertsMoESparse Models
0 likes · 17 min read
Demystifying Mixture of Experts: How MoE Boosts LLMs and Vision Models
Architects' Tech Alliance
Architects' Tech Alliance
Feb 27, 2025 · Artificial Intelligence

How Inspur Metabrain R1 Server Enables 1000+ Concurrent Users for DeepSeek 671B via SGLang Optimization

The Inspur Metabrain R1 inference server, equipped with FP8 acceleration and a 1128 GB HBM3e memory pool, has been tightly integrated with SGLang 0.4.3 to run the 671‑billion‑parameter DeepSeek R1 model, delivering over 1,000 concurrent user sessions and up to 3,976 tokens/s throughput.

AI serverDeepSeekInference Optimization
0 likes · 5 min read
How Inspur Metabrain R1 Server Enables 1000+ Concurrent Users for DeepSeek 671B via SGLang Optimization
DataFunTalk
DataFunTalk
Feb 26, 2025 · Artificial Intelligence

DeepGEMM: An Open‑Source FP8 GEMM Library for Efficient AI Model Training and Inference

DeepGEMM is an open‑source FP8‑precision GEMM library that delivers up to 1350 TFLOPS on NVIDIA Hopper GPUs, offering JIT‑compiled, lightweight code (~300 lines) for dense and MoE matrix multiplication, with easy deployment, configurable environment variables, and performance advantages over CUTLASS for large AI models.

AI accelerationDeepGEMMFP8
0 likes · 7 min read
DeepGEMM: An Open‑Source FP8 GEMM Library for Efficient AI Model Training and Inference
Architect
Architect
Feb 16, 2025 · Artificial Intelligence

DeepSeek-V3, DeepSeek-R1, and Janus‑Pro: Architecture, Training Techniques, and Performance Insights

This article provides an in‑depth technical overview of DeepSeek‑V3, DeepSeek‑R1 and Janus‑Pro models, covering their Mixture‑of‑Experts architecture, novel MLA attention, auxiliary‑loss‑free load balancing, multi‑token prediction, FP8 mixed‑precision training, efficient cross‑node communication, reinforcement‑learning pipelines, multimodal modeling strategies, performance comparisons, cost statistics, and current limitations.

AI ArchitectureDeepSeek-V3FP8 training
0 likes · 18 min read
DeepSeek-V3, DeepSeek-R1, and Janus‑Pro: Architecture, Training Techniques, and Performance Insights
Java Captain
Java Captain
Feb 7, 2025 · Artificial Intelligence

DeepSeek: Disruptive Innovations in Large Language Model Architecture, Efficiency, and Ecosystem

DeepSeek reshapes the AI landscape by replacing brute‑force compute scaling with algorithmic breakthroughs such as a novel MoE architecture, memory compression, active‑learning data pipelines, and open‑source tooling, delivering dramatically lower training and inference costs while enabling edge deployment and a vibrant developer ecosystem.

Algorithmic EfficiencyDeepSeekMoE
0 likes · 11 min read
DeepSeek: Disruptive Innovations in Large Language Model Architecture, Efficiency, and Ecosystem
CSS Magic
CSS Magic
May 13, 2024 · Artificial Intelligence

DeepSeek: China’s New LLM Dark Horse – First Impressions and Shockingly Low Prices

The article evaluates DeepSeek v2, a 100‑billion‑parameter MoE model, highlighting its near‑GPT‑4 benchmark performance, OpenAI‑compatible API, 32k‑token context, exceptionally low pricing, a custom token‑utilization metric, and the practical drawbacks observed during hands‑on testing.

API compatibilityBenchmarkDeepSeek
0 likes · 9 min read
DeepSeek: China’s New LLM Dark Horse – First Impressions and Shockingly Low Prices
Baobao Algorithm Notes
Baobao Algorithm Notes
Mar 28, 2024 · Artificial Intelligence

How Qwen1.5‑MoE‑A2.7B Matches 70B LLM Performance with Only 2.7B Activated Parameters

Qwen1.5‑MoE‑A2.7B is a 2.7 billion‑parameter Mixture‑of‑Experts model that delivers performance comparable to leading 7 billion‑parameter LLMs while cutting training cost by 75% and boosting inference speed by 1.74×, and the article details its architecture, benchmarks, efficiency analysis, and deployment steps.

MoEModel BenchmarkQwen
0 likes · 13 min read
How Qwen1.5‑MoE‑A2.7B Matches 70B LLM Performance with Only 2.7B Activated Parameters
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Mar 26, 2024 · Artificial Intelligence

MoE LLMs: How Alibaba Cloud & NVIDIA Megatron-Core Accelerate Training

This article reviews the evolution of Mixture-of-Experts (MoE) models, details Alibaba Cloud’s collaboration with NVIDIA’s Megatron-Core to build a high-performance MoE framework, and presents extensive training optimizations, benchmark results, conversion tools, and best-practice guidelines for large-scale LLM development and deployment.

Alibaba CloudMegatron-CoreMoE
0 likes · 18 min read
MoE LLMs: How Alibaba Cloud & NVIDIA Megatron-Core Accelerate Training
Java Tech Enthusiast
Java Tech Enthusiast
Feb 16, 2024 · Artificial Intelligence

Google's Gemini 1.5: Breakthrough in Long-Context Understanding and Multimodal Capabilities

Google’s Gemini 1.5, a new multimodal Mixture‑of‑Experts model, supports up to a million‑token context (10 million internally), can understand text, video, audio and code, learns a new language from a single prompt, and is already being used by Samsung, Jasper and Quora, positioning it as a direct challenger to OpenAI’s flagship models.

Gemini 1.5Google AILLM
0 likes · 7 min read
Google's Gemini 1.5: Breakthrough in Long-Context Understanding and Multimodal Capabilities
DataFunTalk
DataFunTalk
Aug 12, 2022 · Artificial Intelligence

Multi‑Task Learning for Sample Selection Bias in Financial Risk Control

This article presents a comprehensive study on addressing sample selection bias in credit risk modeling by applying multi‑task learning techniques, including MoE/MMoE, ESMM, hierarchical attention, and semi‑supervised loss, and demonstrates their effectiveness through two real‑world application cases and experimental results.

Financial AIMoErisk control
0 likes · 14 min read
Multi‑Task Learning for Sample Selection Bias in Financial Risk Control