Tagged articles

10 articles

Jan 22, 2026 · Artificial Intelligence

How STEM Replaces MoE Routing with Simple Table Lookup for Faster Transformers

The article presents STEM, a method that transforms dense and MoE transformer architectures by converting the expert routing step into a static table‑lookup operation, achieving higher parameter efficiency, lower communication overhead, and improved interpretability while maintaining or boosting downstream task performance.

Embedding Lookup · Interpretability · Mixture of Experts

0 likes · 6 min read

DataFunTalk

Jan 13, 2026 · Artificial Intelligence

How Conditional Memory (Engram) Boosts Large Language Models Beyond MoE

DeepSeek's new paper introduces a conditional memory mechanism called Engram that complements Mixture‑of‑Experts, providing O(1) lookup, improving knowledge retrieval, reasoning, and long‑context performance while scaling efficiently on the same FLOPs budget.

Engram · Sparse Models · conditional memory

0 likes · 18 min read

AI Algorithm Path

May 9, 2025 · Artificial Intelligence

A Visual Guide to Mixture of Experts (MoE) Architecture in Large Language Models

This article explains the Mixture of Experts (MoE) technique used in modern LLMs, detailing its core components—experts and router—comparing dense and sparse layers, describing load‑balancing, expert capacity, and routing strategies, and showcasing real‑world examples such as Switch Transformer, Vision‑MoE, and Mixtral 8x7B.

Expert Capacity · LLM · Mixture of Experts

0 likes · 15 min read

AI Frontier Lectures

Apr 27, 2025 · Artificial Intelligence

How Jeff Dean’s Vision Shaped Modern AI: From Neural Nets to Gemini

Jeff Dean’s 2024 ETH Zurich talk traces fifteen years of AI breakthroughs—from the rise of neural networks and back‑propagation, through large‑scale distributed training, TPUs, Transformers, sparse MoE models, and advanced prompting techniques—showing how scaling compute, data, and clever software have driven today’s powerful Gemini models.

AI · Chain-of-Thought · Distillation

0 likes · 18 min read

Architect

Mar 2, 2025 · Artificial Intelligence

Demystifying Mixture of Experts: How MoE Boosts LLMs and Vision Models

This article explains the Mixture of Experts (MoE) architecture, detailing experts, routers, dense vs. sparse layers, load‑balancing strategies such as KeepTopK, auxiliary loss, capacity constraints, the Switch Transformer simplification, and how MoE is applied to both language and vision models, illustrated with concrete examples and parameter counts.

Mixture of Experts · MoE · Sparse Models

0 likes · 17 min read

Architect

Feb 10, 2025 · Artificial Intelligence

Evolution of DeepSeek Mixture‑of‑Experts (MoE) Architecture from V1 to V3

This article reviews the development of DeepSeek's Mixture-of-Experts (MoE) models, tracing their evolution from the original DeepSeekMoE V1 through V2 to V3, detailing architectural innovations such as fine‑grained expert segmentation, shared‑expert isolation, load‑balancing losses, device‑limited routing, and the shift from softmax to sigmoid gating.

DeepSeek · LLM · Mixture of Experts

0 likes · 21 min read

JD Retail Technology

Aug 30, 2024 · Artificial Intelligence

GPU Optimization Practices for Training and Inference in JD Advertising Recommendation Systems

The article details JD Advertising's technical challenges and solutions for large‑scale sparse recommendation models, describing GPU‑focused storage, compute and I/O optimizations for both training and low‑latency inference, including distributed pipelines, heterogeneous deployment, batch aggregation, multi‑stream execution, and compiler extensions.

Distributed Systems · GPU Optimization · Inference

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

May 24, 2024 · Artificial Intelligence

How DeepRec Extension Boosts Distributed Sparse Model Training with Elasticity and Fault Tolerance

DeepRec Extension enhances large‑scale sparse model training by adding automatic elastic training, resource‑aware scheduling, real‑time monitoring, and efficient fault‑tolerance mechanisms, enabling lower cost, higher throughput, and more reliable distributed training for AI workloads.

AI Infrastructure · DeepRec · Sparse Models

0 likes · 13 min read

DataFunSummit

Mar 12, 2023 · Artificial Intelligence

PaddleBox and FeaBox: GPU‑Based Large‑Scale Sparse Model Training and Integrated Feature Extraction Frameworks at Baidu

The article introduces PaddleBox and FeaBox, two GPU‑driven frameworks designed for massive sparse DNN training and unified feature extraction, detailing their architecture, performance advantages, hardware‑software co‑design challenges, and successful deployment across Baidu's advertising systems.

FeaBox · GPU · PaddleBox

0 likes · 24 min read

DataFunTalk

Apr 17, 2022 · Artificial Intelligence

DeepRec: Alibaba’s Sparse Model Training Engine – Architecture, Features, and Open‑Source Status

DeepRec, developed by Alibaba since 2016, is a specialized sparse‑model training engine that addresses feature elasticity, training performance, and deployment challenges through dynamic elastic features, optimized runtimes, distributed training frameworks, incremental model export, and multi‑level storage. It is now being open‑sourced for broader industry collaboration.

AI Infrastructure · DeepRec · Runtime Optimization

0 likes · 15 min read
