Tagged articles
126 articles
Baobao Algorithm Notes
Jan 3, 2025 · Artificial Intelligence

How DeepSeek-V3 Achieves Massive Scale with FP8, MoE, and System Optimizations

The article examines DeepSeek‑V3’s architecture and training pipeline, highlighting its use of MLA and a highly granular MoE design, pioneering FP8 mixed‑precision training, fine‑grained per‑tile quantization, advanced parallelism strategies, and inference optimizations such as PD separation and NanoFlow to achieve unprecedented efficiency on limited GPU resources.

DeepSeek-V3 · FP8 · Inference Optimization
10 min read
Tencent Cloud Developer
Nov 6, 2024 · Artificial Intelligence

Overview of Tencent Hunyuan Large and 3D Generation Model Open‑Source Release

Tencent has open‑sourced its 389‑billion‑parameter Hunyuan Large Mixture‑of‑Experts model—featuring 52 B active parameters, 256 K token context, novel routing, KV‑cache compression, and advanced training optimizations that beat leading open‑source models—and its first text‑to‑3D/image‑to‑3D Hunyuan 3D Generation model, both downloadable via GitHub, Hugging Face, and Tencent Cloud.

3D generation · AI research · Mixture of Experts
9 min read
NewBeeNLP
Oct 21, 2024 · Artificial Intelligence

Why Do MoE Experts Collapse? An In‑Depth Look at HOME’s Multi‑Task Architecture

This article analyzes the polarization issues in industrial Mixture‑of‑Experts (MoE) frameworks, explains expert collapse, degradation, and under‑fitting, and details the HOME model’s input types, architectural innovations, normalization, gating mechanisms, and related DICE‑BN insights.

Expert Normalization · Gating Mechanisms · Mixture of Experts
10 min read
Baobao Algorithm Notes
Sep 9, 2024 · Artificial Intelligence

How MoSLoRA Reinvents Low‑Rank Adaptation with Mixer Matrices

This article analyzes the Mixture‑of‑Subspaces in Low‑Rank Adaptation (MoSLoRA) paper, explaining its motivation, design choices that replace LoRA's gate with a mixer matrix, connections to multi‑head attention, experimental findings on LLaMA‑3 fine‑tuning, and theoretical proofs of its re‑parameterization properties.

AI · LoRA · Mixture of Experts
12 min read
Baobao Algorithm Notes
Jul 31, 2024 · Artificial Intelligence

What Makes Mistral’s 7B, Mixtral, and Large 2 Models Stand Out? A Deep Technical Dive

This article compiles key technical details of the Mistral model family—including Mistral 7B, Mixtral 8×7B, Mixtral 8×22B, Mistral Nemo, and Mistral Large 2—covering their architectural innovations such as sliding‑window attention, grouped‑query attention, mixture‑of‑experts design, scaling parameters, performance benchmarks, quantization requirements, and practical deployment commands.

Grouped Query Attention · Mistral · Mixtral
17 min read
360 Smart Cloud
Jul 4, 2024 · Artificial Intelligence

Optimizing Mixture-of-Experts (MoE) Training with the QLM Framework

This article introduces the background and challenges of large language model training, explains the Mixture-of-Experts (MoE) architecture, and details several optimization techniques implemented in the QLM framework, including fine-grained and shared experts, top‑k gating, token distribution, expert parallelism, and grouped GEMM, to improve training efficiency and performance.

AI · Distributed Training · Mixture of Experts
10 min read
NewBeeNLP
Jun 7, 2024 · Artificial Intelligence

Scaling Laws, Synthetic Data, and New Model Architectures: What’s Next?

In a recent round‑table, experts debated the validity of scaling laws and the role of synthetic and semi‑synthetic data in overcoming data scarcity, explored alternatives to Transformers such as RNN‑based models and MoE, and examined techniques for handling long‑context inference efficiently.

Mixture of Experts · Model architecture · scaling laws
12 min read
Baobao Algorithm Notes
May 31, 2024 · Industry Insights

Do Scaling Laws Still Hold? Deep Dive into Synthetic Data, New Model Architectures, and Long‑Context Solutions

In a May 15 round‑table, experts debated the validity of scaling laws and the role of synthetic and semi‑synthetic data in overcoming data bottlenecks, explored alternatives to the Transformer such as RNN‑based and hybrid designs, evaluated the practicality of Mixture‑of‑Experts models, and examined two main strategies for truly long‑context processing: KV‑cache compression and input‑context reduction.

Mixture of Experts · long context
13 min read
Kuaishou Tech
May 27, 2024 · Artificial Intelligence

What Kuaishou’s Four ACL Papers Reveal About the Future of Large Language Models

The 62nd ACL conference accepted four papers from Kuaishou that explore multi‑turn instruction following, self‑agreement reasoning, fine‑grained reinforcement learning, and dynamic routing in Mixture‑of‑Experts models, each with detailed methods, experimental results, author lists, and public arXiv links.

ACL 2024 · Kuaishou Research · Mixture of Experts
11 min read
DeWu Technology
May 15, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference: Techniques and Framework Recommendations

Deploying a dedicated inference cluster and applying four key optimizations (FlashAttention‑based attention computation, PagedAttention KV‑cache management, Mixture‑of‑Experts parameter reduction, and tensor parallelism) can accelerate large language model inference by up to 50% for models as large as 70 B parameters while cutting deployment costs.

FlashAttention · Inference Acceleration · Mixture of Experts
17 min read
Baobao Algorithm Notes
May 6, 2024 · Artificial Intelligence

DeepSeek-V2: 236B MoE LLM Delivers Higher Performance While Cutting Training Cost by 42%

DeepSeek‑V2 is a 236‑billion‑parameter mixture‑of‑experts language model that reduces training cost by 42.5%, cuts KV‑cache usage by 93.3%, and boosts generation throughput 5.76×, while achieving state‑of‑the‑art scores on benchmarks such as MMLU, C‑Eval, BBH, HumanEval, and GSM8K for both base and chat variants.

AI · DeepSeek-V2 · Mixture of Experts
11 min read
NewBeeNLP
Apr 2, 2024 · Artificial Intelligence

Jamba: How AI21 Labs Merged Mamba and Transformer for 3× Faster 128k Contexts

Jamba, a hybrid Mamba‑Transformer model from AI21 Labs, combines state‑space and attention layers with Mixture‑of‑Experts to deliver up to three times the throughput of comparable 52‑billion‑parameter LLMs on 128k context windows while maintaining high output quality and low memory usage.

Jamba · LLM · Mamba
6 min read
21CTO
Mar 29, 2024 · Artificial Intelligence

Why Databricks’ Open‑Source DBRX LLM Is Outpacing GPT‑3.5 and Llama 2

Databricks unveiled the open‑source DBRX large language model, which leverages a mixture‑of‑experts architecture to deliver faster, more cost‑effective inference and beats leading open‑source and proprietary models like Llama 2, Mixtral‑8x7B, and GPT‑3.5 on multiple benchmarks.

AI · DBRX · Databricks
7 min read
Rare Earth Juejin Tech Community
Mar 20, 2024 · Artificial Intelligence

Elon Musk’s xAI Open‑Sources Grok‑1: A 314‑Billion‑Parameter MoE Large Language Model

Elon Musk’s xAI has open‑sourced Grok‑1, a 314‑billion‑parameter mixture‑of‑experts language model built with Rust and JAX and released under the Apache‑2.0 license. The announcement covers detailed architecture specs, hardware requirements, and the broader context of Musk’s rivalry with OpenAI.

AI · Grok-1 · Mixture of Experts
6 min read
DataFunTalk
Mar 14, 2024 · Artificial Intelligence

Efficiency Challenges and Multi‑Layer Optimization for Large AI Models

The article examines how large AI models are moving toward a unified paradigm that reduces task‑algorithm coupling, outlines multi‑layer efficiency challenges—from model compression and sparsity to software and infrastructure optimization—and highlights NVIDIA’s GTC 2024 China AI Day sessions showcasing the latest LLM technologies and registration details.

AI efficiency · Mixture of Experts · NVIDIA GTC
13 min read
Baobao Algorithm Notes
Mar 10, 2024 · Artificial Intelligence

Unlocking Large Model Power: 5 Effective Model Fusion Techniques Explained

This article examines why ensemble methods are crucial for large language models, outlines five core fusion strategies—including model integration, probability integration, graft learning, crowdsourced voting, and Mixture of Experts—provides implementation details, pseudo‑code, and discusses practical challenges and recent research advances.

AI research · Mixture of Experts · Model Fusion
16 min read
Alibaba Cloud Big Data AI Platform
Jan 29, 2024 · Artificial Intelligence

Unlocking Sparse MoE Large Model Training with Megatron-Core on Alibaba Cloud

This article explains how Alibaba Cloud's PAI platform and NVIDIA's Megatron-Core enable efficient training of sparse Mixture-of-Experts (MoE) large language models, covering algorithm basics, the Megatron-Core MoE framework, weight conversion pipelines, and performance results on Mixtral‑8x7B.

Megatron-Core · Mixture of Experts · Model Parallelism
18 min read
Baobao Algorithm Notes
Jan 2, 2024 · Artificial Intelligence

Uncovering Mixtral‑8x7B: How MoE Experts Shape Performance and Training

This article analyses the Mixtral‑8x7B Mixture‑of‑Experts LLM, explains its gate‑driven 8‑expert architecture, presents a simplified PyTorch implementation, and reports a series of experiments that probe top‑2 gating during training, individual expert contributions, task‑specific pre‑training, the impact of expert count, and similarity with Mistral‑7B, ultimately offering hypotheses about its training pipeline.

LLM · Mixtral · Mixture of Experts
14 min read
Huawei Cloud Developer Alliance
Nov 3, 2023 · Artificial Intelligence

Can LLMs Master Lifelong Learning? Exploring MoE and Continuous Adaptation

This article explains how large language models can achieve continual lifelong learning, outlines the key properties required, reviews mixture‑of‑experts (MoE) techniques—including sparse MoE, GShard, Switch Transformer, GLaM and PanGu‑Sigma—and discusses the remaining challenges such as model complexity, expert balancing and distributed communication overhead.

LLM · Lifelong Learning · Mixture of Experts
9 min read
Tencent Advertising Technology
Mar 2, 2023 · Artificial Intelligence

Tencent's HunYuan‑NLP 1T Large‑Scale AI Model: Training Techniques, Optimization, and Real‑World Applications

This article details Tencent's development of the 1‑trillion‑parameter HunYuan‑NLP model, covering its MoE architecture, cost‑effective pre‑training strategies, distributed training framework, model compression toolkit, and successful deployment across advertising, gaming, and other Tencent services.

AI Infrastructure · Mixture of Experts · large language model
17 min read
Meituan Technology Team
Dec 8, 2022 · Artificial Intelligence

Contextualized Recommendation in Meituan Takeaway: Segmented & Unified Modeling, Long‑Sequence Retrieval, and Multi‑Expert Networks

Meituan Takeaway’s recommendation system partitions user contexts such as time, location, entry page, and business type, then uses a unified model with long‑sequence retrieval and a multi‑expert Mixture‑of‑Experts network to deliver context‑aware food‑delivery suggestions, achieving notable CTR and conversion gains while maintaining low latency.

Meituan · Mixture of Experts · contextual modeling
32 min read
IEG Growth Platform Technology Team
Nov 28, 2022 · Artificial Intelligence

Bidden-MarfNet: Feature Missing-aware Routing-and-Fusion Network for Customer Lifetime Value Prediction

This paper presents Bidden-MarfNet, a novel architecture that explicitly encodes feature‑missing information and dynamically re‑weights samples to address feature missingness and label sparsity in user‑level LTV prediction for advertising, demonstrating superior performance over existing methods through extensive experiments.

LTV prediction · Mixture of Experts · dynamic weighting
13 min read
DataFunSummit
Apr 19, 2022 · Artificial Intelligence

DeepSpeed‑MoE: End‑to‑End Training and Inference Solutions for Mixture‑of‑Experts Models

This article reviews DeepSpeed‑MoE, an end‑to‑end system that introduces new MoE architectures, model‑compression techniques, and highly optimized inference pipelines, detailing its motivation, design of PR‑MoE (Pyramid‑MoE and Residual‑MoE), distributed parallel strategies, communication and kernel optimizations, and performance gains over dense baselines.

AI · DeepSpeed · Inference Optimization
11 min read
DataFunSummit
Aug 16, 2021 · Artificial Intelligence

Scaling Deep Learning Models: From Depth to Width and Parallelism Strategies

The article reviews how deep learning models have grown deeper and wider, discusses the memory and bandwidth limits of single GPUs, and explains pipeline and sharding techniques—including GPU clusters and TPU pods—to efficiently train large‑scale models in industrial settings.

GPU · Mixture of Experts · Model Parallelism
6 min read
DataFunTalk
Aug 7, 2021 · Artificial Intelligence

Multi-Category Mixture-of-Experts Model for JD Search Ranking

This article presents a multi‑category Mixture‑of‑Experts (MoE) approach for e‑commerce search ranking, addressing category‑specific behavior and small‑category learning by introducing hierarchical soft constraints and adversarial regularization, and demonstrates significant AUC and NDCG gains on Amazon and JD in‑house datasets.

Adversarial Regularization · Hierarchical Soft Constraint · Mixture of Experts
10 min read
ITPUB
Jun 25, 2021 · Artificial Intelligence

How Alibaba’s Low‑Carbon M6 Model Trains a Trillion‑Parameter AI with 80% Less Energy

Alibaba’s DAMO Academy unveiled the low‑carbon M6 multimodal model, a trillion‑parameter AI trained on just 480 V100 GPUs, achieving over 80% energy reduction and 11‑fold speedup compared to prior trillion‑parameter efforts, and already powering e‑commerce and manufacturing design tools.

GPU efficiency · Large Model · M6
5 min read