Tagged articles

Grouped Query Attention

3 articles · Page 1 of 1

Feb 6, 2025 · Artificial Intelligence

DeepSeek V Series: Technical Overview of Scaling Laws, Grouped Query Attention, and Mixture‑of‑Experts

The article reviews DeepSeek’s V‑series papers, explaining how scaling‑law insights, Grouped Query Attention, a depth‑first design, auxiliary‑loss‑free load balancing, multi‑token prediction, and Multi‑Head Latent Attention together enable economical mixture‑of‑experts LLMs that rival closed‑source models while cutting compute and hardware costs.

DeepSeek · Grouped Query Attention · Mixture of Experts

0 likes · 13 min read

AI2ML AI to Machine Learning

Feb 5, 2025 · Artificial Intelligence

What Optimizations Power DeepSeek’s High‑Efficiency LLMs?

The article enumerates DeepSeek’s technical optimizations, including Grouped Query Attention, Multi‑Head Latent Attention, Mixture‑of‑Experts, 4D parallelism, quantization, and multi‑token prediction, that together enable low‑cost, high‑performance large language models.

4D parallelism · DeepSeek · Grouped Query Attention

0 likes · 8 min read

Baobao Algorithm Notes

Jul 31, 2024 · Artificial Intelligence

What Makes Mistral’s 7B, Mixtral, and Large 2 Models Stand Out? A Deep Technical Dive

This article compiles key technical details of the Mistral model family (Mistral 7B, Mixtral 8×7B, Mixtral 8×22B, Mistral NeMo, and Mistral Large 2), covering architectural innovations such as sliding‑window attention, grouped‑query attention, and mixture‑of‑experts design, along with scaling parameters, performance benchmarks, quantization requirements, and practical deployment commands.

Grouped Query Attention · Large Language Model · Mistral

0 likes · 17 min read
