What Makes Mistral’s 7B, Mixtral, and Large 2 Models Stand Out? A Deep Technical Dive

This article compiles key technical details of the Mistral model family—including Mistral 7B, Mixtral 8×7B, Mixtral 8×22B, Mistral Nemo, and Mistral Large 2—covering their architectural innovations such as sliding‑window attention, grouped‑query attention, mixture‑of‑experts design, scaling parameters, performance benchmarks, quantization requirements, and practical deployment commands.

Overview

The Mistral series comprises several open‑source large language models (LLMs) ranging from 7 billion to 123 billion parameters. Each model introduces specific architectural tweaks—sliding‑window attention, grouped‑query attention (GQA), and mixture‑of‑experts (MoE)—to improve inference speed, memory efficiency, and task performance. The article aggregates official blog links, arXiv papers, and Hugging Face repositories for reference.

Mistral 7B

Key features include a 4096‑token sliding window and a 32‑layer transformer. The model originally used Sliding Window Attention (SWA) to stretch the theoretical attention span to roughly 131 K tokens (32 layers × a 4096‑token window), though later Hugging Face checkpoints (e.g., mistral-7b-instruct-v0.2) disable SWA.

The SWA implementation relies on a fixed‑size rolling cache of length W, writing the key/value pair for position i into cache slot i mod W. The following excerpt from the Hugging Face implementation shows how the causal mask's target length depends on the cache type:

# Hugging Face Mistral attention-mask implementation (excerpt)

def _update_causal_mask(self, attention_mask: torch.Tensor, input_tensor: torch.Tensor, cache_position: torch.Tensor, past_key_values: Cache):
    # ... omitted unrelated code ...
    past_seen_tokens = cache_position[0] if past_key_values is not None else 0
    using_static_cache = isinstance(past_key_values, StaticCache)
    using_sliding_window_cache = isinstance(past_key_values, SlidingWindowCache)
    dtype, device = input_tensor.dtype, input_tensor.device
    min_dtype = torch.finfo(dtype).min
    sequence_length = input_tensor.shape[1]
    if using_sliding_window_cache:
        # Rolling-buffer cache: the mask never needs to cover more than the window
        target_length = max(sequence_length, self.config.sliding_window)
    elif using_static_cache:
        # Pre-allocated cache: the mask covers the full pre-allocated length
        target_length = past_key_values.get_max_length()
    else:
        # Dynamic cache: the mask covers everything seen so far plus the new tokens
        target_length = (attention_mask.shape[-1] if isinstance(attention_mask, torch.Tensor) else past_seen_tokens + sequence_length + 1)
    # mask construction omitted for brevity
    return causal_mask
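
The rolling‑buffer bookkeeping itself is simple. Below is a minimal sketch of a cache that keeps only the last W key/value pairs; it illustrates the idea and is not the Hugging Face SlidingWindowCache class:

import torch

class RollingKVCache:
    """Minimal sketch of a rolling-buffer KV cache with window size W."""

    def __init__(self, window: int, num_kv_heads: int, head_dim: int, dtype=torch.float16):
        self.window = window
        self.keys = torch.zeros(window, num_kv_heads, head_dim, dtype=dtype)
        self.values = torch.zeros(window, num_kv_heads, head_dim, dtype=dtype)
        self.seen = 0  # total number of tokens written so far

    def update(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # Position i is written to slot i mod W, overwriting whatever
        # fell out of the sliding window.
        slot = self.seen % self.window
        self.keys[slot] = k
        self.values[slot] = v
        self.seen += 1

    def get(self):
        # Only the slots that actually hold data are valid.
        n = min(self.seen, self.window)
        return self.keys[:n], self.values[:n]

cache = RollingKVCache(window=4096, num_kv_heads=8, head_dim=128)
cache.update(torch.randn(8, 128, dtype=torch.float16), torch.randn(8, 128, dtype=torch.float16))
keys, values = cache.get()  # shapes: (1, 8, 128) after one token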

Grouped Query Attention (GQA) shares key/value heads across groups of query heads instead of giving every query head its own K/V pair; Multi‑Query Attention (MQA) is the extreme case with a single K/V head for all queries. This shrinks the KV cache and reduces memory traffic while largely preserving quality. Experiments in the GQA paper show a trade‑off between speed and output quality; Mistral 7B pairs 32 query heads with 8 KV heads, the same KV‑head count used by Llama 2 70B.

Why are MQA and GQA popular? They shrink the KV cache and lower GPU memory‑bandwidth demands, so each decoding step moves less data and inference runs faster. The sketch below shows how the 8 shared K/V heads are broadcast to the 32 query heads.
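
A minimal sketch of that grouping step, assuming 32 query heads and 8 KV heads as in Mistral 7B (it mirrors the repeat_kv helper in the Hugging Face implementation, but is a simplified illustration rather than a copy):

import torch

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    # kv: (batch, num_kv_heads, seq_len, head_dim)
    # Broadcast each KV head to the n_rep query heads in its group.
    batch, num_kv_heads, seq_len, head_dim = kv.shape
    kv = kv[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, seq_len, head_dim)
    return kv.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)

num_heads, num_kv_heads, head_dim, seq_len = 32, 8, 128, 16
q = torch.randn(1, num_heads, seq_len, head_dim)
k = torch.randn(1, num_kv_heads, seq_len, head_dim)  # the KV cache stores only 8 heads
v = torch.randn(1, num_kv_heads, seq_len, head_dim)

# Each group of 32 / 8 = 4 query heads attends to the same K/V head.
k = repeat_kv(k, num_heads // num_kv_heads)
v = repeat_kv(v, num_heads // num_kv_heads)
attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1) @ v
print(attn.shape)  # torch.Size([1, 32, 16, 128])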

Mixtral 8×7B

Mixtral combines a standard decoder‑only transformer with an MoE layer in which only the feed‑forward network (FFN) is split into multiple experts; the attention layers stay dense. Each layer has eight expert MLPs, and the model totals roughly 47 B parameters, of which about 13 B are active per token.

Release: December 2023

Architecture: 8 experts per MoE layer, with top‑2 routing per token

Training: Pre‑training plus SFT + DPO fine‑tuning

The core MoE logic replaces the usual FFN with block_sparse_moe. The routing gate selects the top‑2 experts per token, and the final hidden state is a weighted sum of the selected experts’ outputs.

import torch
import torch.nn.functional as F
from torch import nn

class MixtralSparseMoeBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.hidden_dim = config.hidden_size
        self.ffn_dim = config.intermediate_size
        self.num_experts = config.num_local_experts   # 8 for Mixtral
        self.top_k = config.num_experts_per_tok       # 2 for Mixtral
        # Router: a linear layer that scores every expert for every token
        self.gate = nn.Linear(self.hidden_dim, self.num_experts, bias=False)
        # Expert FFNs (MixtralBlockSparseTop2MLP is the standard gated MLP)
        self.experts = nn.ModuleList([MixtralBlockSparseTop2MLP(config) for _ in range(self.num_experts)])
        self.jitter_noise = config.router_jitter_noise

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len, hidden_dim = hidden_states.shape
        # Flatten to (num_tokens, hidden_dim) so routing is decided per token
        hidden_states = hidden_states.view(-1, hidden_dim)
        router_logits = self.gate(hidden_states)
        routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
        # Keep only the top-k experts per token and renormalize their weights
        routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
        routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
        routing_weights = routing_weights.to(hidden_states.dtype)
        final_hidden = torch.zeros(batch_size * seq_len, hidden_dim, dtype=hidden_states.dtype, device=hidden_states.device)
        # One-hot mask of shape (num_experts, top_k, num_tokens): which tokens go to which expert
        expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0)
        for expert_idx in range(self.num_experts):
            expert_layer = self.experts[expert_idx]
            idx, top_x = torch.where(expert_mask[expert_idx])
            # Gather the tokens routed to this expert, run its FFN, and scale by the gate weight
            current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
            current_hidden = expert_layer(current_state) * routing_weights[top_x, idx, None]
            # Scatter-add each expert's contribution back into the right token rows
            final_hidden.index_add_(0, top_x, current_hidden.to(hidden_states.dtype))
        final_hidden = final_hidden.reshape(batch_size, seq_len, hidden_dim)
        return final_hidden, router_logits
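
As a quick shape check, the block can be exercised directly with a tiny configuration; the import paths below assume a recent transformers release that exposes these classes:

import torch
from transformers import MixtralConfig
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

# Tiny config so the example runs on CPU; real Mixtral uses hidden_size=4096, intermediate_size=14336.
config = MixtralConfig(hidden_size=64, intermediate_size=128,
                       num_local_experts=8, num_experts_per_tok=2)
moe = MixtralSparseMoeBlock(config)

x = torch.randn(2, 5, 64)              # (batch, seq_len, hidden)
out, router_logits = moe(x)
print(out.shape, router_logits.shape)  # (2, 5, 64) and (2 * 5, 8)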

Inference memory requirements: fp16 weights need >90 GB of VRAM, while 4‑bit quantization reduces this to roughly 30 GB. Throughput reaches ~20 tokens/s on a single RTX 4090 + 7950X3D and ~48 tokens/s on a dual‑3090 setup with Q4_K_M quantization.
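
The fp16 figure follows directly from the parameter count; a back‑of‑the‑envelope estimate for the weights alone (activations and KV cache excluded):

# Rough VRAM estimate for the Mixtral 8x7B weights only; ~47 B total parameters assumed.
total_params = 46.7e9
fp16_gb = total_params * 2 / 1e9    # 2 bytes per parameter   -> ~93 GB
int4_gb = total_params * 0.5 / 1e9  # 0.5 bytes per parameter -> ~23 GB before quantization overhead
print(f"fp16: ~{fp16_gb:.0f} GB, 4-bit: ~{int4_gb:.0f} GB")

The Docker command below serves the fp16 weights with vLLM across two GPUs: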

docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    ghcr.io/mistralai/mistral-src/vllm:latest \
    --host 0.0.0.0 \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 2 \
    --load-format pt  # tensor parallelism across 2 GPUs needs 100+ GB VRAM in total

NVIDIA’s TensorRT‑LLM benchmark shows Mixtral 8×7B can sustain up to 7500 tokens/s on a dual‑H100 deployment for 128‑token sequences, translating to roughly 0.02 kWh per million tokens.

Mixtral 8×22B

The 8×22B variant shares the same MoE architecture as the 8×7B model but scales each expert up, reaching roughly 141 B total parameters (about 39 B active per token), and expands the context window to 64 K tokens. It uses the same MixtralForCausalLM class on Hugging Face.

Mistral Nemo

Mistral Nemo is a 12 B‑parameter model that adopts the standard MistralForCausalLM architecture with notable changes: hidden size increased to 5120, max position embeddings raised to 1,024,000, 40 transformer layers, and a vocabulary of 131,072 tokens. It supports function calling and uses the Tekken tokenizer, which Mistral claims is roughly 30 % more compression‑efficient than SentencePiece. These values can be read straight from the published configuration, as shown below.
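
A quick way to confirm the configuration is to load it from the Hugging Face Hub, assuming access to the mistralai/Mistral-Nemo-Instruct-2407 repository:

from transformers import AutoConfig

# Reads the published config from the Hub; a token may be required depending on access settings.
cfg = AutoConfig.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
print(cfg.hidden_size)              # 5120
print(cfg.num_hidden_layers)        # 40
print(cfg.max_position_embeddings)  # 1024000
print(cfg.vocab_size)               # 131072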

Training leveraged Megatron‑LM on 3,072 H100 80 GB GPUs. FP16 inference consumes ~23 GB of VRAM; full‑context inference still requires off‑loading or further quantization.

Mistral Large 2

The largest model in the series, Mistral Large 2, has 123 B parameters and a 128 K (131,072‑token) context window. It retains the same base architecture as Mistral 7B, supports function calling, and excels at coding, mathematics, and multilingual tasks. Reported benchmarks show it outperforming Llama 3.1 on code generation and MT‑Bench evaluations, and its Chinese‑language capability improves markedly over the previous Mistral Large.

All referenced resources—including official blogs, arXiv papers, and GitHub repositories—are retained as plain URLs for further exploration.

Tags: Mixture of Experts, large language model, Grouped Query Attention, model quantization, Mistral, Mixtral, Sliding Window Attention