How New LLM Architectures Like Gemma 4 and DeepSeek V4 Cut Long‑Context Costs

Recent open‑weight LLMs such as Gemma 4, Laguna XS.2, ZAYA1‑8B, and DeepSeek V4 introduce KV‑cache sharing, per‑layer embeddings, layer‑wise attention budgeting, and compressed attention mechanisms that dramatically reduce memory and compute overhead for very long contexts while preserving model quality.

Machine Heart

May 19, 2026

Background – Users increasingly demand longer context windows, but the cost of serving them grows with sequence length: the KV cache grows linearly, and full‑attention computation grows quadratically. The article surveys several recent open‑weight models that address this cost.
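To make the scaling concrete, here is a minimal sketch of the standard KV cache size estimate for a plain Transformer with no sharing or compression. The model shape used below is hypothetical and does not describe any of the models in this article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    The factor 2 covers the separate K and V tensors; bytes_per_elem=2
    corresponds to bfloat16 or float16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
layers, kv_heads, head_dim = 32, 8, 128

for seq_len in (8_192, 131_072, 1_048_576):
    gib = kv_cache_bytes(layers, kv_heads, head_dim, seq_len) / 2**30
    print(f"{seq_len:>9} tokens -> {gib:6.1f} GiB of KV cache")
```

Every factor in the product is a lever: the techniques surveyed here reduce the number of layers that cache KV, the width of what is cached, or the number of cached entries per sequence.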

Gemma 4 KV‑Cache Sharing – Gemma 4 adopts Grouped Query Attention (GQA) and extends it with cross‑layer KV sharing: only the first 15 of 35 layers in the E2B variant compute their own KV projections, while the remaining layers reuse the KV tensors from earlier layers. This reduces the KV cache size by roughly 50%, saving about 2.7 GB of VRAM for E2B and 6 GB for E4B at 128K context in bfloat16. The technique follows the NeurIPS 2024 paper “Reducing Transformer Key‑Value Cache Size with Cross‑Layer Attention”.
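The layer bookkeeping behind cross‑layer sharing can be sketched as a simple lookup table. The 15‑of‑35 split comes from the text above; the rule that every sharing layer reads from the last KV‑producing layer is an assumption made for illustration, and `build_kv_source_map` is a hypothetical helper, not part of any Gemma code.

```python
def build_kv_source_map(num_layers, num_kv_layers):
    """Map each layer to the layer whose K/V tensors it reads.

    Layers 0..num_kv_layers-1 compute and cache their own K/V.
    Assumption: every later layer reuses the last KV-producing layer.
    """
    if not 0 < num_kv_layers <= num_layers:
        raise ValueError("num_kv_layers must be in 1..num_layers")
    last_kv_layer = num_kv_layers - 1
    return [i if i < num_kv_layers else last_kv_layer for i in range(num_layers)]


source = build_kv_source_map(num_layers=35, num_kv_layers=15)
cached_layers = len(set(source))
print(source[:3], source[-3:])  # early layers read their own KV, late layers reuse
print(f"cached layers: {cached_layers} of {len(source)}")
print(f"naive cache fraction: {cached_layers / len(source):.2f}")
```

This naive count treats every layer as equally expensive to cache, so it only approximates the reported saving of roughly 50%; the exact figure depends on per‑layer details that are not modeled here.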

Per‑Layer Embeddings (PLE) – Gemma 4 also introduces per‑layer embeddings, separating token‑specific embedding vectors from the main Transformer stack. The E2B model is labeled as 2.3 B effective parameters but actually contains 5.1 B total parameters when embeddings are counted; E4B has 4.5 B effective versus ~8 B total. PLE adds a small token‑specific vector after the Feed‑Forward block, improving expressiveness without expanding the main hidden size.
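A toy sketch of the idea follows: each layer owns a small lookup table indexed by token id, and the looked‑up vector is added to the hidden state after the feed‑forward block. The sizes, the per‑layer up‑projection, and all names are assumptions for illustration; the article does not describe how Gemma 4 maps the small vector into the hidden state.

```python
import random

random.seed(0)

VOCAB, NUM_LAYERS, HIDDEN, PLE_DIM = 10, 3, 8, 2  # toy sizes (hypothetical)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# One small embedding table per layer: ple_tables[layer][token_id] -> PLE_DIM vector.
ple_tables = [rand_matrix(VOCAB, PLE_DIM) for _ in range(NUM_LAYERS)]
# Assumed per-layer projection from PLE_DIM up to HIDDEN.
ple_proj = [rand_matrix(HIDDEN, PLE_DIM) for _ in range(NUM_LAYERS)]


def add_per_layer_embedding(hidden, token_id, layer):
    """Add the layer's token-specific vector to a hidden state (after the FFN)."""
    small = ple_tables[layer][token_id]
    projected = [sum(w * x for w, x in zip(row, small)) for row in ple_proj[layer]]
    return [h + p for h, p in zip(hidden, projected)]


hidden = [0.0] * HIDDEN
for layer in range(NUM_LAYERS):
    # ... attention and feed-forward would run here ...
    hidden = add_per_layer_embedding(hidden, token_id=4, layer=layer)

table_params = NUM_LAYERS * VOCAB * PLE_DIM
print(f"hidden size stays {len(hidden)}; PLE tables add {table_params} lookup parameters")
```

The tables are only read by lookup, one row per token per layer, which is why their parameters can be counted separately from the "effective" parameters of the main stack.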

Laguna XS.2 Layer‑Wise Attention Budgeting – Laguna XS.2 (Poolside) uses a mixture of Sliding‑Window and Global attention layers and gives each layer type its own query‑head budget. It allocates more query heads to the cheaper Sliding‑Window layers (e.g., 8 heads per KV head) and fewer to the Global layers (e.g., 6 heads per KV head). The per‑layer head counts are exposed via num_attention_heads_per_layer in the model’s config.json. This layer‑wise budgeting reduces compute on the expensive full‑attention layers while preserving global context.
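A config fragment with per‑layer head counts could be derived as in the sketch below. Only the field name num_attention_heads_per_layer and the 8:1 and 6:1 ratios come from the text; the layer pattern, the KV‑head count, and the other field names are assumptions for illustration and are not taken from the actual Laguna XS.2 config.

```python
import json

# Hypothetical layer pattern and KV-head count, for illustration only.
layer_types = ["sliding", "sliding", "sliding", "global"] * 2
num_kv_heads = 8
query_heads_per_kv_head = {"sliding": 8, "global": 6}  # ratios quoted in the text

config = {
    "layer_types": layer_types,            # assumed field name
    "num_key_value_heads": num_kv_heads,   # assumed field name
    "num_attention_heads_per_layer": [
        num_kv_heads * query_heads_per_kv_head[kind] for kind in layer_types
    ],
}

print(json.dumps(config, indent=2))
```

With these toy values, each Sliding‑Window layer gets 64 query heads and each Global layer gets 48, so the layers that attend over the full context do less work per token.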

ZAYA1‑8B Compressed Convolutional Attention (CCA) – Zyphra’s ZAYA1‑8B combines Grouped Query Attention (4:1) with CCA, which compresses Q, K, and V into a latent space and applies convolutional mixing to the compressed Q/K before attention scoring. Experiments in the CCA paper (Oct 2025) show CCA outperforms Multi‑Head Latent Attention (MLA) under the same compression settings. The model also uses a highly sparse MoE where each token activates a single expert.
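The order of operations described above (compress, mix Q and K with a convolution, then score attention in the latent space) can be sketched in plain Python. This is a single‑head toy under stated assumptions, not the formulation from the CCA paper: the sizes, the two‑tap causal kernel, and the random weights are all hypothetical, and grouped‑query sharing is omitted.

```python
import math
import random

random.seed(0)

SEQ, HIDDEN, LATENT = 5, 8, 4  # toy sizes; LATENT < HIDDEN is the compression


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def project(xs, weight):
    """Apply an (out x in) weight matrix to every token vector in xs."""
    return [[sum(w * v for w, v in zip(row, x)) for row in weight] for x in xs]


def causal_conv(xs, taps):
    """Depthwise causal convolution over the sequence axis.

    taps[0] weights the current token, taps[1] the previous one, and so on.
    """
    out = []
    for t in range(len(xs)):
        mixed = [0.0] * len(xs[0])
        for k, tap in enumerate(taps):
            if t - k >= 0:
                mixed = [m + tap * v for m, v in zip(mixed, xs[t - k])]
        out.append(mixed)
    return out


def causal_attention(q, k, v):
    """Softmax attention where token t only sees tokens 0..t."""
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs, all_weights = [], []
    for t in range(len(q)):
        scores = [scale * sum(a * b for a, b in zip(q[t], k[s])) for s in range(t + 1)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        all_weights.append(weights)
        outputs.append([sum(w * v[s][d] for s, w in enumerate(weights))
                        for d in range(len(v[0]))])
    return outputs, all_weights


x = rand_matrix(SEQ, HIDDEN)  # token hidden states
w_q, w_k, w_v = (rand_matrix(LATENT, HIDDEN) for _ in range(3))
w_out = rand_matrix(HIDDEN, LATENT)

q, k, v = project(x, w_q), project(x, w_k), project(x, w_v)    # compress to latent
q, k = causal_conv(q, [0.7, 0.3]), causal_conv(k, [0.7, 0.3])  # mix compressed Q and K
latent_out, attn = causal_attention(q, k, v)                   # attend in latent space
y = project(latent_out, w_out)                                 # expand back to HIDDEN

print(f"cached K/V width per token: {LATENT} instead of {HIDDEN}")
print(f"output shape: {len(y)} x {len(y[0])}")
```

Because scoring happens on the compressed vectors, both the cached K/V width and the cost of the attention products shrink with the latent size in this sketch.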

DeepSeek V4 – Manifold‑Constrained Hyper‑Connections (mHC) – DeepSeek V4 replaces the single residual stream with multiple parallel streams that exchange information via learned linear mappings. mHC constrains these mappings to be doubly‑stochastic matrices, ensuring stable redistribution of residual information. In a 27 B OLMo‑style experiment, FLOPs increased by only ~0.02 G per token while performance improved modestly. Training overhead is limited (≈6.7 % extra time) when combined with fusion, recomputation, and pipeline scheduling.
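To see why a doubly‑stochastic mixing matrix redistributes residual information without inflating or losing it, consider the sketch below. It uses Sinkhorn‑style alternating normalization, a standard way to obtain such a matrix; whether this matches DeepSeek's exact procedure is an assumption here, and the scores and stream values are made up.

```python
import math


def sinkhorn(logits, iters=100):
    """Turn a square matrix of real scores into an approximately
    doubly-stochastic matrix by alternating row and column normalization."""
    m = [[math.exp(v) for v in row] for row in logits]
    n = len(m)
    for _ in range(iters):
        m = [[v / sum(row) for v in row] for row in m]  # rows sum to 1
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]  # columns sum to 1
    return m


def mix_streams(mixing, streams):
    """new_stream[i] = sum_j mixing[i][j] * streams[j]."""
    width = len(streams[0])
    return [[sum(mixing[i][j] * streams[j][d] for j in range(len(streams)))
             for d in range(width)] for i in range(len(mixing))]


# Hypothetical learned scores for 3 residual streams of width 2.
logits = [[0.2, -1.0, 0.5], [1.5, 0.1, -0.3], [-0.7, 0.9, 0.0]]
streams = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]

mixing = sinkhorn(logits)
mixed = mix_streams(mixing, streams)

total_before = [sum(s[d] for s in streams) for d in range(2)]
total_after = [sum(s[d] for s in mixed) for d in range(2)]
print("row sums:", [round(sum(row), 4) for row in mixing])
print("total before:", total_before, "after:", [round(t, 4) for t in total_after])
```

Unit column sums keep the sum across streams unchanged after mixing, and unit row sums make each new stream a convex combination of the old ones, so no stream can grow without bound through repeated mixing.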

DeepSeek V4 – Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) – To tackle the quadratic cost of long contexts, DeepSeek V4 interleaves CSA layers (light compression with a sparse selector) and HCA layers (aggressive compression of 128 tokens into a single KV entry). Both retain a local Sliding‑Window branch for recent tokens. Relative to DeepSeek V3.2, DeepSeek V4‑Pro needs 27 % of the per‑token inference FLOPs and 10 % of the KV cache; the Flash variant goes further, to 10 % of the FLOPs and 7 % of the KV cache, for 1 M‑token contexts.
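A back‑of‑the‑envelope count shows how much a 128:1 compression shrinks the number of cached entries in a single HCA‑style layer. The 128‑token block comes from the text; the sliding‑window length and the `kv_entries` helper are hypothetical, and the count ignores how the compression itself is computed.

```python
def kv_entries(seq_len, block_size, window):
    """Count KV entries when each complete block of `block_size` tokens is
    compressed into one entry and the latest `window` tokens stay uncompressed."""
    compressed = seq_len // block_size  # one entry per complete block
    local = min(seq_len, window)        # sliding-window branch
    return compressed + local


SEQ_LEN = 1_000_000
BLOCK = 128   # HCA compression ratio quoted in the text
WINDOW = 128  # hypothetical sliding-window length

entries = kv_entries(SEQ_LEN, BLOCK, WINDOW)
print(f"uncompressed entries: {SEQ_LEN}")
print(f"HCA-style entries:    {entries}")
print(f"ratio: {entries / SEQ_LEN:.4f}")
```

This per‑layer count is not expected to reproduce the model‑level 10 % and 7 % figures, which also reflect the interleaved CSA layers and other details not modeled here.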

Overall Trends – The 2026 generation of open‑weight LLMs focuses on structural optimizations (KV sharing, per‑layer embeddings, layer‑wise attention budgeting, compressed latent attention, and manifold‑constrained residual streams) to make long‑context inference affordable without simply shrinking model size. These advances increase implementation complexity but are essential for scaling context windows efficiently.

Figure: Compressed Convolutional Attention diagram

Figure: Hyper‑Connections vs baseline performance

Figure: Transformer block with Hyper‑Connections


