How New LLM Architectures Like Gemma 4 and DeepSeek V4 Cut Long‑Context Costs

The article surveys four recent open‑weight LLM releases (Gemma 4, Laguna XS.2, ZAYA1‑8B and DeepSeek V4) and details how KV‑cache sharing, per‑layer embeddings, layer‑wise attention budgeting, compressed convolutional attention and manifold‑constrained hyper‑connections reduce memory and compute for very long contexts while aiming to preserve model quality.

Machine Learning Algorithms & Natural Language Processing

Gemma 4: Cross‑Layer KV Sharing and Per‑Layer Embeddings

Gemma 4, released by Google, comes in three variants (E2B, E4B and 31B). The smallest variants introduce two efficiency‑oriented mechanisms. First, KV sharing reuses Key‑Value tensors across Transformer layers: in Gemma 4 E2B only the first 15 of 35 layers compute their own KV projections, while the remaining 20 layers reuse the KV tensors of the most recent earlier layer of the same attention type. This reduces the KV cache size by roughly 50 %; for a 128K context at bfloat16 precision it saves about 2.7 GB of VRAM on E2B and about 6 GB on E4B.
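The sharing rule can be sketched as a simple lookup from each layer to the layer whose K/V tensors it reads. This is a minimal illustration, not Gemma's implementation: the function name `kv_source_map` is hypothetical, and the interleaving of sliding‑window and global layers (every fifth layer global) is an assumption made only so the example runs; the source gives just the 15‑of‑35 split.

```python
def kv_source_map(layer_types, num_kv_layers):
    """Return, for each layer, the index of the layer whose K/V tensors it reads."""
    last_owner = {}  # attention type -> latest layer that computed its own K/V
    sources = []
    for idx, kind in enumerate(layer_types):
        if idx < num_kv_layers:
            last_owner[kind] = idx            # this layer projects and caches its own K/V
            sources.append(idx)
        else:
            sources.append(last_owner[kind])  # reuse the cached K/V of the same type
    return sources


# ASSUMPTION: every 5th layer is global, the rest sliding-window (not from the source).
layer_types = ["global" if (i + 1) % 5 == 0 else "sliding" for i in range(35)]
sources = kv_source_map(layer_types, num_kv_layers=15)
owners = sorted(set(sources))
print(len(owners), "of", len(layer_types), "layers hold their own KV cache")
```

Under the assumed pattern, layers 15 to 34 all read from layer 13 (sliding‑window) or layer 14 (global), so only 15 per‑layer caches exist. The actual memory saving depends on how large each kind of cache is, which this sketch does not model.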

The second mechanism, Per‑Layer Embeddings (PLE), decouples the effective parameter count from the total parameter count. Gemma 4 E2B is labeled as having 2.3 B effective parameters but contains 5.1 B parameters once the embedding tables are included; similarly, E4B’s effective size is 4.5 B versus ~8 B total. PLE stores a small token‑specific embedding slice for each layer and adds it as an extra residual after the Feed‑Forward block, which increases expressiveness without enlarging the main Transformer stack.
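A toy version of the extra residual might look like the following. It is a sketch under strong simplifying assumptions: the per‑layer slice is taken to have the same width as the hidden state, any projection or gating the real model applies is omitted, and `block_with_ple`, `ple_tables` and `ffn` are hypothetical names.

```python
def block_with_ple(hidden, token_id, layer_idx, ple_tables, ffn):
    """Apply the FFN residual, then add this layer's embedding slice for the token."""
    after_ffn = [h + f for h, f in zip(hidden, ffn(hidden))]  # usual residual around the FFN
    ple_slice = ple_tables[layer_idx][token_id]               # lookup only, no matmul
    return [h + e for h, e in zip(after_ffn, ple_slice)]      # extra PLE residual


# Tiny example: 1 layer, hidden width 2, a single known token id 3.
ple_tables = [{3: [0.25, -0.25]}]         # ple_tables[layer][token_id] -> slice
halve = lambda h: [0.5 * x for x in h]    # stand-in for a real feed-forward block
print(block_with_ple([1.0, 2.0], 3, 0, ple_tables, halve))
```

Because the slice is fetched by table lookup rather than computed, the tables add parameters but almost no compute per token, which is the sense in which the "effective" count is smaller than the total.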

Gemma 4 architecture diagram

Laguna XS.2: Layer‑wise Attention Budgeting

Poolside’s Laguna XS.2 (40 layers) allocates a different attention budget to each layer. Thirty layers use Sliding‑Window Attention with a 512‑token window, while ten layers employ Global (Full) Attention. The model also varies the number of Query heads per layer: Sliding‑Window layers receive more Query heads and Global layers receive fewer, with the KV‑head count fixed at eight throughout. This “layer‑wise head budgeting” spends heads where they are cheap, since a Sliding‑Window head attends to at most 512 tokens, and economises where per‑head cost grows with context length, reducing overall FLOPs while preserving long‑range access.
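The layer layout can be written down as a small configuration table, which also makes the cache saving easy to estimate. In this sketch the 40 layers, the 30/10 split, the 512‑token window and the eight KV heads come from the text; the query‑head counts (32 and 16) and the placement of global layers (every fourth layer) are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LayerSpec:
    kind: str               # "sliding" or "global"
    window: Optional[int]   # None means full attention
    query_heads: int        # ASSUMPTION: illustrative values, not from the source
    kv_heads: int = 8       # fixed at eight per the text


def build_layers(num_layers=40, global_every=4):
    # ASSUMPTION: global layers are evenly interleaved (every 4th layer).
    layers = []
    for i in range(num_layers):
        if (i + 1) % global_every == 0:
            layers.append(LayerSpec("global", None, query_heads=16))
        else:
            layers.append(LayerSpec("sliding", 512, query_heads=32))
    return layers


def cached_tokens(spec, seq_len):
    """Number of token positions this layer must keep in its KV cache."""
    return seq_len if spec.window is None else min(seq_len, spec.window)


layers = build_layers()
seq_len = 128 * 1024
hybrid = sum(cached_tokens(s, seq_len) for s in layers)
full = seq_len * len(layers)
print(f"cached positions: {hybrid} vs {full} ({hybrid / full:.1%})")
```

With the KV‑head count equal in every layer, cached positions are proportional to cache memory, so at a 128K context this layout holds roughly a quarter of what 40 full‑attention layers would.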

Laguna XS.2 architecture diagram

ZAYA1‑8B: Compressed Convolutional Attention (CCA)

ZAYA1‑8B, released by Zyphra, combines 4:1 Grouped‑Query Attention (GQA) with Compressed Convolutional Attention (CCA). Unlike Multi‑Head Latent Attention (MLA), which compresses only the KV representations, CCA compresses Q, K and V together and performs attention directly in the compressed latent space. A convolutional mixing step is applied to the compressed Q and K tensors before the scores are computed, which mitigates the loss of expressiveness caused by compression. Experiments in the original CCA paper (Oct 2025) show that CCA outperforms MLA under identical compression settings.
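The idea of attending inside the compressed space can be illustrated with a single‑head toy in plain Python. This is not Zyphra's formulation: the down‑projection matrices, the two‑tap causal depthwise convolution over the sequence axis and all function names are assumptions, and the real CCA layer has components this sketch leaves out.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def causal_conv(seq, kernel):
    """Depthwise causal 1-D convolution over the sequence axis.

    kernel[0] weights the current token, kernel[1] the previous one, and so on.
    """
    out = []
    for t in range(len(seq)):
        row = [0.0] * len(seq[0])
        for k, w in enumerate(kernel):
            if t - k >= 0:
                row = [r + w * v for r, v in zip(row, seq[t - k])]
        out.append(row)
    return out


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def compressed_attention(x, w_q, w_k, w_v, kernel):
    """Causal attention computed entirely in the compressed (latent) width."""
    q = causal_conv(matmul(x, w_q), kernel)  # compress, then mix along the sequence
    k = causal_conv(matmul(x, w_k), kernel)
    v = matmul(x, w_v)                       # values are compressed but not convolved here
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for t in range(len(x)):
        scores = [scale * sum(a * b for a, b in zip(q[t], k[s])) for s in range(t + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[s][d] for s, w in enumerate(weights))
                    for d in range(len(v[0]))])
    return out  # still latent-width; an up-projection would follow in a full layer
```

Both the score computation and the cached K/V live at the latent width, so compute and cache shrink together, whereas a scheme that compresses only K/V still computes scores at full width.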

ZAYA1‑8B Transformer block with CCA

DeepSeek V4: Manifold‑Constrained Hyper‑Connections and CSA/HCA

DeepSeek V4 (2025‑2026) introduces two major architectural upgrades. The first, mHC (Manifold‑Constrained Hyper‑Connections), expands the residual stream into multiple parallel streams and mixes them via learned mappings constrained to be doubly‑stochastic matrices. This stabilises information flow without noticeably increasing FLOPs (training a 27 B model adds only ~6.7 % overhead when combined with recomputation and pipeline scheduling). The second upgrade replaces standard attention and its full KV cache with a hybrid of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA applies light compression together with a sparse selector, while HCA aggressively compresses every 128 tokens into a single KV entry and then performs dense attention over the compressed sequence. Both mechanisms retain a local Sliding‑Window branch for recent tokens.
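To make the doubly‑stochastic constraint concrete, the sketch below turns an arbitrary matrix of logits into an approximately doubly‑stochastic one with Sinkhorn‑style alternating normalisation and uses it to mix residual streams. It is only an illustration of the constraint: the function names, the iteration count and the use of plain lists are assumptions, and the sketch does not reproduce how DeepSeek parameterises or trains the mapping.

```python
import math


def sinkhorn(logits, iters=50):
    """Alternately normalise rows and columns of exp(logits) until both sum to ~1."""
    m = [[math.exp(v) for v in row] for row in logits]
    for _ in range(iters):
        row_sums = [sum(row) for row in m]
        m = [[v / s for v in row] for row, s in zip(m, row_sums)]
        col_sums = [sum(col) for col in zip(*m)]
        m = [[v / c for v, c in zip(row, col_sums)] for row in m]
    return m


def mix_streams(streams, mixing):
    """New stream i is the mixing-weighted combination of all input streams."""
    n, width = len(streams), len(streams[0])
    return [[sum(mixing[i][j] * streams[j][d] for j in range(n)) for d in range(width)]
            for i in range(n)]
```

Rows summing to one make each output stream a weighted average of the inputs, and columns summing to one keep the total across streams unchanged, so repeated mixing over many layers neither amplifies nor drains the residual signal.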

According to the DeepSeek V4 technical report, the hybrid design reduces per‑token inference FLOPs to 27 % of the DeepSeek V3.2 baseline and shrinks the KV cache to 10 % of its size for 1 M‑token contexts; the “Flash” variant pushes these numbers to 10 % FLOPs and 7 % KV cache.
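The 128‑to‑1 compression in the HCA branch described earlier can be pictured with a small counting sketch. Mean pooling is used here only as a stand‑in for whatever learned compressor the model actually applies, the sliding‑window size is a hypothetical parameter, and 1 M is taken as 1,048,576 tokens; the sketch counts cache entries and does not attempt to reproduce the percentages reported above.

```python
import math


def compress_blocks(vectors, block=128):
    """Mean-pool each consecutive group of `block` vectors into a single vector."""
    out = []
    for start in range(0, len(vectors), block):
        chunk = vectors[start:start + block]
        out.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return out


def cache_entries(seq_len, block=128, window=0):
    """Compressed entries plus recent tokens kept uncompressed by the local branch."""
    return math.ceil(seq_len / block) + min(seq_len, window)


print(cache_entries(1_048_576))               # compressed entries only
print(cache_entries(1_048_576, window=1024))  # with a hypothetical 1024-token window
```

Dense attention over a few thousand compressed entries is far cheaper than over a million raw tokens, which is why the recent‑token window is kept alongside to preserve fine‑grained local detail.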

DeepSeek V4 architecture overview

Overall Trends in 2026 LLM Design

All four models illustrate a clear industry shift: rather than shrinking overall model size, researchers are introducing structural optimisations, most of which target the attention compute and KV‑cache memory that grow with context length. Techniques such as cross‑layer KV sharing, per‑layer embeddings, layer‑wise head budgeting, compressed convolutional attention, and manifold‑constrained hyper‑connections together enable 128K‑plus contexts with far lower memory and compute footprints, at the cost of increased implementation complexity.

These advances suggest that future LLM research will continue to focus on modular, memory‑efficient attention variants while preserving the expressive power of the core Transformer block.
