Jamba: How AI21 Labs Merged Mamba and Transformer for 3× Faster 128k Contexts
Jamba, a hybrid Mamba‑Transformer model from AI21 Labs, combines state‑space and attention layers with a Mixture‑of‑Experts component to deliver up to three times the throughput of comparably sized LLMs such as Mixtral 8x7B on 128k‑token contexts, while maintaining high output quality and low memory usage.
AI21 Labs introduced Jamba, a 52‑billion‑parameter large language model that fuses the Mamba state‑space architecture with traditional Transformer layers and a Mixture‑of‑Experts (MoE) component, aiming to achieve both high model quality and efficient inference.
Mamba and Transformer Fusion
The original Mamba model, proposed by researchers at CMU and Princeton, addresses the memory and speed limitations that Transformers face as context length grows, but its output quality lags behind comparable Transformers, especially on retrieval‑oriented tasks. Jamba resolves this by integrating Mamba layers with Transformer layers in a novel "blocks‑and‑layers" scheme.
Each layer in a Jamba block is either an attention (Transformer) layer or a Mamba layer, followed by a feed‑forward MLP. The design maintains a ratio of one attention layer for every eight layers, balancing the quality contribution of attention against the memory and throughput efficiency of Mamba.
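The following is an illustrative sketch, not AI21's implementation: it shows how a Jamba‑style block interleaves a single attention layer among Mamba layers, with every layer followed by an MLP. The MambaLayer class is a toy stand‑in for the real selective state‑space layer, and all dimensions are toy values.

```python
# Illustrative sketch only (not AI21's implementation) of a Jamba-style block:
# one attention layer interleaved among Mamba layers (1 in 8 here), and every
# layer followed by a feed-forward MLP.
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
    def forward(self, x):
        return x + self.net(x)          # residual feed-forward

class AttentionLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return x + out                  # residual self-attention

class MambaLayer(nn.Module):
    # Toy stand-in for a selective state-space (Mamba) layer; the real layer
    # keeps a fixed-size recurrent state, which is what keeps memory flat as
    # the context grows.
    def __init__(self, d):
        super().__init__()
        self.mix = nn.Linear(d, d)
    def forward(self, x):
        return x + torch.sigmoid(self.mix(x)) * x

def jamba_block(d, n_layers=8, attn_every=8):
    """Build n_layers mixing layers, one attention layer per `attn_every`,
    each followed by an MLP (mirroring the 1-in-8 ratio described above)."""
    layers = []
    for i in range(n_layers):
        mixer = AttentionLayer(d) if i % attn_every == attn_every - 1 else MambaLayer(d)
        layers += [mixer, MLP(d)]
    return nn.Sequential(*layers)

block = jamba_block(d=512)
x = torch.randn(2, 16, 512)             # (batch, sequence, hidden)
print(block(x).shape)                    # torch.Size([2, 16, 512])
```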
To increase total parameter count without inflating the number of parameters active at inference time, Jamba incorporates Mixture‑of‑Experts (MoE) layers. During inference, only 12 billion of the 52 billion parameters are active, delivering higher efficiency while preserving model capacity.
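A minimal sketch of top‑k expert routing illustrates why the active parameter count can be a fraction of the total: each token is sent only to its highest‑scoring experts, so only those experts' weights participate in that token's forward pass. The expert count and top‑2 routing below are illustrative assumptions, not Jamba's published configuration.

```python
# Illustrative top-2 expert routing (not AI21's code): each token is routed to
# its two highest-scoring experts, so only those experts' weights are used for
# that token. This is how a large-total-parameter MoE model can run with a much
# smaller number of "active" parameters per token.
import torch
import torch.nn as nn

class MoEMLP(nn.Module):
    def __init__(self, d, n_experts=16, top_k=2):   # counts are assumptions
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, d)
        scores = self.router(x)                    # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)          # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e              # tokens whose k-th choice is expert e
                if mask.any():                     # run expert e only on its tokens
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

moe = MoEMLP(d=256)
tokens = torch.randn(8, 256)
print(moe(tokens).shape)                           # torch.Size([8, 256])
```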
"Absolute big news." – Comment from the original Mamba authors.
Throughput and Efficiency Gains
Preliminary benchmarks show that Jamba's throughput on 128k‑token contexts is roughly three times that of Mixtral 8x7B, reaching about 1,500 tokens per second compared with roughly 500 tokens per second for Mixtral. The model supports context lengths up to 256k tokens, and a single 80 GB A100 GPU can handle 140k tokens, surpassing Mixtral (64k) and Llama‑2 70B (16k).
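Much of this memory headroom comes from shrinking the attention KV cache, which in a pure Transformer grows linearly with context length, whereas a Mamba layer's recurrent state has a fixed size. The back‑of‑the‑envelope sketch below uses assumed dimensions (not Jamba's or Mixtral's published configurations) purely to illustrate the scaling.

```python
# Back-of-the-envelope sketch with assumed dimensions: the attention KV cache
# stores a key and a value vector per token, per layer, per KV head, so it
# grows linearly with context length. A Mamba layer instead carries a
# fixed-size recurrent state, the same size at 1k or 256k tokens, so keeping
# attention in only a fraction of the layers shrinks long-context memory.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_val=2):
    # keys + values (factor of 2), stored in 16-bit precision by default
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_val

# Hypothetical dense-Transformer-like settings, for illustration only.
full_attention = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                                context_len=128_000)
# Same settings, but attention in only 1 of every 8 layers (Jamba-style ratio).
hybrid = kv_cache_bytes(n_layers=4, n_kv_heads=32, head_dim=128,
                        context_len=128_000)

print(f"all-attention KV cache: {full_attention / 1e9:.1f} GB")   # ~67.1 GB
print(f"1-in-8 attention KV cache: {hybrid / 1e9:.1f} GB")        # ~8.4 GB
```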
In terms of output quality, Jamba attains state‑of‑the‑art results on three of the four evaluated reasoning benchmarks and remains competitive on GSM8K, overall matching Mixtral 8x7B's performance.
Future work includes further MoE parallelism and faster Mamba implementations, which are expected to boost performance even more.
Jamba is now publicly available on Hugging Face under the Apache‑2.0 license, with an instruction‑tuned version slated for release on the AI21 Labs platform.
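For readers who want to try it, the following is a minimal usage sketch following standard Hugging Face transformers conventions for the ai21labs/Jamba-v0.1 checkpoint; exact requirements (a transformers version with Jamba support, optional optimized Mamba kernels such as mamba-ssm and causal-conv1d, and the GPU memory or quantization needed for a 52‑billion‑parameter model) may differ, so consult the model card linked below.

```python
# Minimal usage sketch following standard Hugging Face transformers
# conventions for ai21labs/Jamba-v0.1; version and hardware requirements may
# differ from what is shown here (see the model card).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",     # use the checkpoint's native precision
    device_map="auto",      # spread weights across available GPUs (requires accelerate)
)

inputs = tokenizer("Hybrid Mamba-Transformer models are interesting because",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```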
Reference links:
https://huggingface.co/ai21labs/Jamba-v0.1
https://www.ai21.com/blog/announcing-jamba
https://www.ai21.com/jamba
https://twitter.com/AI21Labs/status/1773350888427438424?s=20
https://twitter.com/tri_dao/status/1773418926518734957?s=20