How REFRAG Cuts LLM Decoding Time by 30×: A New Efficient RAG Framework
REFRAG (REpresentation For RAG) is a decoding framework that compresses, senses, and expands retrieved context using precomputed chunk embeddings, achieving up to 30.85× faster first‑token generation and a 16× larger effective context window without sacrificing perplexity, as validated across diverse long‑context tasks.
Background
In Retrieval‑Augmented Generation (RAG), the LLM’s context is formed by concatenating retrieved passages, most of which are irrelevant to the user query and to one another. Because unrelated passages barely attend to each other, the attention matrix takes on a block‑diagonal pattern unlike that of standard generation, and much of the quadratic decoding computation is wasted.
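To make that structure concrete, here is a minimal NumPy sketch (illustrative only, not from the paper) of the block‑diagonal mask implied by mutually independent passages: each passage attends within itself, while query tokens attend over everything.

```python
import numpy as np

def block_diagonal_mask(num_passages: int, passage_len: int, query_len: int) -> np.ndarray:
    """Illustrative RAG attention mask: retrieved passages are mutually
    independent, so cross-passage attention contributes little."""
    ctx_len = num_passages * passage_len
    total = ctx_len + query_len
    mask = np.zeros((total, total), dtype=bool)
    for p in range(num_passages):  # each passage attends only within itself
        s = p * passage_len
        mask[s:s + passage_len, s:s + passage_len] = True
    mask[ctx_len:, :] = True  # query tokens attend to all positions
    return mask

mask = block_diagonal_mask(num_passages=4, passage_len=8, query_len=4)
print(f"useful attention entries: {mask.mean():.1%} of the full matrix")  # ~30.9%
```

Most of the full quadratic attention matrix is dead weight; REFRAG's speedups come from never computing it.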
REFRAG Overview
Researchers from Meta Superintelligence Labs, NUS, and Rice University propose REFRAG (REpresentation For RAG), an efficient decoding framework that compresses, senses, and expands the RAG context. By exploiting this sparse attention structure, REFRAG accelerates time‑to‑first‑token (TTFT) by up to 30.85× (3.75× over previous work) without increasing perplexity, and expands the effective context window by 16×.
Key Innovations
Chunk‑Embedding Decoding: Instead of feeding raw tokens, REFRAG uses pre‑computed, compressed chunk embeddings as an approximate representation of the retrieved context and feeds them directly to the decoder.
Three Advantages
Shorter decoder input length: Replacing tokens with chunk embeddings reduces sequence length and improves token‑allocation efficiency.
Reuse of retrieval computation: Chunk embeddings generated during retrieval can be reused, eliminating redundant encoding.
Reduced attention complexity: Attention cost drops from quadratic in token count to quadratic in chunk count (see the sketch after this list).
Compress‑Anywhere: REFRAG can compress token blocks at any position while preserving the autoregressive property, supporting multi‑turn dialogue and agent‑style interactions.
Lightweight RL Strategy: A reinforcement‑learning policy decides when to use full tokens versus compressed chunk embeddings, minimizing reliance on expensive token‑level encoding.
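As a back‑of‑the‑envelope illustration of the reduced attention complexity (my numbers; chunk size k = 16 and a 4096‑token context are assumptions, not necessarily the paper's settings): replacing n context tokens with n/k chunk embeddings shrinks the quadratic attention term by roughly k².

```python
# Rough cost comparison: attention over raw tokens vs. chunk embeddings.
n_tokens = 4096        # retrieved-context length in tokens (assumed)
k = 16                 # tokens per chunk (assumed compression rate)
n_chunks = n_tokens // k

token_cost = n_tokens ** 2   # attention pairs, quadratic in token count
chunk_cost = n_chunks ** 2   # attention pairs, quadratic in chunk count

print(f"token-level pairs: {token_cost:,}")       # 16,777,216
print(f"chunk-level pairs: {chunk_cost:,}")       # 65,536
print(f"reduction: {token_cost // chunk_cost}x")  # k**2 = 256x
```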
Model Architecture
REFRAG consists of a decoder (e.g., LLaMA) and an encoder (e.g., RoBERTa). The input is split into a small number of query tokens followed by many context tokens. Context tokens are partitioned into equal‑size chunks, each encoded into a chunk embedding, projected to the decoder’s token‑embedding space, and then combined with the query tokens for answer generation.
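A minimal PyTorch sketch of that flow; the module names, dimensions, and mean‑pooling step are illustrative stand‑ins (a placeholder encoder instead of RoBERTa, with the output destined for a real decoder such as LLaMA):

```python
import torch
import torch.nn as nn

class RefragStyleInput(nn.Module):
    """Builds the decoder input described above: compressed chunk
    embeddings for the context, raw token embeddings for the query.
    All sizes are illustrative, not the paper's."""

    def __init__(self, enc_dim=768, dec_dim=4096, chunk_size=16, vocab=32000):
        super().__init__()
        self.chunk_size = chunk_size
        self.enc_embed = nn.Embedding(vocab, enc_dim)
        self.encoder = nn.TransformerEncoderLayer(enc_dim, nhead=12, batch_first=True)
        self.proj = nn.Linear(enc_dim, dec_dim)        # into decoder embedding space
        self.tok_embed = nn.Embedding(vocab, dec_dim)  # decoder token embeddings

    def forward(self, context_ids, query_ids):
        B, L = context_ids.shape  # L must be divisible by chunk_size
        k = self.chunk_size
        # 1) Split the context into fixed-size chunks and encode each one.
        chunks = self.enc_embed(context_ids).view(B * (L // k), k, -1)
        chunk_vecs = self.encoder(chunks).mean(dim=1)  # one vector per chunk
        # 2) Project chunk embeddings into the decoder's embedding space.
        chunk_embeds = self.proj(chunk_vecs).view(B, L // k, -1)
        # 3) Prepend them to the query embeddings; a decoder like LLaMA
        #    would consume this much shorter sequence.
        return torch.cat([chunk_embeds, self.tok_embed(query_ids)], dim=1)
```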
Performance Gains
Three metrics are evaluated: TTFT, TTIT (time‑to‑iterative‑token), and throughput. REFRAG achieves a 16.53× TTFT speedup with cache (8.59× without) and up to 6.78× higher throughput than LLaMA, while maintaining or improving perplexity.
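For readers who want to reproduce these numbers on their own stack, here is one way (a hypothetical harness, not the paper's code) to measure all three metrics around any streaming generation function:

```python
import time

def measure(stream_generate, prompt, max_new_tokens=128):
    """`stream_generate` is a hypothetical function yielding one token
    at a time; substitute your model's streaming API."""
    t0 = time.perf_counter()
    stamps = [time.perf_counter()
              for _ in stream_generate(prompt, max_new_tokens)]
    ttft = stamps[0] - t0                                      # time to first token
    ttit = (stamps[-1] - stamps[0]) / max(len(stamps) - 1, 1)  # avg inter-token gap
    throughput = len(stamps) / (stamps[-1] - t0)               # tokens per second
    return ttft, ttit, throughput
```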
Training Methodology
REFRAG uses a Continual Pre‑training (CPT) recipe that combines a reconstruction task with curriculum learning. The reconstruction task forces the encoder to compress a fixed number of tokens while the decoder learns to reconstruct them; the decoder is frozen during this phase. Curriculum learning gradually increases the number of chunks to reconstruct, easing optimization.
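A schematic of the reconstruction phase (simplified; the optimizer settings and the HF‑style `inputs_embeds` decoder interface are my assumptions):

```python
import torch
import torch.nn.functional as F

def reconstruction_phase(encoder, projector, decoder, batches):
    """Sketch: the frozen decoder must rebuild the original tokens from
    compressed chunk embeddings, so gradients flow only into the encoder
    and the projection layer."""
    decoder.requires_grad_(False)              # decoder frozen in this phase
    params = list(encoder.parameters()) + list(projector.parameters())
    opt = torch.optim.AdamW(params, lr=1e-4)   # assumed hyperparameters
    for chunk_ids in batches:                  # token ids of the chunks
        emb = projector(encoder(chunk_ids))    # compress, then project
        logits = decoder(inputs_embeds=emb)    # assumed HF-style interface
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               chunk_ids.reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

# Curriculum: grow the number of chunks to reconstruct over time,
# e.g. for n_chunks in (1, 2, 4, 8): reconstruction_phase(...)
```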
Selective Compression
An RL policy evaluates the perplexity of the next‑segment prediction and decides which chunks to keep uncompressed. This preserves important information and improves answer quality while still allowing compression elsewhere.
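The paper learns this selection with RL; the greedy perplexity heuristic below is only a stand‑in that conveys the interface (all names are hypothetical):

```python
import torch

@torch.no_grad()
def select_chunks_to_expand(ppl_when_compressed: torch.Tensor, budget: int):
    """ppl_when_compressed[i] = next-segment perplexity when chunk i is
    compressed. High perplexity means the chunk carries information the
    decoder needs, so it is kept as raw tokens."""
    expand = torch.topk(ppl_when_compressed, k=budget).indices
    keep_raw = torch.zeros_like(ppl_when_compressed, dtype=torch.bool)
    keep_raw[expand] = True
    return keep_raw  # True = keep as raw tokens, False = compress

print(select_chunks_to_expand(torch.tensor([3.1, 9.7, 2.4, 6.0]), budget=2))
# tensor([False,  True, False,  True])
```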
Experimental Results
Training data: 20B tokens from the Book and arXiv domains of SlimPajama. Evaluation uses SlimPajama hold‑outs, PG‑19, and Proof‑Pile. Baselines include LLaMA‑2‑7B variants, CEPE, REPLUG, and full‑context LLaMA. REFRAG consistently outperforms these baselines at the standard context length (s = 2048) and at extended lengths (up to 16,384 tokens), showing strong extrapolation ability.
Selective Compression Ablation
RL‑based selective compression outperforms perplexity‑based heuristics and random selection across all compression rates.
RAG and Multi‑turn Dialogue
Fine‑tuned REFRAG models achieve higher accuracy than LLaMA at equal latency or with an equal number of retrieved passages, especially in weak‑retriever scenarios and in long‑context dialogue, where LLaMA's 4k‑token window truncates the conversation history.
Related Work
Prior efficient long‑context LLMs modify attention (e.g., compressive attention) or prompt strategies (e.g., attention sinks). REFRAG is complementary, offering pre‑computed chunk embeddings that can be used anywhere in the prompt.
Conclusion
REFRAG demonstrates that exploiting the inherent sparsity of RAG contexts and the block‑diagonal attention pattern enables dramatic reductions in latency and memory usage while preserving or improving perplexity. The framework provides a practical, scalable solution for latency‑sensitive, knowledge‑intensive applications of large language models.