How REFRAG Cuts LLM Decoding Time by 30×: A New Efficient RAG Framework

REFRAG (REpresentation For RAG) introduces a novel decoding framework that compresses, senses, and expands context using precomputed chunk embeddings, achieving up to 30.85× faster first-token generation and 16× larger context windows without sacrificing perplexity, as validated across diverse long‑context tasks.

Instant Consumer Technology Team

Background

In Retrieval‑Augmented Generation (RAG), the LLM's context is formed by concatenating retrieved passages, most of which are irrelevant to the user query. This produces a block‑diagonal attention pattern that differs from standard generation tasks, so much of the decoder's attention computation is wasted.
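As a toy illustration of that block‑diagonal structure, the sketch below builds an attention mask for a prompt with two retrieved passages followed by a short query. The function name and sizes are illustrative, not from the paper:

```python
import numpy as np

def rag_attention_mask(chunk_lens, query_len):
    """Toy attention mask for a RAG prompt: each retrieved passage
    attends only within itself (block-diagonal), while the query
    tokens attend to everything. 1 = attention allowed."""
    n_ctx = sum(chunk_lens)
    n = n_ctx + query_len
    mask = np.zeros((n, n), dtype=int)
    start = 0
    for length in chunk_lens:          # block-diagonal part for passages
        mask[start:start + length, start:start + length] = 1
        start += length
    mask[n_ctx:, :] = 1                # query rows attend to all tokens
    return mask

mask = rag_attention_mask([3, 2], query_len=2)
# Cross-passage entries are zero: passage 1 never attends to passage 2.
```

The zeros off the diagonal are exactly the computations REFRAG avoids paying for.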

REFRAG Overview

Researchers from Meta Superintelligence Labs, NUS, and Rice University propose REFRAG (REpresentation For RAG), an efficient decoding framework that compresses, senses, and expands the RAG context. By exploiting this sparse attention structure, REFRAG accelerates time‑to‑first‑token (TTFT) by up to 30.85× (3.75× over prior work) without increasing perplexity, and expands the effective context window by 16×.

REFRAG main design

Key Innovations

Chunk‑Embedding Decoding : Instead of feeding raw tokens, REFRAG uses pre‑computed compressed chunk embeddings as an approximate representation of the retrieved context and feeds them directly to the decoder.

Three Advantages

Shorter decoder input length : Replacing tokens with chunk embeddings reduces sequence length and improves token allocation efficiency.

Reuse of retrieval computation : Chunk embeddings generated during retrieval can be reused, eliminating redundant encoding.

Reduced attention complexity : Attention cost changes from quadratic in token count to quadratic in chunk count.
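The third advantage can be checked with simple arithmetic. The chunk size of 16 below matches the 16× expansion figure quoted above, but treat the numbers as illustrative:

```python
def attention_cost(seq_len):
    """Quadratic self-attention cost (unit: attention-score computations)."""
    return seq_len * seq_len

# Hypothetical setting: 4096 context tokens, compressed in chunks of 16.
tokens, chunk_size = 4096, 16
full_cost = attention_cost(tokens)                 # raw-token decoding
compressed = attention_cost(tokens // chunk_size)  # chunk-embedding decoding
print(full_cost // compressed)  # → 256, i.e. a chunk_size**2 reduction
```

Because both costs are quadratic, compressing by a factor of k shrinks attention work by k², not just k.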

Compress‑Anywhere : REFRAG can compress token blocks at any position while preserving the autoregressive property, supporting multi‑turn dialogue and agent‑style interactions.

Lightweight RL Strategy : A reinforcement‑learning policy decides when to use full tokens versus compressed chunk embeddings, minimizing reliance on expensive token‑level encoding.

Model Architecture

REFRAG consists of a decoder (e.g., LLaMA) and an encoder (e.g., RoBERTa). The input is split into a small number of query tokens followed by many context tokens. Context tokens are partitioned into equal‑size chunks, each encoded into a chunk embedding, projected to the decoder’s token‑embedding space, and then combined with the query tokens for answer generation.
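A minimal sketch of this input construction, with mean pooling standing in for the RoBERTa encoder and arbitrary dimensions (all names and shapes here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def refrag_decoder_input(query_emb, ctx_tokens, chunk_size, W_proj):
    """Sketch of REFRAG's input construction: context token embeddings
    are split into fixed-size chunks, each chunk is pooled into one
    embedding (toy stand-in for the encoder), projected into the
    decoder's embedding space, then combined with the query tokens."""
    n, d_enc = ctx_tokens.shape
    chunks = ctx_tokens.reshape(n // chunk_size, chunk_size, d_enc)
    chunk_emb = chunks.mean(axis=1)      # toy encoder: mean pooling
    projected = chunk_emb @ W_proj       # map to decoder dimension
    return np.concatenate([projected, query_emb], axis=0)

d_enc, d_dec = 8, 16
ctx = rng.normal(size=(64, d_enc))       # 64 context tokens
query = rng.normal(size=(5, d_dec))      # 5 query tokens
W = rng.normal(size=(d_enc, d_dec))
seq = refrag_decoder_input(query, ctx, chunk_size=16, W_proj=W)
# Decoder now sees 64/16 + 5 = 9 positions instead of 69.
```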

REFRAG main design

Performance Gains

Three metrics are evaluated: TTFT, TTIT (time‑to‑iterative‑token, the per‑token latency after the first), and throughput. REFRAG achieves 16.53× TTFT speedup with cache (8.59× without) and up to 6.78× throughput improvement over LLaMA, while maintaining or improving perplexity.

Inference acceleration validation

Training Methodology

REFRAG uses a Continual Pre‑training (CPT) recipe that combines a reconstruction task with curriculum learning. The reconstruction task forces the encoder to compress a fixed number of tokens while the decoder learns to reconstruct them; the decoder is frozen during this phase. Curriculum learning gradually increases the number of chunks to reconstruct, easing optimization.
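The curriculum idea can be sketched as a schedule that ramps up the number of chunks the decoder must reconstruct. The linear staging below is an assumption for illustration, not the paper's exact recipe:

```python
def curriculum_schedule(total_steps, max_chunks):
    """Toy curriculum for the reconstruction task: start by
    reconstructing a single chunk and raise the count in equal
    stages until max_chunks is reached (schedule shape assumed)."""
    steps_per_stage = total_steps // max_chunks
    return [min(step // steps_per_stage + 1, max_chunks)
            for step in range(total_steps)]

sched = curriculum_schedule(total_steps=8, max_chunks=4)
# → [1, 1, 2, 2, 3, 3, 4, 4]
```

Starting with one chunk makes the compression objective easy to optimize before the harder multi‑chunk reconstructions are introduced.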

Selective Compression

An RL policy evaluates the perplexity of the next‑segment prediction and decides which chunks to keep uncompressed. This preserves important information and improves answer quality while still allowing compression elsewhere.
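A greedy perplexity heuristic gives the flavor of the selection step. The actual policy is learned with RL; the function and numbers below are purely illustrative:

```python
def select_uncompressed(chunk_ppls, budget):
    """Heuristic stand-in for REFRAG's RL policy: keep uncompressed
    the `budget` chunks whose compressed form hurts next-segment
    perplexity the most (higher ppl = more information lost)."""
    ranked = sorted(range(len(chunk_ppls)),
                    key=lambda i: chunk_ppls[i], reverse=True)
    return sorted(ranked[:budget])

# Hypothetical per-chunk perplexities under compression:
ppls = [12.1, 3.4, 25.7, 8.0, 19.2]
keep = select_uncompressed(ppls, budget=2)  # → [2, 4]
```

Chunks 2 and 4 stay as full tokens; the rest are passed as chunk embeddings.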

Experimental Results

Training data: 20B tokens from the Book and ArXiv subsets of SlimPajama. Evaluation is on SlimPajama hold‑outs, PG‑19, and Proof‑Pile. Baselines include LLaMA‑2‑7B variants, CEPE, REPLUG, and full‑context LLaMA. REFRAG consistently outperforms these baselines at the standard context length (s = 2048) and extended lengths up to 16384, showing strong extrapolation ability.

Perplexity vs. context length
Perplexity vs. context size

Selective Compression Ablation

RL‑based selective compression outperforms perplexity‑based heuristics and random selection across all compression rates.

Selective compression performance

RAG and Multi‑turn Dialogue

Fine‑tuned REFRAG models achieve higher accuracy than LLaMA at equal latency or an equal number of retrieved passages, especially in weak‑retriever scenarios and long‑context dialogue, where LLaMA's 4k‑token window truncates history.

RAG performance with strong retriever
RAG performance with weak retriever
Multi‑turn RAG performance

Related Work

Prior efficient long‑context LLMs modify attention (e.g., compressive attention) or prompt strategies (e.g., attention sinks). REFRAG is complementary, offering pre‑computed chunk embeddings that can be used anywhere in the prompt.

Conclusion

REFRAG demonstrates that exploiting the inherent sparsity of RAG contexts and the block‑diagonal attention pattern enables dramatic reductions in latency and memory usage while preserving or improving perplexity. The framework provides a practical, scalable solution for latency‑sensitive, knowledge‑intensive applications of large language models.

Tags: LLM, RAG, long context, reinforcement learning, chunk embeddings, efficient decoding