How Fast‑Apply AI Models Revolutionize Code Editing with Speculative Decoding
This article explains the design of the edit_file tool, the fast‑apply model that rewrites whole files instead of diffs, its training and evaluation methodology, speculative decoding speed gains, and future research directions for large‑scale code‑editing AI systems.
edit_file Example
The edit_file tool accepts three parameters: target_file, instructions, and code_edit. An example call specifies the file cursor-reverse/src/types.ts, an instruction to update the LogEntry interface, and a code_edit written in the format defined in the tool definition.
export interface LogEntry {
timestamp: string;
requestId: string;
requestHeaders?: Record<string, any>;
request: OpenRouterRequest;
response?: OpenRouterResponse;
responseHeaders?: Record<string, any>;
error?: string;
tools?: Tool[]; // Added tools
model?: string; // Added model
messages?: ChatMessage[]; // Added messages
prompt?: string; // Added prompt
}
The instructions parameter is needed because edit_file itself is implemented by a large model and benefits from a well‑crafted prompt.
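For illustration, a call to edit_file might be shaped like the sketch below. The parameter names come from the description above; the payload structure, the abbreviated code_edit body, and the "// ... existing code ..." elision marker are assumptions for illustration, not the tool's actual format.

# Hypothetical edit_file call, shown as a Python dict purely for illustration.
# Parameter names (target_file, instructions, code_edit) come from the article;
# the elision marker inside code_edit is an assumed convention, not the real spec.
example_call = {
    "target_file": "cursor-reverse/src/types.ts",
    "instructions": "Add optional tools, model, messages, and prompt fields to LogEntry.",
    "code_edit": (
        "export interface LogEntry {\n"
        "  // ... existing code ...\n"
        "  tools?: Tool[];\n"
        "  model?: string;\n"
        "  messages?: ChatMessage[];\n"
        "  prompt?: string;\n"
        "}\n"
    ),
}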
Background
Frontier models such as GPT‑4o struggle with large code edits, exhibiting laziness, inaccuracy, and high latency. Editing hundreds of lines often requires many model calls and can lead to infinite loops or slow performance, making a fast, reliable "apply" step essential.
We trained a dedicated fast‑apply model that treats code editing as a two‑stage process: planning (via a chat interface with a strong model) and implementation (instant file rewrite).
Figure 2 shows a full‑class change that is difficult to copy‑paste directly.
Our fast‑apply model outperforms GPT‑4 and GPT‑4o on the accuracy‑latency Pareto curve, achieving roughly 1,000 tokens/second (≈3,500 characters/second) on a 70B parameter model—about 13× faster than standard Llama‑3‑70B inference and 9× faster than our previous GPT‑4 speculative‑edit system.
By default, the language model rewrites the file using the current file, dialogue history, and current code block.
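As a rough illustration of that conditioning, a rewrite prompt could be assembled along the lines of the sketch below; the section labels and ordering are assumptions rather than the production prompt.

# Rough sketch of how a full-file rewrite prompt could be assembled.
# The section markers and their ordering are assumptions for illustration,
# not the actual production prompt.
def build_rewrite_prompt(current_file: str, dialogue: list[str], code_block: str) -> str:
    history = "\n".join(dialogue)
    return (
        "You will rewrite the entire file to apply the proposed edit.\n\n"
        f"Conversation so far:\n{history}\n\n"
        f"Current file contents:\n{current_file}\n\n"
        f"Proposed code block:\n{code_block}\n\n"
        "Rewritten file:\n"
    )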
Evaluating Prompted Rewrites
We built an evaluation set of ~450 files (each < 400 lines) and used Claude‑3 Opus as a scorer to compare several prompted models. Claude‑3 Opus scores aligned more closely with internal human ratings than GPT‑4 Turbo or GPT‑4o.
Figure 4 shows part of the scoring prompt; Figure 5 indicates that Claude‑3‑Sonnet outperforms GPT‑4‑Turbo, while GPT‑4o roughly matches GPT‑4‑Turbo.
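A scorer in this spirit can be sketched as follows; only the choice of Claude‑3 Opus as judge comes from the text above, while the judge prompt wording and the 1‑to‑5 scale are assumptions.

# Sketch of an LLM-as-judge scorer for prompted rewrites.
# The judge prompt wording and the 1-5 scale are assumptions; only the use of
# Claude 3 Opus as the scoring model comes from the article.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def score_rewrite(original: str, instructions: str, rewritten: str) -> str:
    prompt = (
        "Rate how faithfully the rewritten file applies the requested edit.\n"
        "Reply with a single integer from 1 (wrong) to 5 (perfect).\n\n"
        f"Original file:\n{original}\n\n"
        f"Edit instructions:\n{instructions}\n\n"
        f"Rewritten file:\n{rewritten}\n"
    )
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=8,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()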
Speed Measurements
Speed is measured as rewritten characters divided by rewrite time in seconds. This metric:
Standardizes speed across different tokenizers.
Gives a single number across varying prompt and output lengths.
Offers a reliable lower bound on generation speed: since the measured time includes time to first token, dividing characters per second by roughly 4 characters per token yields a tokens‑per‑second lower bound.
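Concretely, the conversion works out as in the small example below; the numbers are made up, chosen only to land near the ~3,500 characters/second figure quoted above.

# Worked example of the speed metric (made-up numbers, ~4 chars/token heuristic).
rewritten_chars = 14_000    # characters in the rewritten file
rewrite_seconds = 4.0       # wall-clock time, including time to first token
chars_per_second = rewritten_chars / rewrite_seconds    # 3500 chars/s
tokens_per_second_lower_bound = chars_per_second / 4    # ~875 tokens/s
print(chars_per_second, tokens_per_second_lower_bound)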
Diff Models
We generally rewrite entire files rather than output diffs, because diff‑based editing suffers from token‑count limitations, out‑of‑distribution issues, and line‑number handling problems. For the diff models we do evaluate, we use a "search‑replace" format inspired by Aider instead of standard diffs.
@@ ... @@
-function binarySearch(arr, x) {
- let low = 0, high = arr.length - 1;
- while (low <= high) {
- let mid = Math.floor((low + high) / 2);
- if (arr[mid] === x) {
- return mid;
- }
- low += 1;
- }
- return -1;
+let low = 0, high = arr.length - 1;
+while (low <= high) {
+ let mid = Math.floor((low + high) / 2);
+ if (arr[mid] === x) {
+ return mid;
+ } else if (arr[mid] < x) {
+ low = mid + 1;
+ } else {
+ high = mid - 1;
+ }
+}
+return -1;
}
To apply an edit, we replace the lines beginning with - with the lines beginning with +, which makes the diff parser robust to minor model errors. Claude Opus is the only model that consistently produces accurate diffs.
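A minimal applier for blocks in this style might look like the following sketch; the parsing rules (stripping a one-character prefix and requiring an exact substring match) are simplifying assumptions, not the production parser.

# Minimal sketch of applying a search/replace-style diff block.
# Parsing rules here (strip a leading "-"/"+" and require an exact substring
# match) are simplifying assumptions, not the production parser.
def apply_search_replace(file_text: str, diff_block: str) -> str:
    search_lines, replace_lines = [], []
    for line in diff_block.splitlines():
        if line.startswith("@@"):
            continue                       # ignore hunk headers entirely
        if line.startswith("-"):
            search_lines.append(line[1:])
        elif line.startswith("+"):
            replace_lines.append(line[1:])
        else:
            search_lines.append(line)      # context lines appear on both sides
            replace_lines.append(line)
    search = "\n".join(search_lines)
    replace = "\n".join(replace_lines)
    if search not in file_text:
        raise ValueError("search block not found in file")
    return file_text.replace(search, replace, 1)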
Training
We generated synthetic data from a small set of "fast‑apply" prompts and a large set of cmd‑k prompts. Cmd‑k prompts contain edit instructions and a selected code region; we let GPT‑4 generate dialogue responses for each instruction, then apply the edits. We mixed this synthetic data with a small amount of real‑world data in roughly an 80/20 ratio to form the fine‑tuning dataset.
We trained Deepseek Coder Instruct and Llama‑3 families, applying undersampling strategies: reducing very short files (<100 lines), limiting examples per filename, and discarding no‑op edits.
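The undersampling step could be sketched roughly as below; the keep probability, per-filename cap, and record fields are illustrative assumptions.

# Sketch of the undersampling filters described above. The keep probability,
# per-filename cap, and record fields are illustrative assumptions.
import random
from collections import Counter

def filter_examples(examples: list[dict], short_keep_prob: float = 0.3, per_file_cap: int = 5) -> list[dict]:
    per_file = Counter()
    kept = []
    for ex in examples:
        if ex["original"] == ex["rewritten"]:
            continue                                            # discard no-op edits
        if ex["original"].count("\n") < 100 and random.random() > short_keep_prob:
            continue                                            # downsample very short files
        if per_file[ex["filename"]] >= per_file_cap:
            continue                                            # cap examples per filename
        per_file[ex["filename"]] += 1
        kept.append(ex)
    return kept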
The best model (llama‑3‑70b‑ft) matches Claude‑3‑Opus‑Diff performance and surpasses GPT‑4‑Turbo and GPT‑4o. All three fine‑tuned models beat GPT‑4‑Turbo, though the gap between deepseek‑33b and llama‑3‑70b remains noticeable.
Speculative Edits
Our biggest breakthrough is a custom speculative‑decoding algorithm we call "speculative edits," which produces output equivalent to a full‑file rewrite but yields up to a 9× speedup. Because most of a rewritten file matches the existing file, its tokens serve as a strong prior: we can deterministically draft future tokens from the file itself, with no separate draft model.
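The accept/reject loop behind this idea can be sketched as a toy, shown below. It is not the production algorithm: it verifies one draft token at a time (the real system verifies chunks in a single batched forward pass), re-anchors naively after a mismatch, and the "model" in the demo is a stand-in that replays a target string.

# Toy sketch of "speculative edits": the current file is the draft, and the
# model mostly just confirms long unchanged runs. This version verifies one
# token at a time and re-anchors naively after a mismatch; a real system
# verifies whole draft chunks in a single batched forward pass.
from typing import Callable, List

def speculative_rewrite(draft: List[str], next_token: Callable[[List[str]], str],
                        eos: str = "<eos>", chunk: int = 8) -> List[str]:
    out: List[str] = []
    pos = 0
    while True:
        guesses = draft[pos:pos + chunk]   # speculate the next `chunk` draft tokens
        accepted = 0
        for g in guesses:
            t = next_token(out)            # verification step (batched in practice)
            if t == eos:
                return out
            out.append(t)
            if t != g:
                break                      # mismatch: keep the model's token, stop speculating
            accepted += 1
        pos += max(accepted, 1)            # naive re-anchoring of the draft
        if pos >= len(draft) and accepted == len(guesses):
            t = next_token(out)            # draft exhausted: fall back to plain decoding
            while t != eos:
                out.append(t)
                t = next_token(out)
            return out

# Demo with a stand-in "model" that deterministically reproduces a target rewrite.
target = list("def add(a, b):\n    return a + b\n") + ["<eos>"]
def toy_model(prefix: List[str]) -> str:
    return target[len(prefix)]
original = list("def add(a, b):\n    return a - b\n")
assert "".join(speculative_rewrite(original, toy_model)) == "".join(target[:-1])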
Partnering with Fireworks, we deployed the fast‑apply model on their inference engine with a custom API for speculative logic, giving llama‑3 a 4–5× advantage over the next fastest model.
Future Directions
Long‑context training: Aim to rewrite files up to 2,500 lines; linear RoPE extensions have not yielded good results.
Knowledge distillation: Transfer fast‑apply capability to smaller models (e.g., llama‑3‑8b) for lower latency on large files.
Higher accuracy: Apply on‑policy RL using real‑world data to further boost performance.
Fast‑apply is a crucial module beyond chat, enabling more complex code‑generation systems as model planning abilities improve.
Additional Challenges
As an exercise, implement speculative decoding using the OpenAI API (compatible with davinci‑002 or babbage‑002). For a harder challenge, implement it in vLLM or TensorRT‑LLM.