Google’s Second Sword: Accelerating LLM Inference with Speculative Decoding and Cascades

The article analyzes Google’s shift from scaling‑law to efficiency‑law, detailing how speculative decoding, language‑model cascades, distillation, CALM, accurate quantized training, and the Mixture‑of‑Recursions architecture together form a multi‑layered strategy to cut inference cost, boost throughput, and sustain the company’s AI moat.

From Scaling Law to Efficiency Law

Google frames AI as the foundational layer for all services, emphasizing that inference efficiency is the primary economic moat. The strategic goal is to operate the largest model platform with the lowest per‑token cost for the most users.

Key Drivers of Efficiency

Competitive focus has shifted from training the biggest model to delivering the biggest model at the lowest cost and highest reliability.

Inference accounts for over 70% of large‑model operating expenses, so acceleration directly impacts profitability.

Algorithmic Advances

Speculative Decoding – Invented independently by Google Research and DeepMind around 2022, this method uses a small draft model to propose several future tokens, which the large model then verifies in a single parallel forward pass, accepting the drafts that are consistent with its own distribution. The output distribution is provably identical to the large model's (zero quality loss), while decoding speed increases substantially because several tokens can be committed per large‑model call.
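
A minimal sketch of the draft‑and‑verify loop is shown below, with toy uniform‑probability stand‑ins for the two models (`draft_probs` and `target_probs_batch` are hypothetical names, not Google's implementation). The acceptance rule min(1, p_target/p_draft) plus residual resampling is what keeps the output distribution identical to the large model's.

```python
# Sketch of speculative decoding: a small model drafts k tokens, the large model
# verifies all of them in one parallel pass. Toy stand-in models, not a real API.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100

def draft_probs(prefix):            # small model: cheap next-token distribution
    return np.full(VOCAB, 1.0 / VOCAB)

def target_probs_batch(prefix, k):  # large model: scores k+1 positions in ONE pass
    return np.full((k + 1, VOCAB), 1.0 / VOCAB)

def speculative_step(prefix, k=4):
    # 1) Draft model proposes k tokens autoregressively (cheap).
    drafted, q = [], []
    for _ in range(k):
        qi = draft_probs(prefix + drafted)
        drafted.append(int(rng.choice(VOCAB, p=qi)))
        q.append(qi)
    # 2) Target model verifies all k drafts in a single parallel forward pass.
    p = target_probs_batch(prefix, k)
    accepted = []
    for i, t in enumerate(drafted):
        # Accept with probability min(1, p/q): keeps the target distribution exact.
        if rng.random() < min(1.0, p[i][t] / q[i][t]):
            accepted.append(t)
        else:
            # First rejection: resample from the residual max(p - q, 0) and stop.
            resid = np.maximum(p[i] - q[i], 0.0)
            accepted.append(int(rng.choice(VOCAB, p=resid / resid.sum())))
            return prefix + accepted
    # All drafts accepted: take a bonus token from the target's extra position.
    accepted.append(int(rng.choice(VOCAB, p=p[k])))
    return prefix + accepted

print(speculative_step([1, 2, 3]))
```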

Language‑Model Cascades – Presented in the 2022 paper “Language Model Cascades” (arXiv:2207.10342) and the accompanying repository https://github.com/google-research/cascades. Large models are treated as composable functions and combined with patterns such as Scratchpads, Chain‑of‑Thought, semi‑supervised learning, selection‑inference, verifiers, and tool‑use to improve performance.
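
To make the "composable functions" framing concrete, the sketch below chains a generator that writes a chain‑of‑thought scratchpad with a verifier that scores it, retrying until the score clears a threshold. The `generate` and `verify` functions are hypothetical stand‑ins for LLM calls, not the google-research/cascades API.

```python
# Sketch of a cascade built from composable LM "functions": generate a scratchpad,
# verify it, retry if the verifier is unconvinced. Toy stand-ins for LLM calls.
def generate(question):
    # Pretend LLM call returning chain-of-thought reasoning plus an answer.
    return {"scratchpad": "6 * 7 = 42", "answer": "42"}

def verify(question, attempt):
    # Pretend verifier LLM scoring whether the reasoning supports the answer.
    return 0.9 if "42" in attempt["scratchpad"] else 0.1

def cascade(question, attempts=3, threshold=0.5):
    best = (-1.0, None)
    for _ in range(attempts):
        attempt = generate(question)
        score = verify(question, attempt)
        if score > best[0]:
            best = (score, attempt)
        if score >= threshold:        # good enough: stop paying for more calls
            break
    return best[1]["answer"]

print(cascade("What is 6 * 7?"))
```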

Speculative Cascades – By the end of 2024 Google merged speculative decoding with cascade methods, releasing a hybrid approach described at https://research.google/blog/speculative-cascades-a-hybrid-approach-for-smarter-faster-llm-inference/. Confirmation mechanisms include confidence‑based selection, comparative confidence, top‑k pools, and cost‑benefit analysis.
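
A rough sketch of how the two ideas combine, assuming toy stand‑in models: the small model drafts a few tokens, the large model scores all of them in one parallel pass, and a per‑token deferral rule (here, "accept the draft if it falls in the large model's top‑k") decides whether the cheap prediction is good enough. Unlike plain speculative decoding, this relaxes the exact‑match guarantee in exchange for more accepted tokens; the function names are illustrative, not Google's implementation.

```python
# Sketch of a speculative cascade: draft with the small model, verify in one pass
# with the large model, and apply a top-k deferral rule per token. Toy models only.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, K_DRAFT, TOP_K = 100, 4, 5

def draft_token(prefix):
    return int(rng.integers(VOCAB))                # small model proposal

def target_probs_batch(prefix, k):
    return rng.dirichlet(np.ones(VOCAB), size=k)   # large model, k positions in one pass

def speculative_cascade_step(prefix):
    drafted = []
    for _ in range(K_DRAFT):
        drafted.append(draft_token(prefix + drafted))
    p = target_probs_batch(prefix, K_DRAFT)
    out = []
    for i, t in enumerate(drafted):
        top_k_ids = np.argsort(p[i])[-TOP_K:]
        if t in top_k_ids:                         # deferral rule: draft is good enough
            out.append(t)
        else:                                      # defer to the large model's own token
            out.append(int(np.argmax(p[i])))
            break                                  # later drafts no longer match the prefix
    return prefix + out

print(speculative_cascade_step([1, 2, 3]))
```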

Distillation Techniques – Building on Hinton's early work, Google's distillation trains a compact student model against the teacher's temperature‑softened softmax distribution rather than hard labels alone. Model parameters can be reduced by more than tenfold while preserving 90‑95% of original performance, yielding 5‑20× lower inference cost and latency, especially for mobile and edge devices.
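
The core training signal is a temperature‑softened softmax from the teacher mixed with the usual hard‑label loss. Below is a small numpy sketch of that loss (the T² rescaling follows Hinton et al.'s formulation; a real setup would backpropagate through the student, and the logits here are made‑up numbers).

```python
# Sketch of a distillation loss: cross-entropy against the teacher's
# temperature-softened distribution plus ordinary cross-entropy on the hard label.
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.5):
    p_teacher = softmax(teacher_logits, T)    # soft targets carry "dark knowledge"
    p_student = softmax(student_logits, T)
    soft = -np.sum(p_teacher * np.log(p_student + 1e-12)) * (T * T)  # T^2 rescales gradients
    hard = -np.log(softmax(student_logits)[hard_label] + 1e-12)      # standard CE
    return alpha * soft + (1 - alpha) * hard

teacher_logits = np.array([4.0, 1.0, 0.5, -2.0])   # large model's raw scores
student_logits = np.array([2.5, 0.8, 0.3, -1.0])   # small model's raw scores
print(distillation_loss(student_logits, teacher_logits, hard_label=0))
```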

Confident Adaptive Language Modeling (CALM) – Introduced in 2022, CALM lets individual tokens exit the transformer stack early once an intermediate layer's prediction is sufficiently confident, skipping the remaining layers and delivering 2×‑3× throughput gains with negligible accuracy loss.
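
A toy sketch of the early‑exit mechanism follows: random linear layers stand in for transformer blocks, and a token stops descending the stack once its intermediate prediction's maximum softmax probability crosses a threshold. The layer shapes, threshold value, and confidence measure here are illustrative choices, not the exact CALM recipe.

```python
# Sketch of CALM-style per-token early exit: run layers until the intermediate
# prediction is confident enough, then stop. Toy layers, illustrative threshold.
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, DIM, VOCAB = 8, 16, 50
layers = [rng.normal(size=(DIM, DIM)) * 0.1 for _ in range(N_LAYERS)]
unembed = rng.normal(size=(DIM, VOCAB))        # shared output head used at every exit

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_token(h, threshold=0.15):
    for depth, W in enumerate(layers, start=1):
        h = np.tanh(h @ W)                     # one stand-in "transformer block"
        probs = softmax(h @ unembed)           # early-exit prediction at this depth
        if probs.max() >= threshold:           # confident enough: skip remaining layers
            return int(probs.argmax()), depth
    return int(probs.argmax()), N_LAYERS       # never confident: used the full stack

for _ in range(3):
    token, depth = decode_token(rng.normal(size=DIM))
    print(f"token {token} exited at layer {depth}")
```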

Accurate Quantized Training (AQT) and Qwix – The 2021 paper “Pareto‑Optimal Quantized ResNet Is Mostly 4‑bit” (arXiv:2105.03536) released the AQT JAX library (https://github.com/google/aqt). In 2023 Google Cloud announced AQT for TPU v5e (https://cloud.google.com/blog/products/compute/accurate-quantized-training-aqt-for-tpu-v5e). By 2025 Qwix (https://github.com/google/qwix) superseded AQT for inference, while AQT remains the quantization backbone for training pipelines such as MaxText.
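
To illustrate the underlying mechanism (without reproducing the AQT or Qwix APIs), the sketch below "fake‑quantizes" weights and activations to int8 inside a matmul, so the computation sees quantization error while the surrounding math stays in float. This is the generic quantized‑training idea with an arbitrary symmetric per‑tensor scale; the real libraries integrate it at the JAX op level (e.g., a quantized dot_general) with richer calibration options.

```python
# Sketch of quantization-aware training's core trick: fake int8 quantization of
# both operands of a matmul. Generic illustration, not the AQT/Qwix API.
import numpy as np

rng = np.random.default_rng(0)

def fake_quant_int8(x):
    scale = np.abs(x).max() / 127.0 + 1e-12        # symmetric per-tensor scale
    q = np.clip(np.round(x / scale), -127, 127)    # values an int8 tensor could hold
    return q * scale                               # dequantize back to float

def quantized_matmul(a, w):
    # Training "feels" the rounding error, so the weights adapt to low precision.
    return fake_quant_int8(a) @ fake_quant_int8(w)

a = rng.normal(size=(4, 64))     # activations
w = rng.normal(size=(64, 32))    # weights
err = np.abs(quantized_matmul(a, w) - a @ w).mean()
print("mean abs deviation from the float matmul:", err)
```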

Mixture‑of‑Recursions (MoR) – Proposed in July 2025 by Google DeepMind, KAIST AI, and Mila (paper: “Mixture‑of‑Recursions: Learning Dynamic Recursive Depths for Adaptive Token‑Level Computation”). MoR combines recursive transformer weight sharing, token‑wise dynamic routing, and smart KV caching, achieving a 2× speedup, 50% memory reduction, and 1/3 parameter count while maintaining or exceeding baseline performance.
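
A toy sketch of the routing idea, under the simplifying assumption that a single shared block stands in for the recursed layer stack and a linear‑plus‑sigmoid router assigns each token a recursion depth; router training and the KV‑caching strategy from the MoR paper are omitted here.

```python
# Sketch of Mixture-of-Recursions: one shared weight block applied recursively,
# with a per-token router choosing how many recursion steps each token gets.
# Toy components; not the DeepMind/KAIST/Mila implementation.
import numpy as np

rng = np.random.default_rng(0)
DIM, MAX_RECURSIONS = 16, 4
W_shared = rng.normal(size=(DIM, DIM)) * 0.1   # ONE weight set reused at every depth
router = rng.normal(size=DIM) * 0.5            # scores how much compute a token needs

def mor_forward(token_states):
    outputs, depths = [], []
    for h in token_states:                               # token-wise dynamic routing
        score = 1.0 / (1.0 + np.exp(-(h @ router)))      # sigmoid "difficulty" in (0, 1)
        n_steps = 1 + int(round(score * (MAX_RECURSIONS - 1)))
        for _ in range(n_steps):
            h = np.tanh(h @ W_shared)                    # recursive reuse of shared weights
        outputs.append(h)
        depths.append(n_steps)
    return np.stack(outputs), depths

hidden = rng.normal(size=(6, DIM))             # hidden states for a batch of 6 tokens
_, depths = mor_forward(hidden)
print("recursion depth per token:", depths)
```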

Impact and Outlook

These front‑line acceleration algorithms enable Google to preserve margins in search, advertising, and YouTube while scaling LLM services across products. The competitive frontier is moving toward system‑level co‑design of hardware, algorithms, and software to achieve low‑power, sub‑second, long‑context inference.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Quantization, Speculative Decoding, Inference Acceleration, Google TPU, Language Model Cascades

Written by

AI2ML AI to Machine Learning

Original articles on artificial intelligence and machine learning, deep optimization. Less is more, life is simple! Shi Chunqi
