How Google Cut Gemini’s Energy Use to 0.24 Wh per Query
Google reveals that a single Gemini query now consumes only 0.24 Wh of electricity, emits 0.03 g CO₂e and uses about five drops of water, thanks to a comprehensive measurement framework and aggressive optimizations across model architecture, quantization, hardware design, and data‑center operations.
Measuring AI Energy Consumption
Google’s measurement framework evaluates the real‑world energy use of AI services by accounting for the entire system rather than theoretical peak efficiency. The calculation includes:
Full‑system dynamic power: energy consumed by the model inference plus supporting infrastructure, scaled by actual chip utilization.
Idle capacity: energy drawn by pre‑provisioned TPU/CPU resources that remain idle but must be available for traffic spikes or fail‑over.
Host CPU and memory: power used by the servers that host the accelerators and manage data movement.
Data‑center overhead: cooling, power distribution, and other facility loads, expressed via Power Usage Effectiveness (PUE).
Water usage: water required for cooling, which decreases as overall efficiency improves.
Applying this methodology, a single Gemini inference consumes 0.24 Wh, emits 0.03 g CO₂e, and uses roughly five drops of water (≈0.26 ml).
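The accounting above can be sketched as a short calculation. The component values below are illustrative placeholders, not figures from Google's report; only the accounting structure (dynamic plus host energy, amortized over non-idle capacity, multiplied by PUE) follows the framework.

```python
def per_query_energy_wh(accel_wh, host_wh, idle_fraction, pue):
    """Full-system energy for one query under the measurement framework.

    accel_wh:      dynamic accelerator (TPU) energy per query
    host_wh:       host CPU / DRAM energy per query
    idle_fraction: share of fleet capacity held idle for spikes and fail-over
    pue:           data-center Power Usage Effectiveness multiplier
    """
    # Amortize the idle reserve over the queries actually served.
    it_energy = (accel_wh + host_wh) / (1.0 - idle_fraction)
    # Facility overhead (cooling, power distribution) enters via PUE.
    return it_energy * pue

# Illustrative inputs only (assumed, not Google's internal numbers):
print(round(per_query_energy_wh(0.16, 0.03, 0.1, 1.09), 3))  # ≈ 0.23 Wh
```

Note how a lower idle fraction and a lower PUE both shrink the final number, which is why scheduling and facility efficiency appear alongside model-level optimizations below.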
Why Gemini’s Consumption Is So Low
More efficient model architecture : Gemini’s transformer design, augmented with Mixture‑of‑Experts (MoE) and hybrid inference, delivers 10‑100× higher compute efficiency than earlier language‑model architectures.
Accurate Quantized Training (AQT): quantization-aware training reduces the bit‑width of weights and activations, cutting compute and data‑movement costs without degrading answer quality.
Distillation : Large teacher models are used to train smaller, high‑performance variants such as Gemini Flash and Flash‑Lite, lowering per‑query energy.
Speculative decoding : A lightweight model generates a draft answer that a larger model quickly verifies, reducing total accelerator usage.
Custom TPU hardware: the Ironwood‑generation TPUs deliver ~30× better performance per watt than the first publicly available TPU, and hardware‑software co‑design ensures models fully exploit accelerator capabilities.
Dynamic resource scheduling: the serving stack allocates CPUs and TPUs in near‑real‑time based on demand, minimizing idle power.
Advanced compilation: XLA, Pallas kernels, and the Pathways system compile JAX‑expressed models efficiently for TPU execution.
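To see why Mixture-of-Experts routing saves compute, here is a minimal top-k gating sketch in plain Python: only k of the experts execute per token, so per-token cost scales with k rather than with the total number of experts. The expert functions and gate scores are toy assumptions for illustration, not Gemini's architecture.

```python
def moe_forward(x, experts, gate_scores, k=2):
    """Run only the top-k experts (by gate score) and mix their outputs.

    experts: list of callables; gate_scores: one router score per expert.
    Compute cost is k expert calls, regardless of len(experts).
    """
    topk = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:k]
    norm = sum(gate_scores[i] for i in topk)  # renormalize over selected experts
    return sum(gate_scores[i] / norm * experts[i](x) for i in topk)

# Toy example: four "experts" that just scale the input.
experts = [lambda x, m=m: m * x for m in (1.0, 2.0, 3.0, 4.0)]
gates = [0.1, 0.4, 0.2, 0.3]           # router output for this token
y = moe_forward(1.0, experts, gates)   # only experts 1 and 3 run
```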
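The quantization saving can also be made concrete. AQT performs quantization-aware training inside the model's forward pass; the sketch below instead shows the simpler post-hoc form of the same idea (symmetric int8 with one scale per tensor), just to illustrate the bit-width reduction.

```python
def quantize_int8(weights):
    """Map floats to int8 codes plus one float scale: 8 bits per weight
    instead of 32, cutting storage and data movement roughly 4x."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(codes, scale):
    """Recover approximate floats from the int8 codes."""
    return [c * scale for c in codes]

w = [0.5, -1.27, 0.031, 1.0]
codes, scale = quantize_int8(w)
approx = dequantize(codes, scale)  # close to w, within half a quantization step
```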
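Distillation's core objective is compact enough to sketch as well: the student is trained to match the teacher's temperature-softened output distribution rather than hard labels. This is the generic soft-target loss, not Google's training recipe; the logits and temperature below are made-up values.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature softens them."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened
    distribution; minimized when the student reproduces the teacher."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

teacher = [2.0, 0.5, -1.0]
matched = distillation_loss(teacher, teacher)       # student copies teacher
off = distillation_loss(teacher, [0.5, 2.0, -1.0])  # student disagrees
```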
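Speculative decoding is the most protocol-like of these techniques, so a sketch helps: a cheap draft model proposes a few tokens, and the large target model verifies them (in production, in a single batched forward pass), keeping the longest agreeing prefix. The toy "models" below are plain functions over token lists, an assumption for illustration; the key property, which holds in general, is that the output is identical to decoding with the target model alone.

```python
def speculative_decode(draft_model, target_model, prompt, n_tokens, k=4):
    """Draft-and-verify loop. The output always matches what target_model
    would produce greedily on its own; the draft model only adds speed."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Cheap draft model proposes k tokens autoregressively.
        ctx = out[:]
        proposal = []
        for _ in range(k):
            tok = draft_model(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. Expensive target model verifies the k positions (sequentially
        #    here; in a real system, one batched pass over all k).
        ctx = out[:]
        accepted = 0
        for tok in proposal:
            if target_model(ctx) == tok:
                ctx.append(tok)
                accepted += 1
            else:
                break
        # 3. On the first mismatch, the target model supplies the token.
        if accepted < k:
            ctx.append(target_model(ctx))
        out = ctx
    return out[len(prompt):][:n_tokens]
```

The better the draft model's agreement rate, the more tokens the target model accepts per verification step, which is what reduces total accelerator time.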
Google reports a fleet-wide average data‑center PUE of 1.09, among the industry's best, and the company aims to run on 24/7 carbon‑free energy and to replenish 120 % of the freshwater it consumes. Combined hardware, software, and operational improvements have reduced Gemini's energy per inference by a factor of 33 and its carbon emissions by a factor of 44 compared with earlier versions.
Reference URLs:
https://x.com/JeffDean/status/1958525015722434945
https://cloud.google.com/blog/products/infrastructure/measuring-the-environmental-impact-of-ai-inference/
Source: 人工智能前沿讲习 (AI Frontier Lectures). This article is about 1,800 characters, an estimated 5-minute read, and covers the energy consumption of large models.
Data Party THU: official platform of the Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.