How Google Cut Gemini’s Per-Query Energy Use to a Quarter of a Watt-Hour

Google reports that a single Gemini query now consumes only 0.24 Wh of electricity, emits 0.03 g CO₂e, and uses about five drops of water, thanks to a comprehensive measurement framework and aggressive optimizations across model architecture, quantization, hardware design, and data‑center operations.


Measuring AI Energy Consumption

Google’s measurement framework evaluates the real‑world energy use of AI services by accounting for the entire system rather than theoretical peak efficiency. The calculation includes:

Full-system dynamic power: energy consumed by the model inference plus supporting infrastructure, scaled by actual chip utilization.

Idle capacity: energy drawn by pre-provisioned TPU/CPU resources that sit idle but must remain available for traffic spikes or failover.

Host CPU and memory: power used by the servers that host the accelerators and manage data movement.

Data-center overhead: cooling, power distribution, and other facility loads, expressed via Power Usage Effectiveness (PUE).

Water usage: water required for cooling, which decreases as overall efficiency improves.

Applying this methodology, a single Gemini inference consumes 0.24 Wh, emits 0.03 g CO₂e, and uses roughly five drops of water (≈0.26 ml).
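
As a rough illustration of how these components combine, here is a minimal Python sketch of the accounting: IT energy (dynamic + idle + host) is scaled by PUE, then converted to emissions and water via effective intensity factors. The function name and every input value are illustrative assumptions chosen to land near the reported figures, not Google's published internals.

```python
# Illustrative per-query footprint model following the methodology above.
# Every input value is a placeholder assumption, not a published Google figure.

def per_query_footprint(dynamic_wh, idle_wh, host_wh, pue,
                        grid_gco2e_per_kwh, water_ml_per_wh):
    it_energy_wh = dynamic_wh + idle_wh + host_wh       # full-system IT energy
    facility_wh = it_energy_wh * pue                    # add cooling/distribution overhead
    co2e_g = facility_wh * grid_gco2e_per_kwh / 1000.0  # effective grid emissions factor
    water_ml = facility_wh * water_ml_per_wh            # effective water-use factor
    return facility_wh, co2e_g, water_ml

# Placeholder inputs chosen so the output lands near the reported figures.
energy, co2e, water = per_query_footprint(
    dynamic_wh=0.16,           # model inference + supporting infrastructure
    idle_wh=0.04,              # amortized pre-provisioned idle capacity
    host_wh=0.02,              # host CPU and memory
    pue=1.09,                  # data-center Power Usage Effectiveness
    grid_gco2e_per_kwh=125.0,  # g CO2e per kWh of facility energy
    water_ml_per_wh=1.1,       # ml of cooling water per Wh of facility energy
)
print(f"{energy:.2f} Wh, {co2e:.3f} g CO2e, {water:.2f} ml")
# -> 0.24 Wh, 0.030 g CO2e, 0.26 ml
```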

Why Gemini’s Consumption Is So Low

More efficient model architecture: Gemini's transformer design, augmented with Mixture-of-Experts (MoE) and hybrid inference, activates only a subset of the model per query, delivering 10-100× higher compute efficiency than earlier dense language-model architectures (a routing sketch appears in the Code example section below).

Accurate Quantized Training (AQT): quantization-aware training reduces the bit-width of weights and activations, cutting compute and data-movement costs without degrading answer quality.
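
To make the quantization idea concrete, here is a minimal NumPy sketch of symmetric per-tensor int8 quantization. It illustrates the general technique only; it is not the API or algorithm of Google's AQT library.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.max(np.abs(w)) / 127.0                      # map the largest weight to ±127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale                    # recover approximate float weights

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.max(np.abs(w - dequantize(q, scale))))
```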

Distillation: large teacher models are used to train smaller, high-performance variants such as Gemini Flash and Flash-Lite, lowering per-query energy.
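
A minimal sketch of the distillation objective follows, assuming generic teacher/student logit arrays; real pipelines add temperature scaling factors and mix in a ground-truth loss.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T                                              # temperature-soften the logits
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(teacher_logits, student_logits, T=2.0):
    """Mean KL(teacher || student) over the batch, on softened distributions."""
    p = softmax(teacher_logits, T)                         # soft teacher targets
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))) / len(p))

teacher = np.random.randn(4, 32)   # (batch, vocab) toy logits
student = np.random.randn(4, 32)
print(distill_loss(teacher, student))
```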

Speculative decoding: a lightweight draft model proposes several tokens ahead, which the larger model verifies in a single pass, reducing total accelerator usage.
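
The toy Python sketch below shows the control flow of greedy-matching speculative decoding. ToyModel and its next-token rule are invented stand-ins for real draft and target models; production systems use a probabilistic accept/reject rule rather than exact token matching.

```python
import random

class ToyModel:
    """Stand-in language model: the 'correct' next token is (last + 1) % 100,
    and error_rate controls how often the model deviates (a cheap draft model
    errs more often than the expensive target model)."""
    def __init__(self, error_rate, seed):
        self.error_rate = error_rate
        self.rng = random.Random(seed)

    def next_token(self, ctx):
        correct = (ctx[-1] + 1) % 100
        return (correct + 7) % 100 if self.rng.random() < self.error_rate else correct

def speculative_decode(target, draft, prompt, k=4, max_tokens=24):
    """Greedy-matching speculative decoding: the draft model proposes k tokens,
    one target pass verifies them, and we commit the agreeing prefix plus one
    target token -- several tokens per expensive target step."""
    tokens, target_steps = list(prompt), 0
    while len(tokens) < max_tokens:
        ctx, drafts = list(tokens), []
        for _ in range(k):                      # cheap autoregressive drafting
            drafts.append(draft.next_token(ctx))
            ctx.append(drafts[-1])
        target_steps += 1                       # one expensive verification pass
        ctx, verified = list(tokens), []
        for _ in range(k + 1):
            verified.append(target.next_token(ctx))
            ctx.append(verified[-1])
        n_ok = 0
        for d, v in zip(drafts, verified):
            if d != v:
                break
            n_ok += 1
        tokens.extend(verified[:n_ok + 1])      # accepted drafts + correction token
    return tokens[:max_tokens], target_steps

out, steps = speculative_decode(ToyModel(0.0, 1), ToyModel(0.25, 2), [0])
print(f"{len(out)} tokens in {steps} target-model steps")
```

In this toy run the target model commits several tokens per expensive step, which is where the accelerator savings come from.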

Custom TPU hardware: Ironwood-generation TPUs are roughly 30× more power-efficient than Google's first publicly available TPU, and hardware-software co-design ensures models fully exploit accelerator capabilities.

Dynamic resource scheduling: the serving stack allocates CPUs and TPUs in near-real time based on demand, minimizing idle power.

Advanced compilation: XLA, Pallas kernels, and the Pathways system compile JAX-expressed models efficiently for TPU execution.
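
As a small illustration of the JAX-to-XLA path this item describes, the snippet below jit-compiles a toy MLP block so XLA can fuse its operations for whatever backend is available. It is a generic JAX example, not Google's serving stack.

```python
import jax
import jax.numpy as jnp

@jax.jit  # trace once, compile with XLA for the available backend (CPU/GPU/TPU)
def gelu_mlp(x, w1, w2):
    # A tiny two-layer MLP block; XLA can fuse the matmuls and activation.
    return jax.nn.gelu(x @ w1) @ w2

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
x = jax.random.normal(k1, (8, 16))
w1 = jax.random.normal(k2, (16, 32))
w2 = jax.random.normal(k3, (32, 16))
print(gelu_mlp(x, w1, w2).shape)  # (8, 16); later calls reuse the compiled kernel
```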

Google's fleet-wide data-center PUE averages 1.09, among the industry's best; the company is working toward operating on 24/7 carbon-free energy and aims to replenish 120% of the freshwater it consumes. Combined hardware, software, and operational improvements reduced the energy of a median Gemini text prompt by a factor of 33, and its carbon footprint by a factor of 44, over a recent twelve-month period.

Reference URLs:

https://x.com/JeffDean/status/1958525015722434945

https://cloud.google.com/blog/products/infrastructure/measuring-the-environmental-impact-of-ai-inference/

Code example
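
As a concrete companion to the architecture discussion above, here is a minimal NumPy sketch of top-k Mixture-of-Experts routing, showing why only a fraction of a model's parameters run for each token. The shapes, gating rule, and loop structure are simplified assumptions, not Gemini's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, d_model, top_k = 8, 16, 2
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]  # toy expert FFNs
router_w = rng.normal(size=(d_model, n_experts))                           # router projection

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts; only k/n_experts of the FFN compute runs."""
    logits = x @ router_w                                          # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]                  # top-k expert ids per token
    gates = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True) # softmax gate weights
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top[t]:
            out[t] += gates[t, e] * (x[t] @ experts[e])            # run only selected experts
    return out

tokens = rng.normal(size=(4, d_model))
print(moe_layer(tokens).shape)  # (4, 16); compute scales with top_k, not n_experts
```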

Source: 人工智能前沿讲习 (AI Frontier Lectures)
This article is about 1,800 words and takes roughly 5 minutes to read.
It introduces the energy consumption of large models.
Tags: Data Center, energy efficiency, TPU, Google Gemini, AI sustainability, AI energy
Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
