The Energy Footprint of Inference: A Back-of-Envelope

Almost every paragraph written about AI's energy demand quietly conflates two very different things: the energy spent training a frontier model once, and the energy spent serving it billions of times after. Training is famous; inference is what actually scales. This piece is a worked back-of-envelope for the second number, with all the assumptions visible so you can argue with them.

0.34 Wh

Estimated electrical energy per ChatGPT-style answer, assuming a 70-billion-parameter model served on H100-class accelerators at the throughput observed in MLPerf Inference v4.0. Method · 412 output tokens · 4-bit quantised · 6 active replicas per query

What an inference call actually costs

The dominant cost of generating a reply is the matrix-vector multiplications inside the model's transformer blocks, repeated once per output token. For a model of N parameters, each generated token costs roughly 2 × N FLOPs (one multiply and one add per parameter), whether the weights are stored in fp8 or int4¹. A 70B-parameter model producing a 400-token reply therefore needs around 2 × 70 × 10⁹ × 400 = 5.6 × 10¹³ FLOPs, or 56 trillion.
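As a quick check of that arithmetic, here is a minimal sketch of the 2 × N rule. The function name `decode_flops` is illustrative, and the rule itself ignores the attention term, which is a reasonable approximation only for short contexts.

```python
def decode_flops(n_params: float, n_tokens: int) -> float:
    """Approximate forward-pass FLOPs for generating n_tokens.

    Uses the 2 * N rule of thumb: one multiply and one add per
    parameter per generated token. Attention cost is ignored.
    """
    return 2.0 * n_params * n_tokens


flops = decode_flops(n_params=70e9, n_tokens=400)
print(f"{flops:.2e} FLOPs")  # 5.60e+13, i.e. 56 trillion
```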

Raw compute is not the bottleneck here; memory bandwidth is. An H100 SXM5 lists 989 TFLOPS of dense fp16/bf16 tensor throughput (about twice that at fp8), but in real serving it delivers far less, because every weight has to be streamed from HBM3 to the tensor cores on each decoding step². The empirical number from MLPerf Inference v4.0 is closer to 180 tokens/sec/GPU for batched serving of Llama-2-70B at typical context lengths.
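A rough roofline-style sketch makes the bandwidth argument concrete. It compares two ceilings for a single unbatched sequence: how fast the weights can be read from memory, and how fast the arithmetic could run at peak. The 3.35 TB/s HBM3 bandwidth is an assumption taken from NVIDIA's published H100 SXM5 specification, not from the text, and the model ignores KV-cache traffic and multi-GPU sharding.

```python
HBM_BANDWIDTH_BYTES_PER_S = 3.35e12  # assumed H100 SXM5 HBM3 bandwidth
PEAK_FLOPS = 989e12                  # the 989 TFLOPS dense figure from the text
N_PARAMS = 70e9                      # 70B-parameter model


def bandwidth_bound_tokens_per_s(bytes_per_param: float) -> float:
    """Ceiling when every weight is read from HBM once per token (batch of 1)."""
    weight_bytes = N_PARAMS * bytes_per_param
    return HBM_BANDWIDTH_BYTES_PER_S / weight_bytes


def compute_bound_tokens_per_s() -> float:
    """Ceiling if arithmetic were the only limit (2 FLOPs per parameter)."""
    return PEAK_FLOPS / (2.0 * N_PARAMS)


print(f"int4, bandwidth-bound: {bandwidth_bound_tokens_per_s(0.5):.0f} tok/s")
print(f"fp8,  bandwidth-bound: {bandwidth_bound_tokens_per_s(1.0):.0f} tok/s")
print(f"compute-bound:         {compute_bound_tokens_per_s():.0f} tok/s")
```

Under these assumptions the compute ceiling sits well over an order of magnitude above the bandwidth ceiling, which is why batching (sharing one weight read across many sequences) matters so much.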

At that rate a 400-token answer occupies about 2.2 GPU-seconds. Multiplying by the accelerator's average power draw and by facility overhead (PUE ≈ 1.2, the 2025 hyperscale median³) gives roughly 0.34 Wh per 400-token answer.
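The text does not state the per-GPU power draw behind the 0.34 Wh figure, so the sketch below works in both directions: it computes energy per answer for a given wattage, and it back-solves the average wattage that 0.34 Wh implies. The 700 W value is an assumption (the published H100 SXM5 board power limit), used here only as an upper-bound comparison.

```python
TOKENS = 400                  # output tokens per answer (from the text)
TOKENS_PER_S_PER_GPU = 180.0  # serving throughput (from the text)
PUE = 1.2                     # facility overhead (from the text)


def energy_wh(gpu_watts: float) -> float:
    """Wall-plug energy per answer in watt-hours for a given GPU draw."""
    gpu_seconds = TOKENS / TOKENS_PER_S_PER_GPU
    joules = gpu_watts * gpu_seconds * PUE
    return joules / 3600.0


def implied_gpu_watts(target_wh: float) -> float:
    """Average GPU draw that would yield target_wh per answer."""
    gpu_seconds = TOKENS / TOKENS_PER_S_PER_GPU
    return target_wh * 3600.0 / (gpu_seconds * PUE)


print(f"implied draw for 0.34 Wh: {implied_gpu_watts(0.34):.0f} W")
print(f"energy at an assumed 700 W: {energy_wh(700.0):.2f} Wh")
```

So the headline number corresponds to an average draw of roughly 460 W per accelerator; at the full assumed board power it would be closer to half a watt-hour.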

Putting that against the rest of your day

A single answer from a 70B model costs about the same as a 2009-era Google search, less than half a percent of a cup of coffee. The number that makes the press headlines — one query = brewing a pot of tea — is real for some configurations (large mixture-of-experts models, full bf16 weights, batch size of one), but is not how production systems are actually served.

Where the real cost hides

Three places. First, idle capacity: GPUs left warm-but-empty so latency stays low. In conversations with two large inference providers, useful-work utilization rarely exceeds 35% across a 24-hour day. Multiply our 0.34 Wh by roughly 3 and you get a more honest number.
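One way to sketch the idle adjustment: charge every useful answer with its share of the hours the hardware spent waiting. The parameter `idle_power_fraction` is hypothetical (it is not from the interviews); setting it to 1.0 assumes an idle GPU draws as much as a busy one, which reproduces the simple "divide by utilization" reading.

```python
def energy_with_idle(raw_wh: float, utilization: float,
                     idle_power_fraction: float = 1.0) -> float:
    """Energy per useful answer once idle time is charged to it.

    utilization: fraction of wall-clock time spent on useful work.
    idle_power_fraction: idle draw relative to busy draw (hypothetical knob).
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    idle_time = 1.0 - utilization
    multiplier = (utilization + idle_time * idle_power_fraction) / utilization
    return raw_wh * multiplier


print(f"{energy_with_idle(0.34, 0.35):.2f} Wh")       # idle draws full power
print(f"{energy_with_idle(0.34, 0.35, 0.5):.2f} Wh")  # idle draws half power
```

At 35% utilization and full idle draw the multiplier is about 2.9, in line with the "roughly 3" used above.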

Second, context length. The 400-token output above assumed a short prompt. A 32k-token prompt with a 1k-token reply costs maybe 8× more, because every output token has to attend over the entire cached prompt, so the memory traffic per token grows with context length. Long-context use is the genuinely energy-expensive frontier.
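To see why long prompts are expensive, it helps to size the key-value cache that each new token has to read. The sketch below assumes the published Llama-2-70B shape (80 layers, 8 key-value heads under grouped-query attention, head dimension 128) and a 16-bit cache; none of these values come from the text, and the sketch does not try to reproduce the 8× estimate.

```python
N_LAYERS = 80        # assumed Llama-2-70B configuration
N_KV_HEADS = 8       # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumed fp16/bf16 cache


def kv_bytes_per_token() -> int:
    """Cache bytes stored per context token (keys and values, all layers)."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_cache_bytes(context_tokens: int) -> int:
    """Bytes read from the cache to generate one token at this context length."""
    return kv_bytes_per_token() * context_tokens


for ctx in (512, 32_768):
    print(f"{ctx:>6} tokens: {kv_cache_bytes(ctx) / 1e9:.2f} GB per sequence")
```

Unlike the weights, this cache is private to each sequence, so batching does not amortise it: at 32k tokens every sequence in the batch carries roughly 10 GB of its own reads per decoding step.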

3.1×

The "idle premium" — extra energy attributable to keeping accelerators warm for latency-sensitive serving, expressed as a multiplier on the raw per-token figure. Source · DataSynth interviews with two hyperscale providers, anonymized

Third, batching. The single biggest variable separating an efficient inference cluster from a wasteful one is request batching. Continuous batching⁴ (as implemented in vLLM and TensorRT-LLM) delivers between 4× and 11× the throughput of naive serving. The cheapest cluster to run is also the one whose operators pay the closest attention, which is unromantic but true.
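A toy scheduling model shows where the batching gain comes from. Static batching holds every slot until the longest reply in the batch finishes; continuous batching hands a freed slot to the next queued request immediately. This is a simplified illustration, not how vLLM or TensorRT-LLM are implemented: it counts decode steps only, ignores prefill and memory limits, and the reply lengths are made up.

```python
import heapq


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest reply finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when a finished slot is refilled from the queue at once."""
    slots = [0] * batch_size  # step at which each slot becomes free
    heapq.heapify(slots)
    finish = 0
    for n in lengths:
        start = heapq.heappop(slots)  # earliest available slot
        end = start + n
        finish = max(finish, end)
        heapq.heappush(slots, end)
    return finish


reply_lengths = [50, 400, 60, 800, 40, 300, 70, 100]  # hypothetical requests
print(static_batching_steps(reply_lengths, batch_size=4))      # 1100
print(continuous_batching_steps(reply_lengths, batch_size=4))  # 800
```

Even in this tiny example the same work finishes in fewer steps, because short replies no longer wait on long ones. Real gains depend heavily on the spread of reply lengths and on how much memory the cache manager can reclaim.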

The shape of the problem at scale

Multiply 0.34 Wh by a billion daily queries, roughly the volume major chat assistants now serve, and the daily figure is around 340 MWh, or the daily output of two medium-sized utility-scale solar arrays. Multiply by the 3.1× idle premium and you get ~1.05 GWh/day, roughly the daily consumption of 34,500 average US households at 30.5 kWh each⁵.
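The scale-up is simple enough to check directly. The sketch below uses only figures already given in the text and in reference 5; the variable names are illustrative.

```python
WH_PER_ANSWER = 0.34          # per-answer estimate from the text
QUERIES_PER_DAY = 1e9         # assumed daily volume from the text
IDLE_PREMIUM = 3.1            # idle multiplier from the text
HOUSEHOLD_KWH_PER_DAY = 30.5  # mean US household, reference 5

raw_mwh = WH_PER_ANSWER * QUERIES_PER_DAY / 1e6          # Wh -> MWh
with_idle_gwh = raw_mwh * IDLE_PREMIUM / 1e3             # MWh -> GWh
households = with_idle_gwh * 1e6 / HOUSEHOLD_KWH_PER_DAY  # GWh -> kWh

print(f"raw:        {raw_mwh:.0f} MWh/day")        # 340
print(f"with idle:  {with_idle_gwh:.2f} GWh/day")  # 1.05
print(f"households: {households:,.0f}")
```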

That is large in absolute terms. It is small as a fraction of the data-center industry, which already consumes 1–2% of global electricity. And it is dwarfed by what the same accelerators draw when they're training the next generation of models — which is the actual story most coverage should be telling.

What changes the answer

Three levers materially move the number:

Smaller models doing more work. A well-distilled 8B model now matches a 2023-vintage 70B on the majority of consumer prompts. Since per-token energy scales with parameter count, that's roughly a 9× direct reduction (70 / 8 ≈ 8.75).
Speculative decoding. A small draft model proposes several tokens ahead, and the large model verifies them all in a single parallel pass⁶. Real-world speedups of 2–3× are now routine.
Better attention. FlashAttention-3 and its successors approximately halve the memory bandwidth needed per attention head — and bandwidth is what dominates the cost.
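For the speculative-decoding lever, the expected gain can be sketched with the closed-form estimate from Leviathan et al. (reference 6), which assumes each drafted token is accepted independently with probability alpha. The specific values of `alpha`, `gamma`, and `draft_cost_ratio` below are hypothetical, chosen only to illustrate the shape of the result.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens produced per large-model pass.

    alpha: probability that a drafted token is accepted.
    gamma: number of tokens the draft model proposes per round.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, draft_cost_ratio: float) -> float:
    """Wall-clock speedup; draft_cost_ratio is draft time / large-model time."""
    round_cost = gamma * draft_cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, gamma) / round_cost


# Hypothetical operating point: 80% acceptance, 4 drafted tokens,
# draft model 20x cheaper per step than the large model.
print(f"{expected_tokens_per_pass(0.8, 4):.2f} tokens per pass")
print(f"{expected_speedup(0.8, 4, 0.05):.2f}x speedup")
```

A wall-clock speedup is not automatically an equal energy saving: the draft model's work and any rejected tokens still cost energy, so the per-answer reduction is likely somewhat smaller than the latency gain.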

None of these are speculative. All three are in production at multiple providers as of Q2 2026. A plausible "good engineering" trajectory cuts per-answer energy by 6–10× over the next 18 months, which is roughly how fast the raw demand is growing — meaning the total inference electricity draw could plateau even as usage multiplies.

That's an unfashionable conclusion. It's also what the arithmetic says.

References & sources

1. Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361. Per-token compute identity, eq. 2.2.
2. NVIDIA H100 SXM5 datasheet, Section 4 (memory bandwidth-bound regimes).
3. Uptime Institute, Global Data Center Survey 2025, table 4: median enterprise PUE 1.58, hyperscale PUE 1.18.
4. Kwon, W. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention. arXiv:2309.06180.
5. US EIA, 2024 Residential Energy Consumption Survey: mean US household 30.5 kWh/day.
6. Leviathan, Y., Kalman, M., Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192.
