LLM Risk: What the Chat Box Hides

Large language models are powerful, but they are not free, neutral, or harmless by default. A simple prompt can hide four things: expensive infrastructure, energy and cooling demand, privacy and security exposure, and the risk of confident wrong answers. This page helps you see the real process underneath: training builds the model, inference serves every prompt, and responsible use means checking cost, data, accuracy, and dependence.

Training cost: Building frontier-scale models can require millions of GPU-hours before any user sends a prompt.

Inference cost: Every chat, summary, or code request uses serving hardware, memory, networking, and electricity.

Information risk: Private data, biased outputs, hallucinations, and over-trust can create real harm.

Good practice: Use LLMs where they add value, verify important outputs, and avoid unnecessary repeated calls.

Pricing checked against OpenAI API pricing on May 19, 2026. Some values below are labeled as derived estimates.

1. The Physical Cost

LLMs feel like software, but they run on physical infrastructure: GPUs, memory, storage, power, cooling, and networks. These examples show the scale.

Power draw per top-end GPU

Official

700 W

NVIDIA lists the H100 SXM with up to 700W TDP and 80 GB memory. Even one 8-GPU server means 5.6 kW for GPUs alone, before CPUs, networking, storage, and cooling.

Training scale

Model card

30.84M H100 GPU-hours

Meta reports Llama 3.1 405B used 30.84 million H100 GPU-hours for training, with 15T+ pretraining tokens. That is industrial-scale compute, not desktop-scale computing.
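As a rough check on the power and training figures above, the sketch below multiplies GPU counts and GPU-hours by the 700 W rating. It assumes every GPU draws its full TDP the whole time and ignores CPUs, networking, storage, and cooling, so it is a GPU-only estimate. The function names are illustrative.

```python
H100_TDP_KW = 0.7  # NVIDIA's listed maximum for the H100 SXM (700 W)

def server_gpu_power_kw(gpus_per_server: int, tdp_kw: float = H100_TDP_KW) -> float:
    """Peak GPU power for one server, excluding CPUs, networking, and cooling."""
    return gpus_per_server * tdp_kw

def training_gpu_energy_gwh(gpu_hours: float, tdp_kw: float = H100_TDP_KW) -> float:
    """GPU-only energy in GWh, assuming full TDP for every GPU-hour."""
    kwh = gpu_hours * tdp_kw
    return kwh / 1_000_000  # 1 GWh = 1,000,000 kWh

print(round(server_gpu_power_kw(8), 1))             # 5.6
print(round(training_gpu_energy_gwh(30.84e6), 1))   # 21.6
```

The second result is the 21.6 GWh derived estimate listed under Sources and Notes.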

Cooling water

Research

700,000 L direct
5.4M L total

Li et al. (2023) estimated that training GPT-3 in Microsoft's U.S. data centers could directly consume 700,000 liters of freshwater, and about 5.4 million liters in total once indirect water use is included.

Model storage

Derived

~810 GB weights

A 405B-parameter model stored at FP16 needs about 810 GB just for weights. Replicas, checkpoints, optimizer state, KV cache, and backup copies push real storage and memory requirements much higher.
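The weights figure is parameter count times bytes per parameter. The sketch below shows that arithmetic for a few common precisions and, as an added illustration, how many 80 GB accelerators it would take just to hold the weights. The helper names are hypothetical, and the GPU count ignores KV cache, activations, and any replication.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weights_size_gb(num_params: float, precision: str = "fp16") -> float:
    """Size of the raw weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

def min_gpus_for_weights(size_gb: float, gpu_memory_gb: float = 80) -> int:
    """Lower bound on GPUs needed to fit the weights alone."""
    return math.ceil(size_gb / gpu_memory_gb)

fp16_size = weights_size_gb(405e9)
print(fp16_size)                        # 810.0
print(min_gpus_for_weights(fp16_size))  # 11
```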

Training budget

Epoch AI

$100M+

Epoch AI reports that the most advanced models now cost hundreds of millions of dollars to train, with about half of that spending going to GPUs and the rest to other hardware and energy.

Inference can dominate

HotCarbon

25x training emissions

Chien et al. estimate that a ChatGPT-like service at 11 million requests per hour could generate 12.8k metric tons CO2 per year, about 25 times the emissions of training GPT-3 once.
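To make the scale of that comparison concrete, the sketch below breaks the quoted numbers down per request. The per-request and implied-training values are derived here from the three quoted figures only; they are not numbers taken from the paper, and they assume the request rate holds steady for a full 365-day year.

```python
# Figures as quoted above from Chien et al. (HotCarbon 2023).
REQUESTS_PER_HOUR = 11_000_000
SERVICE_TONNES_CO2_PER_YEAR = 12_800
TRAINING_MULTIPLE = 25

HOURS_PER_YEAR = 24 * 365  # assumes constant load all year

requests_per_year = REQUESTS_PER_HOUR * HOURS_PER_YEAR
# 1 metric ton = 1,000,000 grams
grams_per_request = SERVICE_TONNES_CO2_PER_YEAR * 1_000_000 / requests_per_year
implied_training_tonnes = SERVICE_TONNES_CO2_PER_YEAR / TRAINING_MULTIPLE

print(f"{requests_per_year:,} requests per year")                   # 96,360,000,000
print(f"{grams_per_request:.3f} g CO2 per request")                 # 0.133
print(f"{implied_training_tonnes:.0f} t CO2 implied per training")  # 512
```

Each request is tiny on its own. The total comes from the roughly 96 billion requests per year.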

Strong takeaway: the interface is simple, but the system behind it is large. Cost does not disappear because the prompt box is clean.

2. The Everyday Cost

Training is expensive, but repeated inference is where everyday usage becomes a recurring bill. Change the numbers to see the effect.

Model tier

Pricing snapshot checked on May 19, 2026. GPT-5.5 is the expensive frontier comparison; GPT-5.4 is the recommended balanced general-use comparison.

Daily requests

100,000

Levels: 1k, 10k, 100k, 1M, 10M requests per day.

Average input tokens per request

Default example: a moderately detailed prompt or short conversation turn.

Average output tokens per request

Default example: a concise answer, not a very long report.

Calculator outputs: estimated API cost per day, per month, and per year, plus tokens per day.
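The calculation behind these outputs can be sketched as follows. Prices are the per-million-token values listed under Sources and Notes. The token counts (500 in, 300 out) are placeholder assumptions rather than the page's defaults, and the function name and 30-day month are illustrative choices.

```python
# USD per million tokens, from the May 19, 2026 pricing snapshot cited in Sources.
PRICES_PER_MILLION = {
    "GPT-5.5": {"input": 5.00, "output": 30.00},
    "GPT-5.4": {"input": 2.50, "output": 15.00},
    "GPT-5.4 mini": {"input": 0.75, "output": 4.50},
}

def estimate_api_cost(tier, daily_requests, input_tokens, output_tokens):
    """Return daily token volume and API cost per day, month, and year."""
    price = PRICES_PER_MILLION[tier]
    input_per_day = daily_requests * input_tokens
    output_per_day = daily_requests * output_tokens
    cost_per_day = (
        input_per_day * price["input"] + output_per_day * price["output"]
    ) / 1_000_000
    return {
        "tokens_per_day": input_per_day + output_per_day,
        "cost_per_day": cost_per_day,
        "cost_per_month": cost_per_day * 30,  # assumes a 30-day month
        "cost_per_year": cost_per_day * 365,
    }

# Placeholder workload: 100,000 requests/day, 500 input and 300 output tokens each.
for tier in PRICES_PER_MILLION:
    result = estimate_api_cost(tier, 100_000, 500, 300)
    print(f"{tier}: ${result['cost_per_day']:,.2f}/day, "
          f"${result['cost_per_year']:,.2f}/year")
```

With these placeholder inputs, GPT-5.4 comes to $575 per day, or $209,875 per year. The same workload on GPT-5.5 costs twice that, and on GPT-5.4 mini it costs $172.50 per day.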

Parameter Estimate for These Three Tiers

GPT-5.4 mini: exact parameter count is not publicly disclosed. A reasonable teaching estimate is tens of billions of parameters, roughly 20B-80B.
GPT-5.4: exact parameter count is not publicly disclosed. A reasonable teaching estimate is hundreds of billions of parameters or an MoE-class system with comparable effective scale, roughly 200B+.
GPT-5.5: exact parameter count is not publicly disclosed. It is included here as the current expensive frontier tier, so treat its scale as larger or more compute-intensive than GPT-5.4, not as a known parameter count.
These are inferred ranges, not official OpenAI numbers. They are included here to help readers connect model tier with likely memory, storage, and infrastructure scale.

Risk meters: visible inference bill, power and data-center pressure, lock-in and operating risk.

A moderate-volume workload already produces a real recurring bill. The API bill is only the visible layer: the power, cooling, storage, and capacity footprint underneath it does not appear on the invoice.

3. Why the Risk Stays Hidden

What most users see

A chat box and a quick answer.
A simple subscription or token price.
No direct view of GPU clusters or cooling systems.
No obvious sign of storage replication, checkpointing, or traffic spikes.

What sits underneath

High-end accelerator hardware with large power draw.
Cooling water or equivalent cooling infrastructure.
Large model weights, caches, checkpoints, and replicas.
Recurring inference traffic that can outweigh one-time training impacts over time.

Main teaching point

Training is expensive, but repeated inference at scale can be even more expensive over time.
Water and power matter because LLM infrastructure is physical, not magical.
Model size affects not just quality, but memory, storage, cooling, and cost.
Best practice: use LLMs where they add high value, then reuse outputs locally when possible.

4. Use LLMs Deliberately

A strong LLM workflow is not "never use AI". It is: use it where it helps, protect sensitive data, verify important claims, and control repeated cost.

Check the data

Do not paste private, confidential, medical, financial, legal, or student-identifiable data unless the system is approved for that use.
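One small habit that supports this check is masking obvious identifiers before text leaves your machine. The sketch below is a minimal illustration using two simple regular expressions for email addresses and phone-like numbers. It is an assumption-laden example, not a complete privacy filter: it will miss names, addresses, and ID numbers, and it can also mask harmless digit sequences. It does not replace an approved system.

```python
import re

# Deliberately simple patterns; real data needs a reviewed, tested policy.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Contact jane.doe@example.com or +1 (555) 010-2345."))
# Contact [EMAIL] or [PHONE].
```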

Check the answer

LLMs can sound certain when they are wrong. Verify facts, citations, calculations, and code before using the output.

Check the bias

Training data can contain stereotypes or gaps. Review outputs for unfair assumptions, missing perspectives, and cultural context.

Check the cost

Repeated prompts, long context, and long answers multiply token use. Cache, reuse, summarise, and batch where possible.
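Caching is the simplest of these to sketch. The example below stores answers in memory, keyed by a hash of the model name and prompt, so an identical request is paid for only once. `call_model` is a hypothetical stand-in for whatever client function you use; here a fake function replaces it so the example runs without any API.

```python
import hashlib
from typing import Callable

class PromptCache:
    """In-memory cache keyed by a hash of (model, prompt)."""

    def __init__(self, call_model: Callable[[str, str], str]):
        self._call_model = call_model  # hypothetical: (model, prompt) -> answer
        self._store = {}
        self.hits = 0
        self.misses = 0

    def ask(self, model: str, prompt: str) -> str:
        raw_key = f"{model}\n{prompt.strip()}".encode("utf-8")
        key = hashlib.sha256(raw_key).hexdigest()
        if key in self._store:
            self.hits += 1          # reused answer, no new tokens billed
            return self._store[key]
        self.misses += 1            # first time: one real call
        answer = self._call_model(model, prompt)
        self._store[key] = answer
        return answer

def fake_model(model: str, prompt: str) -> str:
    """Stand-in for a real API call, for demonstration only."""
    return f"[{model}] answer to: {prompt}"

cache = PromptCache(fake_model)
cache.ask("GPT-5.4 mini", "Define inference.")
cache.ask("GPT-5.4 mini", "Define inference.")
print(cache.hits, cache.misses)  # 1 1
```

Exact-match caching only helps when prompts repeat verbatim, and cached answers can go stale, so a real deployment would also need an expiry rule.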

Check the dependence

If a workflow only works with one provider or one large model, there is lock-in risk. Keep exports, fallbacks, and human knowledge.
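A fallback can be as small as an ordered list of providers tried in turn. The sketch below assumes each provider is wrapped in a plain function that takes a prompt and returns text; the provider names and the two demo functions are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Provider = Tuple[str, Callable[[str], str]]

def ask_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, answer) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure moves to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

def primary(prompt: str) -> str:
    raise TimeoutError("primary unavailable")  # simulated outage

def backup(prompt: str) -> str:
    return "ok: " + prompt

print(ask_with_fallback("hi", [("primary", primary), ("backup", backup)]))
# ('backup', 'ok: hi')
```

Keeping prompts and outputs in a provider-neutral form is what makes a swap like this possible in practice.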

Check the value

Use the smallest capable model and the shortest useful prompt. The best prompt is not always the biggest prompt.

Sources and Notes

NVIDIA H100 specs: up to 700W TDP and 80 GB memory. nvidia.com
Meta Llama 3.1 405B model card: 30.84M H100 GPU-hours, 15T+ pretraining tokens, 700W hardware reference. build.nvidia.com
Epoch AI, June 19, 2024: the most advanced models now cost hundreds of millions of dollars to train. epoch.ai
Li et al., 2023, "Making AI Less 'Thirsty'": GPT-3 training estimated at 700,000 L direct water and ~5.4M L total water footprint; about 500 mL of water for 10-50 prompts depending on where and when inference runs. arxiv.org
Chien et al., HotCarbon 2023: a ChatGPT-like service at 11M requests/hour estimated at 12.8k metric tons CO2/year and about 25x the emissions of training GPT-3 once. hotcarbon.org
OpenAI API pricing checked May 19, 2026: GPT-5.5 input $5.00/M tokens, output $30/M; GPT-5.4 input $2.50/M, output $15/M; GPT-5.4 mini input $0.75/M, output $4.50/M. openai.com/api/pricing
Derived estimates on this page: 21.6 GWh for Llama 3.1 405B training GPU draw is from 30.84M GPU-hours x 0.7 kW; 810 GB model storage is from 405B parameters x 2 bytes/parameter at FP16.
Parameter estimates for GPT-5.4 mini, GPT-5.4, and GPT-5.5 are not official. They are teaching-oriented inference ranges based on public pricing tiers, capability tiering, and current frontier-model scale patterns.