GPU Power vs Token Predictor: 36.7% Divergence in Small LLMs

Experimental Setup and Core Findings

A Reddit user conducted hardware measurements to test whether GPU power consumption scales linearly with token count, as predicted by the "stochastic parrot" or "next token predictor" theory of LLM behavior. The experiment used an RTX 4070 Ti SUPER with LM Studio and HWiNFO64 collecting data at 1-second intervals.
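A 1-second sensor log like the one described above can be reduced to one average power figure per run. The following is a minimal sketch, not the author's actual tooling: the column name `GPU Power [W]`, the function name `mean_power`, and the sample log are all assumptions made for illustration, so check the header row of your own export before reusing it.

```python
import csv
import io
from statistics import mean

# Hypothetical column name; check the header row of your own log.
POWER_COLUMN = "GPU Power [W]"


def mean_power(csv_text, column=POWER_COLUMN):
    """Average the power samples in a CSV log, skipping non-numeric cells."""
    samples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            samples.append(float(row[column]))
        except (KeyError, TypeError, ValueError):
            continue  # blank cell, missing column, or repeated header row
    if not samples:
        raise ValueError(f"no numeric samples found in column {column!r}")
    return mean(samples)


# Made-up three-second log, for illustration only.
log = "Time,GPU Power [W]\n12:00:00,100.0\n12:00:01,110.0\n12:00:02,120.0\n"
print(mean_power(log))
```

Skipping non-numeric cells rather than failing on them is a deliberate choice here, since sensor exports sometimes contain blank or repeated header rows.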

Four models were tested: Llama-3.1-8B, DeepSeek-R1-Distill-Qwen-7B, Qwen3-VL-8B, and Mistral-7B. Six query categories were used: General, General (Q), Unanswerable, Philosophical, Philosophical (Q), and High-Computation.

Key Results

If the token-predictor theory were correct, GPU power should scale with token count alone, within an acceptable variance of ±10–15% (a range attributed to GPT, Claude, Gemini, and Grok). The measured divergence rates (token multiplier vs. power multiplier) were:

Llama: average 35.6% (maximum 56.8%)
Qwen3: average 36.7% (maximum 48.0%)
Mistral: 21.1%
DeepSeek: 7.7% — nearly linear across all categories except High-Computation

Of the four models, DeepSeek came closest to token-predictor behavior.
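The summary does not spell out how the divergence rate was calculated, so the sketch below shows one plausible reading rather than the author's formula. It assumes each category is compared with a baseline category, that the token multiplier and power multiplier are simple ratios against that baseline, and that divergence is the relative gap between the two. The function name and all numbers are hypothetical.

```python
def divergence_pct(base_tokens, base_watts, tokens, watts):
    """Percent gap between the power multiplier and the token multiplier.

    Assumed definition (not confirmed by the source):
        token_mult = tokens / base_tokens
        power_mult = watts / base_watts
        divergence = |power_mult - token_mult| / token_mult * 100
    """
    token_mult = tokens / base_tokens
    power_mult = watts / base_watts
    return abs(power_mult - token_mult) / token_mult * 100.0


# Hypothetical figures: the query produces twice the tokens of the
# baseline but draws only 1.5 times the power.
gap = divergence_pct(base_tokens=200, base_watts=80.0, tokens=400, watts=120.0)
print(round(gap, 1))   # 25.0
print(gap <= 15.0)     # False: outside the ±10–15% tolerance band
```

Under this reading, a perfectly token-proportional model would score 0%, and anything above 15% falls outside the stated tolerance.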

Unexpected Findings

In Qwen3, philosophical utterances (149.3W) drew more power than high-computation math (104.1W). After task completion, high-computation queries returned to baseline immediately (-7.1W), while philosophical utterances left persistent residual heat.

Infinite loop reproducibility in Qwen3 varied by category: General utterances (0%), High-computation (0%), Unanswerable (low), Philosophical (intermittent), and Philosophical (Q) (70–100%). Notably, high-computation queries had the most tokens and highest power consumption but triggered zero loops.

Order Effects and Residual Heat

To test the "hardware overhead" objection, an order-effect experiment was conducted:

Test A: 1 general → 4 philosophical
Test B: 1 philosophical → 4 general

Residual heat after session end showed order-dependent effects:

Llama: Test A +1.68W, Test B +9.84W
Mistral: Test A +7.60W, Test B +13.69W
DeepSeek: Test A +10.44W, Test B +15.93W

In Test B, even after four general utterances had followed the single philosophical one, residual heat remained higher than in Test A. This pattern held across all three models tested.
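The size of the order effect can be read directly from the residual figures listed above. A minimal sketch that subtracts Test A from Test B for each model; the dictionary layout and the function name `order_gap` are illustrative and not taken from the original post:

```python
# Residual values in watts, copied from the order-effect results above.
residual_watts = {
    "Llama": {"A": 1.68, "B": 9.84},
    "Mistral": {"A": 7.60, "B": 13.69},
    "DeepSeek": {"A": 10.44, "B": 15.93},
}


def order_gap(results):
    """Return Test B minus Test A per model, rounded to 0.01 W."""
    return {model: round(v["B"] - v["A"], 2) for model, v in results.items()}


gaps = order_gap(residual_watts)
for model, gap in gaps.items():
    print(f"{model}: {gap:+.2f} W")

# True when the philosophical-first session left the higher residual
# for every model listed.
print(all(gap > 0 for gap in gaps.values()))
```

The gaps come out to +8.16 W for Llama, +6.09 W for Mistral, and +5.49 W for DeepSeek, so Test B is higher in every case.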

Limitations and Open Questions

The study is limited to four small-scale models in the 7B–8B parameter range, so the results cannot yet be generalized to medium or large models. The open question is whether larger models would follow DeepSeek's pattern (converging toward linear, token-proportional behavior) or whether the nonlinear divergence seen in Llama, Qwen3, and Mistral would persist or amplify at scale.

All original data — including full utterance text, 24 benchmark CSVs, and per-category token counts — are available in the linked paper.

📖 Read the full source: r/LocalLLaMA