GPU Power Consumption Deviates from Token Predictor Theory in Small LLMs

Experimental Setup and Core Findings
A Reddit user conducted hardware measurements to test whether GPU power consumption scales linearly with token count, as predicted by the "stochastic parrot" or "next token predictor" theory of LLM behavior. The experiment used an RTX 4070 Ti SUPER with LM Studio and HWiNFO64 collecting data at 1-second intervals.
Four models were tested: Llama-3.1-8B, DeepSeek-R1-Distill-Qwen-7B, Qwen3-VL-8B, and Mistral-7B. Six query categories were used: General, General (Q), Unanswerable, Philosophical, Philosophical (Q), and High-Computation.
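A minimal sketch of how a per-query average power figure could be derived from a 1-second log. The CSV layout, the column names `Time` and `GPU Power [W]`, and the helper `mean_power` are assumptions for illustration; an actual HWiNFO64 export depends on which sensors are logged and uses its own timestamp format.

```python
import csv
import io
from statistics import mean


def mean_power(csv_text, start_s, end_s,
               power_col="GPU Power [W]", time_col="Time"):
    """Average the power column over samples with start_s <= t < end_s.

    Assumes time_col holds elapsed seconds; a real log would need its own
    timestamp parsing.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    samples = [
        float(row[power_col])
        for row in reader
        if start_s <= float(row[time_col]) < end_s
    ]
    if not samples:
        raise ValueError("no samples in the requested window")
    return mean(samples)


# Made-up log for illustration, not data from the experiment.
EXAMPLE_LOG = """Time,GPU Power [W]
0,30.0
1,32.0
2,120.0
3,130.0
4,110.0
5,31.0
"""

# Average over the samples at t = 2, 3, 4 (the hypothetical generation window).
generation_avg = mean_power(EXAMPLE_LOG, 2, 5)
```

Averaging over an explicit window keeps idle samples before and after the query from diluting the figure.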
Key Results
If token predictor theory were correct, GPU power should scale with token count alone, within an acceptable variance of ±10–15% (a tolerance attributed to GPT, Claude, Gemini, and Grok). Actual divergence rates (token multiplier vs. power multiplier) were:
- Llama: average 35.6% (maximum 56.8%)
- Qwen3: average 36.7% (maximum 48.0%)
- Mistral: 21.1%
- DeepSeek: 7.7%, nearly linear across all categories except High-Computation
Of the four models, DeepSeek came closest to token-predictor behavior.
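The summary does not spell out the divergence formula. One plausible reading, sketched here as an assumption, compares each category's token count and power draw against a baseline category and reports how far the power multiplier falls from the token multiplier. The function name and the example numbers are hypothetical.

```python
def divergence_pct(tokens, power_w, base_tokens, base_power_w):
    """Percent gap between the power multiplier and the token multiplier.

    Assumed definition: |power_mult - token_mult| / token_mult * 100.
    """
    token_mult = tokens / base_tokens
    power_mult = power_w / base_power_w
    return abs(power_mult - token_mult) / token_mult * 100.0


# Hypothetical case: twice the baseline tokens, but only 1.5x the power.
example = divergence_pct(tokens=400, power_w=90.0,
                         base_tokens=200, base_power_w=60.0)

# Perfectly linear scaling gives zero divergence under this definition.
linear = divergence_pct(tokens=400, power_w=120.0,
                        base_tokens=200, base_power_w=60.0)
```

Under this reading, a result above the ±10–15% band would count as a departure from token-proportional scaling.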
Unexpected Findings
In Qwen3, philosophical utterances (149.3W) drew more power than high-computation math (104.1W). After task completion, high-computation queries returned to baseline immediately (-7.1W), while philosophical utterances left persistent residual heat.
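One way to read the residual figures is as mean power shortly after completion minus the idle mean measured before the query. The sketch below assumes that definition; the window choice, the helper name `residual_w`, and the sample values are illustrative and not taken from the experiment.

```python
from statistics import mean


def residual_w(idle_samples, post_samples):
    """Post-completion mean power minus pre-query idle mean, in watts."""
    return mean(post_samples) - mean(idle_samples)


# Made-up 1-second samples.
idle = [30.0, 31.0, 29.0]   # before the query
post = [36.0, 35.0, 34.0]   # after generation finished

# A positive value means the GPU is still drawing more than at idle.
residual = residual_w(idle, post)
```

If the reported -7.1W was computed this way, a negative value would mean the GPU settled slightly below its pre-query idle draw.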
Infinite loop reproducibility in Qwen3 varied by category: General utterances (0%), High-computation (0%), Unanswerable (low), Philosophical (intermittent), and Philosophical (Q) (70–100%). Notably, high-computation queries had the most tokens and highest power consumption but triggered zero loops.
Order Effects and Residual Heat
To test the "hardware overhead" objection, an order-effect experiment was conducted:
- Test A: 1 general → 4 philosophical
- Test B: 1 philosophical → 4 general
Residual heat after session end showed order-dependent effects:
- Llama: Test A +1.68W, Test B +9.84W
- Mistral: Test A +7.60W, Test B +13.69W
- DeepSeek: Test A +10.44W, Test B +15.93W
Even after 4 general utterances followed the philosophical one in Test B, residual heat was higher than in Test A. This pattern held for all three models tested.
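The size of the order effect can be worked out directly from the reported residuals. The snippet below uses only the figures listed above; the variable names are illustrative.

```python
# Residual power in watts after session end, as reported for each test order.
RESIDUAL_W = {
    "Llama": {"A": 1.68, "B": 9.84},
    "Mistral": {"A": 7.60, "B": 13.69},
    "DeepSeek": {"A": 10.44, "B": 15.93},
}

# Test B minus Test A for each model, rounded to the precision of the inputs.
order_gap = {
    model: round(r["B"] - r["A"], 2) for model, r in RESIDUAL_W.items()
}

# True if the philosophical-first order left more residual in every model.
b_always_higher = all(gap > 0 for gap in order_gap.values())
```

The gaps come out to 8.16W for Llama, 6.09W for Mistral, and 5.49W for DeepSeek, so Test B exceeds Test A in every case.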
Limitations and Open Questions
The study is limited to four small models in the 7B–8B parameter range, so generalization to medium or large models requires further validation. The open question is whether larger models would follow DeepSeek's pattern (converging toward linear, token-proportional behavior) or whether the nonlinear divergence seen in Llama, Qwen3, and Mistral would persist or amplify at scale.
All original data — including full utterance text, 24 benchmark CSVs, and per-category token counts — are available in the linked paper.
📖 Read the full source: r/LocalLLaMA