ThermoQA: Open Benchmark for Engineering Thermodynamics Tests LLMs on 293 Calculation Problems

ThermoQA Benchmark Overview
ThermoQA is an open benchmark for engineering thermodynamics with 293 open-ended calculation problems across three tiers:
- Tier 1: Property lookups (110 questions). Example: "What is the enthalpy of water at 5 MPa, 400°C?"
- Tier 2: Component analysis (101 questions). Turbines, compressors, and heat exchangers, with energy, entropy, and exergy calculations.
- Tier 3: Full cycle analysis (82 questions). Rankine, Brayton, and combined-cycle gas turbine cycles.
Ground truth is computed with CoolProp (IAPWS-IF97). There are no multiple-choice options; models must produce the numerical values themselves.
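Open-ended numerical answers have to be compared against the reference value somehow. The summary does not describe the benchmark's grading rule, so the following is only a minimal sketch of one common approach, a relative tolerance check. The function names and the 2% tolerance are assumptions for illustration, not ThermoQA's actual grader; the 2,586 and 1,887 kJ/kg pair is taken from the findings reported here, and the other numbers are made up.

```python
def within_tolerance(predicted, reference, rel_tol=0.02, abs_tol=1e-9):
    """Hypothetical rule: accept answers within rel_tol of the reference value."""
    return abs(predicted - reference) <= max(rel_tol * abs(reference), abs_tol)


def accuracy(predictions, references, rel_tol=0.02):
    """Percentage of predictions accepted by within_tolerance."""
    hits = sum(
        within_tolerance(p, r, rel_tol) for p, r in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


# Illustrative answers against illustrative references.
references = [2586.0, 200.0, 60.0]
predictions = [1887.0, 203.0, 60.5]
print(f"accuracy = {accuracy(predictions, references):.1f}%")
```

A relative tolerance treats a 3 kJ/kg miss on a small enthalpy difference more strictly than the same miss on a large absolute enthalpy, which is why a real grader might pick tolerances per quantity.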
Leaderboard Results (3-run mean)
- 1. Claude Opus 4.6: Tier 1: 96.4%, Tier 2: 92.1%, Tier 3: 93.6%, Composite: 94.1%
- 2. GPT-5.4: Tier 1: 97.8%, Tier 2: 90.8%, Tier 3: 89.7%, Composite: 93.1%
- 3. Gemini 3.1 Pro: Tier 1: 97.9%, Tier 2: 90.8%, Tier 3: 87.5%, Composite: 92.5%
- 4. DeepSeek-R1: Tier 1: 90.5%, Tier 2: 89.2%, Tier 3: 81.0%, Composite: 87.4%
- 5. Grok 4: Tier 1: 91.8%, Tier 2: 87.9%, Tier 3: 80.4%, Composite: 87.3%
- 6. MiniMax M2.5: Tier 1: 85.2%, Tier 2: 76.2%, Tier 3: 52.7%, Composite: 73.0%
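The summary does not say how the composite score is derived. The reported figures appear consistent with a mean of the tier scores weighted by question count (110, 101, 82), which the sketch below checks. The weighting scheme is an inference from the numbers above, not a documented formula, and small differences are expected because the tier scores are already rounded.

```python
# Question counts per tier, as listed in the overview.
QUESTIONS = (110, 101, 82)

# (Tier 1, Tier 2, Tier 3, reported composite), copied from the leaderboard.
LEADERBOARD = {
    "Claude Opus 4.6": (96.4, 92.1, 93.6, 94.1),
    "GPT-5.4": (97.8, 90.8, 89.7, 93.1),
    "Gemini 3.1 Pro": (97.9, 90.8, 87.5, 92.5),
    "DeepSeek-R1": (90.5, 89.2, 81.0, 87.4),
    "Grok 4": (91.8, 87.9, 80.4, 87.3),
    "MiniMax M2.5": (85.2, 76.2, 52.7, 73.0),
}


def weighted_composite(tier_scores, weights=QUESTIONS):
    """Mean of tier scores weighted by the number of questions in each tier."""
    total = sum(score * w for score, w in zip(tier_scores, weights))
    return total / sum(weights)


for model, (t1, t2, t3, reported) in LEADERBOARD.items():
    computed = weighted_composite((t1, t2, t3))
    print(f"{model:16s} computed={computed:6.2f} reported={reported:5.1f}")
```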
Key Findings
- Rankings flip between tiers: Gemini leads Tier 1 (97.9%) but drops to #3 on Tier 3 (87.5%). Opus is #3 on lookups but #1 on cycle analysis, which suggests that memorizing steam tables is not the same as reasoning through a cycle.
- Supercritical water breaks everything: scores spread by 44.5 percentage points. Models appear to memorize textbook tables but struggle with the nonlinear region near the critical point. One model gave h = 1,887 kJ/kg where the correct value is 2,586 kJ/kg, a 27% error.
- R-134a is the blind spot: all models collapse to 44–63% on refrigerant problems versus 75–98% on water, suggesting a training data bias toward water.
- Run-to-run consistency varies by an order of magnitude or more: GPT-5.4 shows σ = ±0.1% on Tier 3, while DeepSeek-R1 shows σ = ±2.5% on Tier 2.
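Two of the findings above can be reproduced directly from the numbers in this summary. The sketch below ranks the models per tier from the leaderboard scores and computes the relative error of the quoted supercritical enthalpy. The helper names are illustrative and not part of the ThermoQA code.

```python
# (Tier 1, Tier 2, Tier 3) scores copied from the leaderboard.
TIER_SCORES = {
    "Claude Opus 4.6": (96.4, 92.1, 93.6),
    "GPT-5.4": (97.8, 90.8, 89.7),
    "Gemini 3.1 Pro": (97.9, 90.8, 87.5),
    "DeepSeek-R1": (90.5, 89.2, 81.0),
    "Grok 4": (91.8, 87.9, 80.4),
    "MiniMax M2.5": (85.2, 76.2, 52.7),
}


def rank_on(tier_index):
    """Map each model to its 1-based rank on the given tier (0, 1, or 2)."""
    ordered = sorted(
        TIER_SCORES, key=lambda model: TIER_SCORES[model][tier_index], reverse=True
    )
    return {model: position for position, model in enumerate(ordered, start=1)}


def relative_error(predicted, reference):
    """Absolute error as a fraction of the reference value."""
    return abs(predicted - reference) / abs(reference)


tier1_rank = rank_on(0)
tier3_rank = rank_on(2)
for model in ("Gemini 3.1 Pro", "Claude Opus 4.6"):
    print(f"{model}: Tier 1 rank {tier1_rank[model]}, Tier 3 rank {tier3_rank[model]}")

# Supercritical example: model answer vs. reference enthalpy, both in kJ/kg.
error = relative_error(1887.0, 2586.0)
print(f"relative error = {error:.1%}")
```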
Open-Source Resources
- Dataset: https://huggingface.co/datasets/olivenet/thermoqa
- Code: https://github.com/olivenet-iot/ThermoQA
📖 Read the full source: r/LocalLLaMA