ThermoQA Benchmark: 293 Problems Test LLMs on Thermodynamics

ThermoQA Benchmark Overview

ThermoQA is an open benchmark for engineering thermodynamics with 293 open-ended calculation problems across three tiers:

Tier 1: Property lookups (110 questions). Example: "What is the enthalpy of water at 5 MPa, 400°C?"
Tier 2: Component analysis (101 questions). Turbines, compressors, and heat exchangers, with energy, entropy, and exergy calculations.
Tier 3: Full cycle analysis (82 questions). Rankine, Brayton, and combined-cycle gas turbine cycles.

Ground truth comes from CoolProp (IAPWS-IF97). No multiple choice — models must produce exact numerical values.
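
As a rough illustration of how a Tier 1 reference value could be generated, the sketch below queries CoolProp's PropsSI function for the example question above. The helper names (to_si, reference_enthalpy_kj_per_kg) are hypothetical, and the default "Water" fluid string is an assumption; the benchmark's own generation scripts and backend settings are not shown here.

```python
def to_si(pressure_mpa, temp_c):
    """Convert (MPa, deg C) to the (Pa, K) units that PropsSI expects."""
    return pressure_mpa * 1e6, temp_c + 273.15


def reference_enthalpy_kj_per_kg(pressure_mpa, temp_c, fluid="Water"):
    """Look up specific enthalpy in kJ/kg (hypothetical helper).

    CoolProp is imported lazily so the unit helper works without it.
    """
    from CoolProp.CoolProp import PropsSI  # third-party: pip install CoolProp

    p_pa, t_k = to_si(pressure_mpa, temp_c)
    # "H" is mass-specific enthalpy in J/kg; divide by 1000 for kJ/kg.
    return PropsSI("H", "P", p_pa, "T", t_k, fluid) / 1000.0
```

With CoolProp installed, reference_enthalpy_kj_per_kg(5.0, 400.0) would return the reference value for the Tier 1 example; passing fluid="R134a" would do the same for a refrigerant state.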

Leaderboard (accuracy by tier, ranked by composite score):

1. Claude Opus 4.6: Tier 1: 96.4%, Tier 2: 92.1%, Tier 3: 93.6%, Composite: 94.1%
2. GPT-5.4: Tier 1: 97.8%, Tier 2: 90.8%, Tier 3: 89.7%, Composite: 93.1%
3. Gemini 3.1 Pro: Tier 1: 97.9%, Tier 2: 90.8%, Tier 3: 87.5%, Composite: 92.5%
4. DeepSeek-R1: Tier 1: 90.5%, Tier 2: 89.2%, Tier 3: 81.0%, Composite: 87.4%
5. Grok 4: Tier 1: 91.8%, Tier 2: 87.9%, Tier 3: 80.4%, Composite: 87.3%
6. MiniMax M2.5: Tier 1: 85.2%, Tier 2: 76.2%, Tier 3: 52.7%, Composite: 73.0%
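
The composite column is consistent with a question-count-weighted mean of the three tier scores, although the formula is not stated here. The sketch below treats that weighting as an assumption, recomputes the composite for each model, and derives per-tier ranks from the same numbers. The names TIER_SIZES, SCORES, composite, and tier_rank are illustrative.

```python
TIER_SIZES = (110, 101, 82)  # questions in Tier 1, Tier 2, Tier 3

# Per-tier accuracy (%) copied from the leaderboard above.
SCORES = {
    "Claude Opus 4.6": (96.4, 92.1, 93.6),
    "GPT-5.4": (97.8, 90.8, 89.7),
    "Gemini 3.1 Pro": (97.9, 90.8, 87.5),
    "DeepSeek-R1": (90.5, 89.2, 81.0),
    "Grok 4": (91.8, 87.9, 80.4),
    "MiniMax M2.5": (85.2, 76.2, 52.7),
}


def composite(tier_scores, sizes=TIER_SIZES):
    """Question-count-weighted mean of tier scores (assumed formula)."""
    weighted = sum(score * n for score, n in zip(tier_scores, sizes))
    return round(weighted / sum(sizes), 1)


def tier_rank(model, tier_index, scores=SCORES):
    """1-based rank of a model within one tier; tied models share a rank."""
    own = scores[model][tier_index]
    return 1 + sum(1 for s in scores.values() if s[tier_index] > own)


for name, tiers in SCORES.items():
    ranks = [tier_rank(name, i) for i in range(3)]
    print(f"{name}: composite {composite(tiers)}%, tier ranks {ranks}")
```

Under this assumption the recomputed composites match the listed column to one decimal place.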

Key findings:

Rankings flip between tiers: Gemini leads Tier 1 (97.9%) but drops to #3 on Tier 3 (87.5%). Opus is #3 on lookups but #1 on cycle analysis, which suggests that memorizing steam tables is not the same as reasoning.
Supercritical water breaks everything: scores there show a 44.5 percentage point spread. Models appear to memorize textbook tables but cannot handle the strongly nonlinear region near the critical point. One model gave h = 1,887 kJ/kg where the correct value is 2,586 kJ/kg, a 27% error.
R-134a is the blind spot: all models collapse to 44–63% on refrigerant problems versus 75–98% on water, which points to a training data bias.
Run-to-run consistency varies by more than 10×: GPT-5.4 has σ = ±0.1% on Tier 3, versus σ = ±2.5% for DeepSeek-R1 on Tier 2.
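
Two of the quantities quoted in the findings, relative error and run-to-run σ, can be sketched in a few lines. The benchmark's actual scoring tolerance and its choice of sample versus population standard deviation are not given here, so tolerance=0.02 and the use of statistics.stdev are assumptions, and the run scores in the last usage line are made-up placeholders rather than benchmark data.

```python
import statistics


def relative_error(predicted, reference):
    """Absolute difference as a fraction of the reference value."""
    return abs(predicted - reference) / abs(reference)


def is_correct(predicted, reference, tolerance=0.02):
    """Accept an answer within a relative tolerance (hypothetical 2% default)."""
    return relative_error(predicted, reference) <= tolerance


def run_to_run_sigma(run_scores):
    """Sample standard deviation of accuracy across repeated runs (assumed)."""
    return statistics.stdev(run_scores)


# Supercritical example from the findings: 1,887 vs 2,586 kJ/kg.
err = relative_error(1887.0, 2586.0)
print(f"relative error: {err:.1%}")  # prints "relative error: 27.0%"
print("accepted:", is_correct(1887.0, 2586.0))  # prints "accepted: False"

# Placeholder accuracies (%) for three runs of one model on one tier.
print("sigma:", run_to_run_sigma([80.0, 82.0, 84.0]))  # prints "sigma: 2.0"
```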