AI Carb Counting Fails Reproducibility: 27K Queries Show 429g Spread on One Photo

✍️ OpenClawRadar 📅 Published: April 29, 2026

A newly published preprint tested four AI models (OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro, and Google Gemini 3.1 Pro) on a simple task: estimate carbohydrates from photos of food. Each model received the same 13 photos with the same prompt and the same settings, repeated 500+ times per photo (26,904 total queries). The results show that, even at the lowest randomness setting, reproducibility varies widely from model to model.

Key Findings

  • Worst-case spread: Gemini 2.5 Pro’s estimates for a single paella photo ranged from 55g to 484g, a 429g difference. At a 1:10 insulin-to-carb ratio (1 unit per 10g of carbohydrate), that is 42.9 units of insulin, a potentially fatal dosing error.
  • Median coefficient of variation (CV): Claude 2.4%, GPT-5.4 8.4%, Gemini 3.1 Pro 10.3%, Gemini 2.5 Pro 11.0%.
  • Median insulin swing: Claude 0.9U, GPT-5.4 2.3U, Gemini 3.1 Pro 2.9U, Gemini 2.5 Pro 4.7U.
  • Worst-case insulin swing: Claude 13.6U, GPT-5.4 16.6U, Gemini 3.1 Pro 16.2U, Gemini 2.5 Pro 42.9U.
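
The spread and insulin-swing figures above follow from simple arithmetic, which the sketch below reproduces. It is a minimal illustration, not the preprint's analysis code: the function name `spread_stats`, the use of the sample standard deviation for CV, and every sample value other than the 55g and 484g endpoints are assumptions made for the example.

```python
import statistics

def spread_stats(estimates_g, carb_ratio_g_per_unit=10.0):
    """Summarise repeated carb estimates (in grams) for one photo.

    Hypothetical helper: the preprint's exact CV definition is not given in
    the article, so the sample standard deviation is assumed here.
    """
    lo, hi = min(estimates_g), max(estimates_g)
    mean = statistics.fmean(estimates_g)
    # Coefficient of variation: standard deviation as a percentage of the mean.
    cv_pct = 100.0 * statistics.stdev(estimates_g) / mean
    spread_g = hi - lo
    return {
        "min_g": lo,
        "max_g": hi,
        "spread_g": spread_g,
        "cv_pct": cv_pct,
        # Insulin swing: the spread converted to units at the given ratio.
        "insulin_swing_u": spread_g / carb_ratio_g_per_unit,
    }

# Only the endpoints (55 g and 484 g) come from the article;
# the values in between are made up for illustration.
paella = [55, 120, 180, 210, 260, 484]
stats = spread_stats(paella)
print(stats["spread_g"], stats["insulin_swing_u"])  # 429 42.9
```

At a 1:10 ratio the insulin swing is just the gram spread divided by ten, so the danger scales directly with how far apart the highest and lowest estimates are.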

The “Precisely Wrong” Problem

Three models (Claude, Gemini 2.5 Pro, Gemini 3.1 Pro) independently converged on ~28g for a cheese sandwich whose reference value is 40g (packet label: 20g per slice of bread, two slices). Claude showed just 0.3% CV across 510 queries, yet every query was about 12g low, a consistent underdose of ~1.2U at a 1:10 ratio. GPT-5.4 swung the other way, averaging ~74g with high variability.
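
The distinction between precision and accuracy can be made concrete with the sandwich numbers. This is a rough sketch: `bias_report` is a hypothetical helper, the 1:10 ratio is the one used elsewhere in the article, and the inputs are the approximate means quoted above (~28g and ~74g against a 40g reference).

```python
def bias_report(mean_estimate_g, reference_g, carb_ratio_g_per_unit=10.0):
    """Return (carb error in grams, dose error in units) for a mean estimate.

    Negative values mean the model underestimates, which would lead to
    an insulin underdose; positive values mean an overdose.
    """
    error_g = mean_estimate_g - reference_g
    return error_g, error_g / carb_ratio_g_per_unit

# Approximate means from the article; the reference is 2 slices x 20 g.
claude_error_g, claude_error_u = bias_report(28.0, 40.0)
gpt_error_g, gpt_error_u = bias_report(74.0, 40.0)

print(claude_error_g, claude_error_u)  # -12.0 -1.2
print(gpt_error_g, gpt_error_u)        # 34.0 3.4
```

A low CV says nothing about the sign or size of this error: a model can return nearly the same number every time and still be consistently off.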

Food Identification Errors

  • Bakewell tart: Claude called it “Linzer torte” 100% of the time. GPT-5.4 called it “jam tart” or “cake bar.” Only Gemini 3.1 Pro correctly identified it (99.8%).
  • Crema catalana: Three of four models called it “crème brûlée” 100% of the time. Gemini 3.1 Pro got it right in only 3.4% of queries.
  • Cheese sandwich: Gemini 3.1 Pro hallucinated “deli meat” in 17.4% of queries — potentially inflating carb estimates.

Insulin Dosing Risk

On five images with strong reference values, Claude was the only model with zero queries in the “clinically significant” (2-5U error) or “severe hypo risk” (>5U error) zones: all of Claude’s queries landed in the safe or moderate zones. The other models produced dangerous outliers on every image.
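
A zone classification of this kind could be sketched as follows. The 2U and 5U thresholds come from the article; the boundary between "safe" and "moderate" is not stated there, so the 1U value used here (`moderate_from_u`) is purely an assumption, as is the function name. The sketch also uses the absolute error, whereas the article's "severe hypo risk" label suggests the preprint may treat overestimates and underestimates differently.

```python
def dose_error_zone(estimate_g, reference_g,
                    carb_ratio_g_per_unit=10.0, moderate_from_u=1.0):
    """Classify one carb estimate by the size of the resulting dose error.

    moderate_from_u is an assumed threshold; the article only gives the
    2-5U and >5U bands.
    """
    error_u = abs(estimate_g - reference_g) / carb_ratio_g_per_unit
    if error_u > 5.0:
        return "severe"
    if error_u >= 2.0:
        return "clinically significant"
    if error_u >= moderate_from_u:
        return "moderate"
    return "safe"

# 40 g is the sandwich reference from the article; the 45 g and 100 g
# estimates are invented to exercise the other zones.
zones = [dose_error_zone(g, 40.0) for g in (45.0, 28.0, 74.0, 100.0)]
print(zones)
# ['safe', 'moderate', 'clinically significant', 'severe']
```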

Bottom line: a single number from any AI carb-counting app gives users no visibility into the underlying distribution of estimates. High consistency (Claude) does not guarantee accuracy. Low consistency (Gemini) can produce any result. Production systems must account for this variance.
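
One way an app might surface that distribution is to query the model several times and report a median with a range instead of a single figure. The sketch below assumes a hypothetical `estimate_once` callable that wraps one model query and returns grams; no real vendor API is shown, and the sample values are made up.

```python
import statistics
from typing import Callable

def estimate_with_range(estimate_once: Callable[[], float], n: int = 5):
    """Call a carb estimator n times and summarise the results.

    estimate_once is a stand-in for one model query on one photo.
    """
    samples = sorted(estimate_once() for _ in range(n))
    return {
        "median_g": statistics.median(samples),
        "min_g": samples[0],
        "max_g": samples[-1],
        "spread_g": samples[-1] - samples[0],
    }

# Fake estimator that replays made-up values, standing in for a model call.
fake_values = iter([30, 28, 55, 29, 31])
result = estimate_with_range(lambda: next(fake_values), n=5)
print(result)
# {'median_g': 30, 'min_g': 28, 'max_g': 55, 'spread_g': 27}
```

Repeated sampling exposes variance but not bias: a model that is consistently 12g low, as in the cheese sandwich case, would show a narrow range here and still be wrong.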

📖 Read the full source: HN AI Agents
