AI Carb Counting Fails Reproducibility: 27K Queries Show 429g Spread on One Photo

A newly published preprint tested four AI models (OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro, and Google Gemini 3.1 Pro) on a simple task: estimating carbohydrates from photos of food. Each model saw the same 13 photos with the same prompt and the same settings, repeated 500+ times per photo per model (26,904 total queries). The results show that even at the lowest randomness setting, repeated estimates for the same photo do not agree, and the size of the disagreement differs sharply between models.
Key Findings
- Worst-case spread: Gemini 2.5 Pro’s estimates for a single paella photo ranged from 55g to 484g, a 429g difference. At a 1:10 insulin-to-carb ratio, that is a 42.9-unit difference in insulin dose, large enough to be potentially fatal.
- Median coefficient of variation (CV): Claude 2.4%, GPT-5.4 8.4%, Gemini 3.1 Pro 10.3%, Gemini 2.5 Pro 11.0%.
- Median insulin swing: Claude 0.9U, GPT-5.4 2.3U, Gemini 3.1 Pro 2.9U, Gemini 2.5 Pro 4.7U.
- Worst-case insulin swing: Claude 13.6U, GPT-5.4 16.6U, Gemini 3.1 Pro 16.2U, Gemini 2.5 Pro 42.9U.
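The spread figures above can be reproduced with a few lines of arithmetic. The sketch below is a minimal illustration, not the preprint's code: it assumes CV is the sample standard deviation divided by the mean, and that "insulin swing" is the range of estimates divided by the insulin-to-carb ratio (which matches the 429g and 42.9U figures). The function name `spread_stats` and the two middle values in the sample list are made up for illustration; only the 55g and 484g endpoints come from the article.

```python
from statistics import mean, stdev

def spread_stats(estimates_g, icr_g_per_unit=10.0):
    """Summarise repeated carb estimates (in grams) for one photo.

    Assumes CV = sample standard deviation / mean, and
    insulin swing = (max - min) / insulin-to-carb ratio.
    """
    lo, hi = min(estimates_g), max(estimates_g)
    avg = mean(estimates_g)
    return {
        "range_g": hi - lo,
        "cv_pct": 100.0 * stdev(estimates_g) / avg,
        "insulin_swing_u": (hi - lo) / icr_g_per_unit,
    }

# 55 and 484 are the reported extremes; 180 and 210 are invented filler.
stats = spread_stats([55, 180, 210, 484])
print(stats["range_g"], round(stats["insulin_swing_u"], 1))
```

With a 1:10 ratio, every 10g of disagreement between two runs translates into one unit of insulin, which is why the range matters more to a user than the average.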
The “Precisely Wrong” Problem
Three models (Claude, Gemini 2.5 Pro, Gemini 3.1 Pro) independently converged on ~28g for a cheese sandwich whose reference value is 40g (packet label: 20g per slice of bread). Claude showed just 0.3% CV across 510 queries, yet every one of those queries came in about 12g low, a consistent underdose of ~1.2U at a 1:10 ratio. GPT-5.4 erred in the other direction, averaging ~74g with high variability.
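The distinction here is between precision (low CV) and accuracy (low bias). A minimal sketch, assuming bias is simply the mean estimate minus the reference value; the sample values are invented to mimic a tight cluster around 28g and are not data from the preprint, and `bias_and_cv` is a hypothetical helper name.

```python
from statistics import mean, stdev

def bias_and_cv(estimates_g, reference_g):
    """Return (bias in grams, CV in percent) for repeated estimates."""
    avg = mean(estimates_g)
    bias_g = avg - reference_g                  # negative means underestimate
    cv_pct = 100.0 * stdev(estimates_g) / avg   # spread relative to the mean
    return bias_g, cv_pct

# Invented values: tightly clustered, but far from the 40g reference.
tight_but_low = [28.0, 28.1, 27.9, 28.0]
bias_g, cv_pct = bias_and_cv(tight_but_low, reference_g=40.0)

# Convert the bias to insulin units at an assumed 1:10 ratio.
dose_error_u = bias_g / 10.0
print(round(bias_g, 1), round(cv_pct, 2), round(dose_error_u, 1))
```

A CV well under 1% alongside a 12g bias is exactly the "precisely wrong" pattern: repeating the query would never reveal the error.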
Food Identification Errors
- Bakewell tart: Claude called it “Linzer torte” 100% of the time. GPT-5.4 called it “jam tart” or “cake bar.” Only Gemini 3.1 Pro correctly identified it (99.8%).
- Crema catalana: Three of the four models called it “crème brûlée” 100% of the time. Gemini 3.1 Pro got it right in only 3.4% of queries.
- Cheese sandwich: Gemini 3.1 Pro hallucinated “deli meat” in 17.4% of queries — potentially inflating carb estimates.
Insulin Dosing Risk
On the five images with strong reference values, Claude was the only model with zero queries in the “clinically significant” (2-5U error) or “severe hypo risk” (>5U error) zones: all of its queries landed in the safe or moderate zones. Each of the other models produced dangerous outliers on every image.
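The zone classification can be sketched as a simple threshold function. The 2-5U and >5U boundaries come from the text above; the boundary between "safe" and "moderate" is not given, so the 1U cut-off below is purely an assumption, as are the function names and the use of absolute error at a 1:10 ratio.

```python
def dose_error_units(estimate_g, reference_g, icr_g_per_unit=10.0):
    """Insulin dose error implied by a carb estimate (positive = overdose)."""
    return (estimate_g - reference_g) / icr_g_per_unit

def risk_zone(error_units, moderate_from=1.0):
    """Map a dose error to a zone.

    2-5U and >5U follow the article; `moderate_from` is a
    hypothetical threshold, since the source does not state one.
    """
    magnitude = abs(error_units)
    if magnitude > 5.0:
        return "severe"
    if magnitude >= 2.0:
        return "clinically significant"
    if magnitude >= moderate_from:
        return "moderate"
    return "safe"

# Cheese sandwich, reference 40g: ~28g and ~74g are the averages quoted earlier.
print(risk_zone(dose_error_units(28, 40)))   # 1.2U under
print(risk_zone(dose_error_units(74, 40)))   # 3.4U over
```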
Bottom line: a single number from any AI carb-counting app gives users no visibility into the underlying distribution of estimates. High consistency (Claude) does not guarantee accuracy, and low consistency (Gemini) means a single query can land anywhere in a very wide range. Production systems must account for this variance.
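One way a production system might account for this is to query several times and report the distribution rather than a single value. The sketch below is only an illustration under stated assumptions: `estimate_once` stands in for a real model call, the 20g review threshold is arbitrary, and the sample responses are invented (apart from reusing the 55g and 484g extremes). Note that repeated sampling exposes inconsistency but cannot detect a consistent bias.

```python
from statistics import median
from typing import Callable

def summarize_repeated_estimates(
    estimate_once: Callable[[], float],
    n: int = 10,
    max_range_g: float = 20.0,
):
    """Call the estimator n times and report median, range and a review flag."""
    samples = [estimate_once() for _ in range(n)]
    lo, hi = min(samples), max(samples)
    return {
        "median_g": median(samples),
        "min_g": lo,
        "max_g": hi,
        # Flag the result when the runs disagree by more than the threshold.
        "needs_review": (hi - lo) > max_range_g,
    }

# Stand-in for a real model: replays a fixed list of invented responses.
fake_responses = iter([55.0, 120.0, 95.0, 484.0, 110.0])
summary = summarize_repeated_estimates(lambda: next(fake_responses), n=5)
print(summary)
```

Showing the minimum and maximum alongside the median lets the user see when the model is guessing, instead of presenting one run as if it were a measurement.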
📖 Read the full source: HN AI Agents