Claude vs GPT-4o: Same Double Pendulum Prompt, Different Coordinate Conventions

A Reddit user ran the same double pendulum prompt through Claude and GPT-4o side by side using a shared host renderer and saw two completely different physical systems within seconds. The cause: each model chose a different convention for measuring theta.
Claude measured theta from the upward vertical (theta=0 means the arm points straight up), while GPT-4o measured it from the downward vertical (theta=0 means the arm hangs straight down). The host renderer in public/workers/simulator-host.js simply reads info.theta1 and info.theta2 and draws the arms from those values, identically for both panels, so the renderer adds no cosmetic differences of its own. The visual mismatch is therefore a real physics mismatch.
Both conventions are valid. Most classical mechanics textbooks measure theta from the downward vertical because that puts the stable equilibrium at theta=0, which is the point the small-angle approximation expands around. Measuring theta from the upward vertical is also standard in many references. Claude applied its convention consistently across the equations of motion, the initial conditions, and the integrator (Runge-Kutta). GPT-4o used the other convention silently, without commenting on its choice.
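To make the two conventions concrete, here is a minimal sketch of how one numeric angle maps to two different arm positions. The helper names (tipFromDown, tipFromUp, upToDown) are hypothetical and are not taken from the thread or from the host renderer, whose drawing code is not shown. The sketch assumes math coordinates with +y pointing up, the pivot at the origin, and both angles increasing counterclockwise.

```typescript
// Hypothetical helpers for illustration; not the actual renderer code.
type Point = { x: number; y: number };

// Convention A: theta = 0 means the arm hangs straight down.
function tipFromDown(theta: number, length: number): Point {
  return { x: length * Math.sin(theta), y: -length * Math.cos(theta) };
}

// Convention B: theta = 0 means the arm points straight up.
function tipFromUp(theta: number, length: number): Point {
  return { x: -length * Math.sin(theta), y: length * Math.cos(theta) };
}

// With the same rotational sense, the two conventions differ by half a turn.
function upToDown(thetaUp: number): number {
  return thetaUp + Math.PI;
}

const sampleTheta = 0.3; // radians
const readAsDown = tipFromDown(sampleTheta, 1); // bob below the pivot
const readAsUp = tipFromUp(sampleTheta, 1);     // bob above the pivot
const convertedTip = tipFromDown(upToDown(sampleTheta), 1); // same point as readAsUp
```

A renderer that assumes one convention will draw a simulator written in the other as if it were rotated by half a turn, unless the angle is converted first as upToDown does here.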
The user was working on Physics Bench, an open-source side-by-side benchmark where every model gets the same generation contract, defined in lib/prompt.ts: a function createSimulator(...). The host owns all rendering. Models implement only step, getInfo, and reset, and never touch draw. Any visual difference between panels therefore comes from a real difference in simulation logic, not from rendering choices.
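A rough sketch of what a simulator written against such a contract could look like follows. The post elides the real createSimulator signature, so everything here is an assumption: the function name createSimulatorSketch, its parameters, the interface shapes, and the unit masses and lengths. The equations of motion are the standard double pendulum equations with both angles measured from the downward vertical, integrated with classical fourth-order Runge-Kutta.

```typescript
// Hypothetical contract shape; the real one in lib/prompt.ts is not shown in the post.
interface SimInfo { theta1: number; theta2: number }
interface Simulator {
  step(dt: number): void;
  getInfo(): SimInfo;
  reset(): void;
}

type State = [number, number, number, number]; // [theta1, omega1, theta2, omega2]

// Assumed parameters: unit masses and lengths, g in m/s^2.
const G = 9.81, M1 = 1, M2 = 1, L1 = 1, L2 = 1;

// Time derivative of the state; angles are measured from the DOWN vertical.
function derivative([t1, w1, t2, w2]: State): State {
  const d = t1 - t2;
  const den = 2 * M1 + M2 - M2 * Math.cos(2 * d);
  const a1 = (-G * (2 * M1 + M2) * Math.sin(t1) - M2 * G * Math.sin(t1 - 2 * t2)
    - 2 * Math.sin(d) * M2 * (w2 * w2 * L2 + w1 * w1 * L1 * Math.cos(d))) / (L1 * den);
  const a2 = (2 * Math.sin(d) * (w1 * w1 * L1 * (M1 + M2)
    + G * (M1 + M2) * Math.cos(t1) + w2 * w2 * L2 * M2 * Math.cos(d))) / (L2 * den);
  return [w1, a1, w2, a2];
}

// Component-wise s + h * k.
function advance(s: State, k: State, h: number): State {
  return [s[0] + h * k[0], s[1] + h * k[1], s[2] + h * k[2], s[3] + h * k[3]];
}

// One classical fourth-order Runge-Kutta step.
function rk4(s: State, dt: number): State {
  const k1 = derivative(s);
  const k2 = derivative(advance(s, k1, dt / 2));
  const k3 = derivative(advance(s, k2, dt / 2));
  const k4 = derivative(advance(s, k3, dt));
  const c = (i: number): number =>
    s[i] + (dt / 6) * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]);
  return [c(0), c(1), c(2), c(3)];
}

function createSimulatorSketch(theta1: number, theta2: number): Simulator {
  const fresh = (): State => [theta1, 0, theta2, 0]; // start at rest
  let state: State = fresh();
  return {
    step(dt: number): void { state = rk4(state, dt); },
    getInfo(): SimInfo { return { theta1: state[0], theta2: state[2] }; },
    reset(): void { state = fresh(); },
  };
}
```

Note that nothing in the returned SimInfo says which vertical the angles refer to. That information lives only in the comment on derivative, which is exactly the kind of unstated assumption a shared renderer cannot see.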
A unit test of the math would not have caught this. Both models produce correct physics for their chosen conventions. The split only shows up when both simulations are rendered next to each other through the same drawing code. The practical lesson: state the coordinate convention explicitly in the prompt whenever the output is consumed by a fixed renderer.
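The point about unit tests can be illustrated with a single pendulum, which is enough to show the effect. This is a sketch under stated assumptions, not code from the thread: the names, the unit length, and the energy-drift check standing in for a "unit test of the math" are all hypothetical. Each convention conserves its own energy, yet the same initial value theta=0.1 produces a gentle swing in one and a fall in the other.

```typescript
type Pendulum = [number, number]; // [theta, omega]
const GRAVITY = 9.81; // m/s^2
const LENGTH = 1;     // m, assumed

// Angular acceleration under each convention.
const accelDown = (theta: number): number => -(GRAVITY / LENGTH) * Math.sin(theta); // 0 = hanging down
const accelUp = (theta: number): number => (GRAVITY / LENGTH) * Math.sin(theta);    // 0 = pointing up

// Energy per unit mass, written in each convention's own variables.
const energyDown = ([th, w]: Pendulum): number =>
  0.5 * LENGTH * LENGTH * w * w - GRAVITY * LENGTH * Math.cos(th);
const energyUp = ([th, w]: Pendulum): number =>
  0.5 * LENGTH * LENGTH * w * w + GRAVITY * LENGTH * Math.cos(th);

// One classical fourth-order Runge-Kutta step for [theta, omega].
function rk4Step(s: Pendulum, accel: (theta: number) => number, dt: number): Pendulum {
  const f = ([th, w]: Pendulum): Pendulum => [w, accel(th)];
  const add = (a: Pendulum, k: Pendulum, h: number): Pendulum =>
    [a[0] + h * k[0], a[1] + h * k[1]];
  const k1 = f(s);
  const k2 = f(add(s, k1, dt / 2));
  const k3 = f(add(s, k2, dt / 2));
  const k4 = f(add(s, k3, dt));
  return [
    s[0] + (dt / 6) * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
    s[1] + (dt / 6) * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]),
  ];
}

// Integrate from rest and report the final angle and the absolute energy drift.
function run(
  accel: (theta: number) => number,
  energy: (s: Pendulum) => number,
  theta0: number,
  seconds: number,
): { theta: number; drift: number } {
  const dt = 0.001;
  let s: Pendulum = [theta0, 0];
  const e0 = energy(s);
  const steps = Math.round(seconds / dt);
  for (let i = 0; i < steps; i++) s = rk4Step(s, accel, dt);
  return { theta: s[0], drift: Math.abs(energy(s) - e0) };
}

const downRun = run(accelDown, energyDown, 0.1, 1);
const upRun = run(accelUp, energyUp, 0.1, 1);
```

Both runs keep their energy drift tiny, so a per-model energy test passes in each case. Only comparing downRun.theta with upRun.theta, which is what a shared renderer effectively does, reveals that the two simulations describe different physical situations.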
See the full Reddit thread for code snippets and conversation inspector details.
📖 Read the full source: r/ClaudeAI