Opus 4.6 excels at research, Gemini 3.1 Pro has better judgment in forecasting benchmark

A Reddit user posted results from a benchmark comparing four frontier models — Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Grok 4.20 — on 1,417 binary forecasting questions from October–December 2025. The key innovation is decomposing performance into two evaluation conditions: agentic (each model performs its own web research using tools) and fixed-evidence (all models receive the same ~12,000-character research dossier compiled via the Bosse et al. 2026 standardization methodology).
Key findings
- Opus 4.6 performs dramatically better in the agentic condition: it is better at figuring out what to search for, deciding which pages to read, and extracting relevant details. In the fixed-evidence condition, where its own research is taken out of the picture, that advantage disappears.
- Gemini 3.1 Pro delivers sharper judgment on fixed evidence: it weights the supplied information more accurately on forecasting tasks. Its calibration improves when it is given the standardized dossier, while Opus's calibration drops sharply.
- GPT-5.4 and Grok 4.20 barely change between conditions, suggesting their performance depends less on search strategy.
- The rank order of Opus and Gemini swaps between conditions, which the poster argues is evidence that the evaluation is not simply broken or biased (a biased evaluation would likely move all models in the same direction).
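The last point can be made concrete with a small check. The following is a minimal sketch, not the poster's analysis code: the model names and scores are placeholders invented for illustration (they are not benchmark results), and it assumes a lower-is-better score such as a Brier score. The helper names `condition_deltas`, `rank_order`, and `moves_in_one_direction` are hypothetical.

```python
# Placeholder per-condition scores (lower is better).
# ASSUMPTION: these numbers are made up for illustration only;
# they are NOT results from the benchmark described above.
scores = {
    "model_a": {"agentic": 0.15, "fixed": 0.19},
    "model_b": {"agentic": 0.18, "fixed": 0.16},
    "model_c": {"agentic": 0.20, "fixed": 0.20},
}


def condition_deltas(scores):
    """Return the fixed-minus-agentic score change for each model."""
    return {m: s["fixed"] - s["agentic"] for m, s in scores.items()}


def rank_order(scores, condition):
    """Return model names sorted from best (lowest score) to worst."""
    return sorted(scores, key=lambda m: scores[m][condition])


def moves_in_one_direction(deltas, tol=0.005):
    """True if every non-negligible change has the same sign.

    Changes smaller than `tol` in absolute value are treated as "no change".
    """
    signs = {1 if d > 0 else -1 for d in deltas.values() if abs(d) > tol}
    return len(signs) <= 1


deltas = condition_deltas(scores)
print(rank_order(scores, "agentic"))
print(rank_order(scores, "fixed"))
print(moves_in_one_direction(deltas))
```

With these placeholder values the top two models trade places and the deltas have mixed signs, which is the pattern the poster treats as evidence against a uniform bias in the fixed-evidence setup. A check like this cannot rule out a bias that happens to favor one model over another.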
Interpretation
The asymmetry in calibration (Opus's drops when search is removed, while Gemini's improves) suggests Opus may be using its search trace as scaffolding for probability assignment. In other words, the act of running the search loop may itself do some of the epistemic work, separate from the information it surfaces. If the result holds up, it has implications for how AI research agents are evaluated and designed.
Limitations and resources
The fixed-evidence dossiers are themselves LLM-produced, so the test may measure how well each model interprets one particular standardized version of the evidence rather than judgment in the abstract. The poster notes this as a limitation but argues that the divergent behavior across models reduces the concern.
Full calibration scores, refinement scores, and per-condition analysis are available at: futuresearch.ai/opus-research-gemini-judgment. The benchmark and leaderboard are at: evals.futuresearch.ai.
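For readers unfamiliar with the terms, here is a minimal sketch of one standard way to split a forecasting score into calibration and refinement. It assumes the Brier score and the classic calibration-refinement decomposition; the summary above does not say which scoring rule or decomposition the benchmark uses, so the linked analysis may define these scores differently. The function name `calibration_refinement` and the example data are hypothetical.

```python
from collections import defaultdict


def calibration_refinement(forecasts, outcomes):
    """Split the mean Brier score into calibration + refinement.

    ASSUMPTION: Brier scoring with forecasts grouped by their exact value,
    so brier == calibration + refinement up to floating-point rounding.
    Lower is better for all three numbers.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")

    # Collect the binary outcomes observed for each distinct forecast value.
    groups = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        groups[p].append(o)

    n = len(forecasts)
    calibration = 0.0
    refinement = 0.0
    for p, observed in groups.items():
        freq = sum(observed) / len(observed)   # how often the event happened
        weight = len(observed) / n             # share of questions in this group
        calibration += weight * (p - freq) ** 2    # stated vs. observed frequency
        refinement += weight * freq * (1 - freq)   # leftover outcome variance

    brier = sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / n
    return brier, calibration, refinement


# Made-up example data: eight questions, two distinct forecast values.
forecasts = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
outcomes = [1, 1, 1, 0, 0, 0, 0, 1]
brier, calibration, refinement = calibration_refinement(forecasts, outcomes)
print(brier, calibration, refinement)
```

Under this decomposition, the calibration term penalizes stated probabilities that differ from observed frequencies, while the refinement term measures how well the forecasts separate events that happen from events that do not. With real-valued forecasts that rarely repeat, a practical version would group forecasts into bins, at which point the sum is only approximate.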
To the poster's knowledge, this is the first direct evaluation of frontier models that decomposes performance into research vs. judgment stages. They invite replication in other domains.
📖 Read the full source: r/ClaudeAI