Bullshit Benchmark Tests LLM Resistance to Nonsensical Prompts

What the Bullshit Benchmark Measures
The Bullshit Benchmark tests whether large language models (LLMs) identify and push back on nonsensical prompts instead of confidently answering them. It measures how readily a model goes along with obvious nonsense, which speaks to the concern that models induce their own hallucinations by trying to be helpful rather than calling out a problematic prompt.
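To make the idea of "measuring pushback" concrete, here is a minimal sketch of how a per-model pushback rate could be computed once each response has been graded. It is not the benchmark's actual scoring code: the label names, the `pushback_rate` function, and the model names and records are all hypothetical placeholders, and the real benchmark may grade responses differently.

```python
from collections import defaultdict

# Hypothetical labels; the real benchmark's grading scheme may differ.
PUSHBACK = "pushback"  # the model flagged the prompt as nonsensical
COMPLIED = "complied"  # the model answered the nonsense as if it were valid


def pushback_rate(records):
    """Return {model: fraction of nonsense prompts the model pushed back on}.

    `records` is an iterable of (model_name, label) pairs, one per graded
    response to a nonsensical prompt.
    """
    counts = defaultdict(lambda: [0, 0])  # model -> [pushbacks, total]
    for model, label in records:
        if label not in (PUSHBACK, COMPLIED):
            raise ValueError(f"unknown label: {label!r}")
        counts[model][1] += 1
        if label == PUSHBACK:
            counts[model][0] += 1
    return {model: hits / total for model, (hits, total) in counts.items()}


# Made-up placeholder data, not real benchmark results.
records = [
    ("model-a", PUSHBACK),
    ("model-a", PUSHBACK),
    ("model-a", COMPLIED),
    ("model-b", COMPLIED),
    ("model-b", PUSHBACK),
    ("model-b", COMPLIED),
    ("model-b", COMPLIED),
]
rates = pushback_rate(records)
```

A higher rate means the model more often refused to play along with nonsense. The hard part in practice is the grading step that produces the labels, which this sketch deliberately leaves out.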
Key Benchmark Results
According to the source, Claude models detect nonsense significantly better than Gemini models, which supports the intuition that Claude is stronger at this specific capability.
In one example from the benchmark, Claude identified a nonsense question that Gemini missed. Gemini 3.1 Pro failed to detect the obvious nonsense even with high thinking effort enabled and generated a nonsense answer instead.
The source suggests that Anthropic's post-training approach contributes to Claude's better performance. Its reasoning is that LLMs naturally tend toward superficial associative thinking, which generates spurious relationships between concepts, and that Anthropic appears to have addressed this tendency in its post-training pipeline.
Why This Matters for AI Coding Agents
For developers using AI coding assistants, a model's ability to recognize nonsense prompts is crucial. When models confidently answer nonsensical questions instead of pushing back, they can misguide users and generate incorrect code or explanations. This benchmark provides a concrete way to evaluate this specific safety behavior across different models.
You can view the complete benchmark results at https://petergpt.github.io/bullshit-benchmark/viewer/index.html.
📖 Read the full source: r/ClaudeAI