EsoLang-Bench: Esoteric Language Coding Benchmark Tests LLM Reasoning

EsoLang-Bench is a new coding benchmark designed to test whether large language models can genuinely reason through problems or are simply pattern-matching against training data. To separate the two, it uses esoteric programming languages that barely appear in training data.

Benchmark Design

The benchmark covers five esoteric programming languages: Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare. These languages were chosen because they are almost absent from typical pretraining corpora. The problems are the same algorithmic tasks as HumanEval, across the same difficulty range, translated to these esoteric languages.
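To make the setup concrete, here is a minimal sketch of how a submission in one of these languages could be graded against input/output cases. It is not the benchmark's actual harness: the function names, the step limit, the 30,000-cell tape, and the sample "add two digits" task are all assumptions made for illustration.

```python
TAPE_SIZE = 30_000  # conventional Brainfuck tape length (an assumption here)

def run_brainfuck(source: str, stdin: str = "", max_steps: int = 100_000) -> str:
    """Interpret a Brainfuck program and return everything it prints."""
    # Match brackets up front; unbalanced brackets count as a syntax error.
    jumps, stack = {}, []
    for pos, ch in enumerate(source):
        if ch == "[":
            stack.append(pos)
        elif ch == "]":
            if not stack:
                raise SyntaxError(f"unmatched ']' at position {pos}")
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start
    if stack:
        raise SyntaxError(f"unmatched '[' at position {stack[-1]}")

    tape, ptr, pc, steps = [0] * TAPE_SIZE, 0, 0, 0
    inp, out = iter(stdin), []
    while pc < len(source):
        steps += 1
        if steps > max_steps:  # guard against non-terminating programs
            raise TimeoutError("step limit exceeded")
        ch = source[pc]
        if ch == ">":
            ptr = (ptr + 1) % TAPE_SIZE
        elif ch == "<":
            ptr = (ptr - 1) % TAPE_SIZE
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == ",":
            tape[ptr] = ord(next(inp, "\0")) % 256  # end of input reads as 0
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # jump back to the loop start
        pc += 1
    return "".join(out)

def passes_all(source: str, cases) -> bool:
    """True only if the program prints the expected output for every case."""
    for stdin, expected in cases:
        try:
            if run_brainfuck(source, stdin) != expected:
                return False
        except (SyntaxError, TimeoutError):
            return False
    return True

# Hypothetical task: read two digits and print their sum (sums below 10 only).
ADD_DIGITS = ",>,[<+>-]<" + "-" * 48 + "."
print(passes_all(ADD_DIGITS, [("23", "5"), ("40", "4")]))  # True
```

Even this trivial task needs an explicit loop and an ASCII offset correction, which hints at why problems that are easy in Python become hard once translated.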

Testing Methodology

Researchers tested five models: GPT-5.2, O4-mini, Gemini 3 Pro, Qwen3-235B, and Kimi K2. Each was run under five prompting strategies, including:

Self-scaffolding
Coder-critic pairs
ReAct pipeline
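
The source does not describe how these strategies were implemented. As a rough illustration of the coder-critic idea only, the sketch below alternates between a generator and a reviewer until the tests pass or a round budget runs out. Every name here (`coder_critic_loop`, `demo_coder`, `demo_critic`, `demo_run_tests`) is hypothetical, and the demo functions are stand-ins for real model calls and a real interpreter.

```python
from typing import Callable, Optional, Tuple

def coder_critic_loop(
    task: str,
    coder: Callable[[str, str], str],        # (task, feedback) -> program text
    critic: Callable[[str, str, str], str],  # (task, program, report) -> feedback
    run_tests: Callable[[str], Tuple[bool, str]],  # program -> (passed, report)
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first program that passes, or None if the budget is exhausted."""
    feedback = ""
    for _ in range(max_rounds):
        program = coder(task, feedback)
        passed, report = run_tests(program)
        if passed:
            return program
        # The critic turns the raw test report into guidance for the next draft.
        feedback = critic(task, program, report)
    return None

# Stand-ins for model calls: the first draft echoes its input,
# the revised draft increments it before printing.
def demo_coder(task: str, feedback: str) -> str:
    return ",+." if feedback else ",."

def demo_critic(task: str, program: str, report: str) -> str:
    return f"Tests failed: {report}"

# Stand-in for a real test harness that would execute the program.
def demo_run_tests(program: str) -> Tuple[bool, str]:
    ok = program == ",+."
    return (ok, "ok" if ok else "expected 'b' for input 'a', got 'a'")

result = coder_critic_loop(
    "print the character after the input character",
    demo_coder, demo_critic, demo_run_tests,
)
print(result)  # ,+.
```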

Results

The best single result was 11.2% on Befunge-98 with self-scaffolding. Medium, Hard, and Extra-Hard problems stayed at 0% across all models, languages, and strategies. Few-shot prompting improved scores by only 0.8 percentage points on average, which the researchers describe as statistically indistinguishable from noise.

Agentic systems like Claude Code and Codex performed 2-3x better than non-agentic approaches, but the improvement came primarily from sharper feedback loops and better context management rather than from any actual transfer of reasoning.

Error Analysis

The error breakdown differs sharply by language:

On Brainfuck (which has some online presence), models could produce valid syntax but failed on logic.
On Whitespace (which has almost no training data), models could not produce valid programs at all.

This shows a clear gap between models' performance on languages with some pretraining data and on those with essentially none.
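
The distinction between syntax failures and logic failures can be sketched for Brainfuck, where syntactic validity reduces to balanced brackets. The category names and the `classify_attempt` helper below are assumptions for illustration, not the benchmark's own error taxonomy; `produced` is the output a harness observed, or `None` if the run crashed or timed out.

```python
from collections import Counter
from typing import Optional

def brackets_balanced(source: str) -> bool:
    """A Brainfuck program is syntactically valid iff its brackets balance."""
    depth = 0
    for ch in source:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:  # a ']' appeared before its '['
                return False
    return depth == 0

def classify_attempt(source: str, produced: Optional[str], expected: str) -> str:
    """Bucket one attempt into a coarse, illustrative error category."""
    if not brackets_balanced(source):
        return "syntax_error"   # the program is not valid at all
    if produced is None:
        return "runtime_error"  # valid program that crashed or timed out
    return "pass" if produced == expected else "logic_error"

# Hypothetical attempts: (program, observed output, expected output).
attempts = [
    ("+[", None, "A"),    # unbalanced bracket
    (",.", "a", "b"),     # valid syntax, wrong answer
    (",+.", "b", "b"),    # correct
]
tally = Counter(classify_attempt(*attempt) for attempt in attempts)
print(tally["syntax_error"], tally["logic_error"], tally["pass"])  # 1 1 1
```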

Purpose and Availability

The benchmark aims to create evaluations where high scores are hard to fake, rather than simply posing harder problems in mainstream languages like Python. The researchers suggest that this approach removes the economic incentive to game the benchmark, leaving learning to generalize as the only route to good performance.

EsoLang-Bench is available as a template for others to build upon, whether through new languages, new problem types, or entirely different out-of-distribution domains.

📖 Read the full source: r/LocalLLaMA