How Small Model Evaluation Prompts Can Mislead and How to Fix Them

A detailed analysis on r/LocalLLaMA argues that evaluation prompts for small models (such as 7B or 12B parameter models) often produce misleading, overly optimistic scores that don't match actual output quality. According to the post, the core issue isn't model capability but how prompt wording activates different cognitive pathways in transformer architectures.
The Three Cognitive Modes of Transformers
The post identifies three functional pathways that models use based on prompt language:
- Dimension 1 (D1) — Factual Recall: Activated by questions like "What is...", "Define...", "When did...". The model retrieves knowledge stored during training. For evaluation tasks, this is mostly irrelevant.
- Dimension 2 (D2) — Application and Instruction Following: Activated by language like "Analyze...", "Classify...", "Apply these criteria...". The model applies explicit rules, follows structured instructions, and classifies inputs against provided criteria. This is the reliable pathway where small models are genuinely competent.
- Dimension 3 (D3) — Emotional and Empathic Inference: Activated by language like "How should this feel?", "What emotional response is appropriate?", "As an empathetic assistant...". The model infers unstated emotional context and makes normative judgments about how things "should" feel, routing through RLHF conditioning rather than evidence in the prompt. Small models are unreliable here, with bias consistently running positive and supportive regardless of actual content.
The Routing Insight
The key insight: "Analyze the emotional content" activates D2 (the model looks at text and classifies it), while "What should the user be feeling?" activates D3 (the model guesses what a helpful AI would say). These feel like equivalent questions but produce systematically different outputs.
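One way to act on this distinction is to scan an evaluation prompt for wording associated with each pathway before running it. The sketch below is a rough heuristic, not the author's method: the cue lists contain only the example phrases quoted in the post, and the names `D2_CUES`, `D3_CUES`, and `find_cues` are hypothetical.

```python
import re

# Illustrative cue phrases taken from the examples quoted in the post.
# They are not an exhaustive or validated taxonomy.
D2_CUES = [r"\banalyze\b", r"\bclassify\b", r"\bapply these criteria\b"]
D3_CUES = [
    r"\bhow should this feel\b",
    r"\bwhat emotional response is appropriate\b",
    r"\bempathetic\b",
    r"\bshould the user be feeling\b",
]


def find_cues(prompt: str) -> dict[str, list[str]]:
    """Return the D2 and D3 cue phrases found in a prompt (case-insensitive)."""
    text = prompt.lower()
    found: dict[str, list[str]] = {"D2": [], "D3": []}
    for label, patterns in (("D2", D2_CUES), ("D3", D3_CUES)):
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                found[label].append(match.group(0))
    return found
```

A prompt that returns a non-empty `"D3"` list is a candidate for rewording. Keyword matching like this can only flag phrasing for review; it cannot confirm how a given model actually responds.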
Concrete Failure Example
The author tested this empirically with a Mistral 7B sentiment analyzer for a conversational AI system. The original prompt (simplified):
    You are an empathetic AI companion analyzing emotional content.
    Analyze this message and return:
    {
      "tone": "warm, affectionate, grateful",
      "intensity": 0.0 to 1.0,
      "descriptors": ["example1", "example2"]
    }
What happened:
- Neutral messages returned a slightly positive tone.
- Mildly negative messages scored as neutral or lightly positive.
- Intensity values for negative content were consistently lower than those for equivalent positive content.
The author calls this systematic, reproducible bias "positive phantom drift": the model's RLHF conditioning pulls outputs toward supportive, positive responses regardless of the actual input content.
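A bias of this kind can be quantified if a set of messages has human-assigned labels to compare against. The following is a minimal sketch under that assumption; the three-way label set, the numeric mapping in `VALENCE`, and both function names are illustrative choices, not something the post specifies.

```python
from statistics import mean

# Hypothetical mapping from tone label to a signed valence score.
VALENCE = {"negative": -1, "neutral": 0, "positive": 1}


def mean_valence_shift(pairs):
    """Average signed difference between model and human tone labels.

    pairs: iterable of (human_label, model_label) tuples.
    A result above 0 means the model's labels lean more positive
    than the human labels on average.
    """
    return mean(VALENCE[model] - VALENCE[human] for human, model in pairs)


def intensity_gap(records):
    """Mean model intensity on positive messages minus that on negative ones.

    records: iterable of (human_label, model_intensity) tuples, with
    intensity in the 0.0 to 1.0 range used by the prompt's JSON template.
    Needs at least one positive and one negative record.
    """
    records = list(records)
    positive = [i for label, i in records if label == "positive"]
    negative = [i for label, i in records if label == "negative"]
    return mean(positive) - mean(negative)
```

On a test set where positive and negative messages are matched for strength, a persistently positive `mean_valence_shift` or `intensity_gap` would be consistent with the drift described above. Both numbers only mean something relative to the quality of the human labels.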
Three things caused this failure:
- "Empathetic AI companion" activated D3, shifting the model into the social-expectation pathway
- Example values in the JSON template ("warm, affectionate, grateful") primed the model toward positive outputs
- The model was generating what a helpful AI would say rather than analyzing the evidence
The post emphasizes that small models can perform well on evaluation tasks when prompts deliberately activate D2 (application/instruction following) rather than D3 (emotional inference). The difference between "Analyze the emotional content" and "What should the user be feeling?" determines whether you get reliable classification or biased social expectation responses.
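As an illustration of what a D2-oriented rewrite might look like, the sketch below addresses the three causes listed above: it uses a neutral classifier role, replaces the positively loaded example values with type descriptions, and asks only for evidence in the text. This is an assumption about a reasonable fix, not the prompt the author ended up with; `TONE_LABELS` and `build_d2_prompt` are hypothetical names.

```python
import json

# Hypothetical label set; the post does not specify one.
TONE_LABELS = ["negative", "neutral", "positive"]


def build_d2_prompt(message: str) -> str:
    """Build a classification-style prompt for one message.

    The schema describes each field's type and range instead of
    showing example values that could prime the output.
    """
    schema = {
        "tone": "one of: " + ", ".join(TONE_LABELS),
        "intensity": "number from 0.0 to 1.0",
        "descriptors": "list of words directly supported by the message",
    }
    return (
        "You are a text classifier. Analyze the emotional content of the "
        "message using only evidence in the text.\n"
        "Return JSON matching this schema:\n"
        f"{json.dumps(schema, indent=2)}\n"
        f"Message: {json.dumps(message)}"
    )
```

Calling `build_d2_prompt("I missed the bus again.")` yields a prompt with no role language about empathy and no sample tone values. Whether this removes the bias for a particular model would still need to be checked against labeled messages.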
📖 Read the full source: r/LocalLLaMA