EvalShift: Open-source CLI to Detect LLM Regressions

EvalShift is an open-source Python CLI for detecting regressions when you switch between LLMs or model versions. It runs your golden input suite against both the source and the target model, evaluates the outputs, and produces a local HTML report, with no backend, accounts, or telemetry.
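
To make that flow concrete, here is a minimal sketch of a source-vs-target comparison over a JSONL suite with per-tag (slice) averages. The record fields (`id`, `input`, `tags`), the function names, and the exact-match scorer are assumptions made for illustration; they are not EvalShift's actual schema or API. Stub dictionaries stand in for the two models so the example runs offline.

```python
import json
from collections import defaultdict
from typing import Callable, Dict, List


def load_suite(jsonl_text: str) -> List[dict]:
    """Parse one JSON object per non-empty line (hypothetical schema)."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def compare(
    suite: List[dict],
    call_source: Callable[[str], str],
    call_target: Callable[[str], str],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return the mean source-vs-target score for each tag (slice)."""
    per_tag = defaultdict(list)
    for case in suite:
        src_out = call_source(case["input"])
        tgt_out = call_target(case["input"])
        s = score(src_out, tgt_out)
        for tag in case.get("tags", ["untagged"]):
            per_tag[tag].append(s)
    return {tag: sum(vals) / len(vals) for tag, vals in per_tag.items()}


# Hypothetical three-case suite; each line is one JSON record.
SUITE = """
{"id": "a1", "input": "2+2", "tags": ["math"]}
{"id": "a2", "input": "capital of France", "tags": ["facts"]}
{"id": "a3", "input": "3+3", "tags": ["math"]}
"""

# Stub "models": canned answers keyed by input.
source = {"2+2": "4", "capital of France": "Paris", "3+3": "6"}
target = {"2+2": "4", "capital of France": "Paris", "3+3": "7"}

result = compare(
    load_suite(SUITE),
    source.__getitem__,
    target.__getitem__,
    lambda a, b: 1.0 if a == b else 0.0,  # exact-match agreement
)
print(result)  # {'math': 0.5, 'facts': 1.0}
```

In a real run the stubs would be replaced by calls to the two models being compared, and exact match would give way to one of the structural, semantic, or judge-based evaluators listed below.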

Key features

Source vs target model comparison via LiteLLM
JSONL golden suites with tags/slices
Structural evaluators: JSON schema, regex, length
Semantic evaluator: embedding similarity
LLM-as-judge pairwise evaluation
Tool-call evaluators: tool selection, argument matching, trace structure
Paired statistical tests: t-test / Wilcoxon
Effect sizes: Cohen's d
Multiple-comparison correction: Benjamini-Hochberg
Slice-level breakdowns
Local caching to control cost
Resumable runs
Single-file HTML report + JSON output
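
Two of the statistical items above can be sketched in plain Python: a paired effect size and the Benjamini-Hochberg procedure. This is a generic textbook version under stated assumptions, not EvalShift's implementation. The effect size here is the paired variant (mean of per-case differences divided by their standard deviation), which may differ from the formula the tool uses, and the function names and sample numbers are made up for illustration.

```python
import math
from statistics import mean, stdev
from typing import List


def paired_cohens_d(source: List[float], target: List[float]) -> float:
    """Mean per-case difference (target - source) over its sample std dev."""
    diffs = [t - s for s, t in zip(source, target)]
    sd = stdev(diffs)  # needs at least two cases
    if sd == 0:
        m = mean(diffs)
        return 0.0 if m == 0 else math.copysign(math.inf, m)
    return mean(diffs) / sd


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return, per hypothesis, whether it is rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected


# Illustrative per-case scores for the same four inputs on both models.
source_scores = [0.90, 0.80, 0.85, 0.95]
target_scores = [0.70, 0.75, 0.80, 0.90]
d = paired_cohens_d(source_scores, target_scores)
print(round(d, 3))  # -1.167 (target scores lower than source)

# Illustrative p-values, one per slice or metric.
flags = benjamini_hochberg([0.001, 0.04, 0.03, 0.20])
print(flags)  # [True, False, False, False]
```

The correction matters once results are broken down by slice: testing many slices at once inflates the chance that at least one looks like a regression by luck alone.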

The project's narrow goal is migration safety: "Can I switch models without breaking my prompt/agent behavior?" The author emphasizes catching silent agent regressions: for example, a newer model that produces a decent-looking final answer but skips a required tool call, calls the wrong tool, or mutates arguments.
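
The three failure modes named above (skipped call, wrong tool, mutated arguments) can be illustrated with a small trace comparison. This is a sketch only: the trace format (a list of dicts with `name` and `arguments`), the function name, and the tool names are hypothetical, and the position-by-position comparison is a simplification rather than the tool's actual evaluator.

```python
from typing import List


def diff_tool_calls(source: List[dict], target: List[dict]) -> List[str]:
    """List differences between two tool-call traces, compared step by step."""
    issues = []
    for i, src in enumerate(source):
        if i >= len(target):
            issues.append(f"step {i}: missing call to {src['name']}")
            continue
        tgt = target[i]
        if tgt["name"] != src["name"]:
            issues.append(f"step {i}: expected {src['name']}, got {tgt['name']}")
        elif tgt["arguments"] != src["arguments"]:
            issues.append(f"step {i}: arguments changed for {src['name']}")
    # Anything the target did beyond the source trace is also reported.
    for j in range(len(source), len(target)):
        issues.append(f"step {j}: unexpected extra call to {target[j]['name']}")
    return issues


# Hypothetical traces: the target mutates an argument and skips the refund.
source_trace = [
    {"name": "lookup_order", "arguments": {"order_id": "A17"}},
    {"name": "issue_refund", "arguments": {"order_id": "A17", "amount": 20}},
]
target_trace = [
    {"name": "lookup_order", "arguments": {"order_id": "a17"}},
]

issues = diff_tool_calls(source_trace, target_trace)
for line in issues:
    print(line)
```

Neither problem would show up in a check that only reads the final answer text, which is why trace-level evaluators are treated separately from output evaluators.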

Use cases

Claude 4.5 → Claude 5
GPT-5 → GPT-6
Gemini 2 → Gemini 3
Local model → hosted model

The author is seeking feedback on three points: how useful the tool is for local versus hosted models, which evaluator types matter most for local LLM workflows, and whether tool-call and structured-output regressions are a real pain point. The repo is MIT licensed.

📖 Read the full source: r/LocalLLaMA
