EvalShift: Open-source CLI for detecting LLM regressions during model migration

EvalShift is an open-source Python CLI that detects regressions when you switch between LLMs or model versions. It runs your golden input suite against both the source and target models, evaluates the outputs, and produces a local HTML report, with no backend, accounts, or telemetry.
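To make the workflow concrete, here is a minimal sketch of the idea, not EvalShift's actual code or file format. The JSONL field names (`id`, `input`, `tags`) and the helpers `load_suite`, `run_pair`, and `slice_by_tag` are hypothetical, and the two "models" are stand-in functions where a real run would call a provider.

```python
import json

# Hypothetical golden suite: one JSON object per line, with tags for slicing.
SUITE_JSONL = """\
{"id": "refund-1", "input": "Refund order A1", "tags": ["tools", "billing"]}
{"id": "greet-1", "input": "Say hello in French", "tags": ["chat"]}
"""

def load_suite(text):
    """Parse JSONL text into a list of case dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def run_pair(cases, source_model, target_model):
    """Run every case through both models and keep the outputs side by side."""
    return [
        {
            "id": case["id"],
            "tags": case["tags"],
            "source": source_model(case["input"]),
            "target": target_model(case["input"]),
        }
        for case in cases
    ]

def slice_by_tag(rows, tag):
    """Return only the rows whose case carries the given tag."""
    return [row for row in rows if tag in row["tags"]]

# Stand-in models (assumptions): real ones would send the prompt to an LLM.
def source_model(prompt):
    return prompt.upper()

def target_model(prompt):
    return prompt.lower()

cases = load_suite(SUITE_JSONL)
rows = run_pair(cases, source_model, target_model)
tool_rows = slice_by_tag(rows, "tools")
```

Keeping source and target outputs paired per case is what makes paired statistics and per-slice breakdowns possible later on.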
Key features
- Source vs target model comparison via LiteLLM
- JSONL golden suites with tags/slices
- Structural evaluators: JSON schema, regex, length
- Semantic evaluator: embedding similarity
- LLM-as-judge pairwise evaluation
- Tool-call evaluators: tool selection, argument matching, trace structure
- Paired statistical tests: t-test / Wilcoxon
- Effect sizes: Cohen's d
- Multiple-comparison correction: Benjamini-Hochberg
- Slice-level breakdowns
- Local caching to control cost
- Resumable runs
- Single-file HTML report + JSON output
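The statistical features in the list above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not EvalShift's implementation: the function names are hypothetical, the effect size uses the paired-differences convention for Cohen's d (one of several in use), and the p-values fed to the correction are assumed to come from a paired test such as `scipy.stats.ttest_rel` or `scipy.stats.wilcoxon`.

```python
import math
from statistics import mean, stdev

def cohens_d_paired(source_scores, target_scores):
    """Effect size for paired scores: mean difference / std of differences.

    Assumption: the paired-differences variant of Cohen's d is used.
    Negative values mean the target model scored lower than the source.
    """
    diffs = [t - s for s, t in zip(source_scores, target_scores)]
    avg, sd = mean(diffs), stdev(diffs)
    if sd == 0:
        # All differences identical: no spread to standardize by.
        return 0.0 if avg == 0 else math.copysign(math.inf, avg)
    return avg / sd

def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected

# Hypothetical per-case scores for one evaluator on four golden inputs.
source_scores = [0.8, 0.9, 0.7, 0.6]
target_scores = [0.7, 0.7, 0.6, 0.5]
d = cohens_d_paired(source_scores, target_scores)

# Hypothetical p-values, one per evaluator or slice.
flags = benjamini_hochberg([0.001, 0.04, 0.03, 0.5])
```

The correction matters because a run with many evaluators and slices performs many tests at once; without it, some "significant" regressions would be expected by chance alone.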
The project's narrow goal is migration safety: “Can I switch models without breaking my prompt/agent behavior?” The author emphasizes catching silent agent regressions — e.g., a newer model producing a decent-looking final answer but skipping a required tool call, calling the wrong tool, or mutating arguments.
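The three failure modes named above (a skipped tool call, a wrong tool, mutated arguments) can be checked mechanically by diffing tool-call traces. The sketch below is a hypothetical illustration, not EvalShift's evaluator: the trace shape (a list of dicts with `name` and `arguments`), the function `compare_tool_calls`, and the tool names are all assumptions, and only the first call to each tool is compared.

```python
def first_call_args(calls):
    """Map each tool name to the arguments of its first call in the trace."""
    by_name = {}
    for call in calls:
        by_name.setdefault(call["name"], call["arguments"])
    return by_name

def compare_tool_calls(source_calls, target_calls):
    """Return (finding, tool_name) tuples describing how the target diverges."""
    source_by_name = first_call_args(source_calls)
    target_by_name = first_call_args(target_calls)
    findings = []
    for name, source_args in source_by_name.items():
        if name not in target_by_name:
            findings.append(("missing_tool_call", name))
        elif target_by_name[name] != source_args:
            findings.append(("argument_mismatch", name))
    for name in target_by_name:
        if name not in source_by_name:
            findings.append(("unexpected_tool_call", name))
    return findings

# Hypothetical traces: the target changes an argument type and skips the refund.
source_trace = [
    {"name": "search_orders", "arguments": {"customer_id": 42}},
    {"name": "issue_refund", "arguments": {"order_id": "A1", "amount": 10}},
]
target_trace = [
    {"name": "search_orders", "arguments": {"customer_id": "42"}},
]

findings = compare_tool_calls(source_trace, target_trace)
```

A check like this flags the regression even when the target model's final text answer looks acceptable, which is the "silent" case the author describes.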
Use cases
- Claude 4.5 → Claude 5
- GPT-5 → GPT-6
- Gemini 2 → 3
- Local model → hosted model
The author is seeking feedback on three points: how useful the tool is for local versus hosted models, which evaluator types matter most for local LLM workflows, and whether tool-call and structured-output regressions are a real pain point. The repo is MIT licensed.
📖 Read the full source: r/LocalLLaMA