Methodology for Consistent Benchmarking of Local vs Cloud LLMs

A developer on r/LocalLLaMA has detailed a methodology for obtaining consistent benchmark numbers when comparing local LLMs with cloud APIs, addressing common frustrations with apples-to-oranges comparisons due to differing latencies, scoring, and methodologies.
The Core Problem with Benchmarking
Naive comparisons that fire requests at both local and cloud models measure different things. Cloud APIs involve queueing, load balancing, and routing. Local models involve warm-up, batching, and GPU contention. The solution implemented is to use sequential requests only. While slower—a 60-call benchmark takes ~3 minutes instead of 45 seconds—it ensures each measurement is clean, isolating inference time from queue time.
The Measurement Setup
The setup uses ZenMux as a unified endpoint, providing one base URL for four models: GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and a local Llama 4 quant. The approach works with any OpenAI-compatible endpoint, such as:
- llama.cpp server:
curl http://localhost:8080/v1/chat/completions ... - vLLM:
curl http://localhost:8000/v1/chat/completions ... - Ollama:
curl http://localhost:11434/v1/chat/completions ...
The key is using the same client code, timeout settings, and retry logic for everything.
How the Measurement Works
The system is structured into five modules: YAML Config → BenchRunner → AIClient → Analyzer → Reporter.
The YAML config defines tasks and models. Example:
suite: coding-benchmark
models:
- gpt-5.4
- claude-sonnet-4.6
- gemini-3.1-pro
- llama-4
runs_per_model: 3
tasks:
- name: fizzbuzz
prompt: "Write a Python function that prints FizzBuzz for numbers 1-100"
- name: refactor-suggestion
prompt: "Given this code, suggest improvements:\n\ndef calc(x):\n if x == 0: return 0\n if x == 1: return 1\n return calc(x-1) + calc(x-2)"The BenchRunner takes the Cartesian product of tasks × models × runs and calls the API sequentially, recording latency, prompt tokens, and completion tokens.
The Scoring Part
Quality scoring is rule-based, not LLM-as-judge, to avoid self-preference bias and ensure reproducibility. The _quality_score function uses three signals:
- Response length: 50–3000 characters scores 4.0, shorter scores 1.0, longer scores 3.0.
- Formatting: Presence of bullet points adds up to 3.0 points.
- Code presence: Detecting code blocks or function definitions adds 2.0 points.
Maximum score is 9.0. This reliably separates "good structured response" from "garbage/empty/hallucinated" for relative ranking. For latency, the 95th percentile response time (P95) is also calculated.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Building a Bridge for Two Telegram Bots in One Group Chat: Delivery Semantics Over HTTP
A developer shares a practical approach to connect two independent Telegram bots in the same group chat, tackling Telegram's bot-to-bot delivery gaps with HTTP relays, ACKs, deduplication, and strict scoped feeds.

Structuring Claude Code Agents with CLAUDE.md and .claude/ Directory Patterns
A developer shares their approach to running multiple AI agents using Claude Code, with each agent having its own directory containing a CLAUDE.md file and a .claude/ directory with rules and skills. The key insight is separating always-on context from on-demand workflows to optimize token usage and response quality.

Fix for Claude Desktop Workspace VM Service Issue on Windows 11 Home
A community-developed fix addresses the 'VM service not running' error in Claude Desktop's workspace feature on Windows 11 Home, with manual PowerShell commands and an automated tool available on GitHub.

Claude Code Skills vs. Custom Agents: A Mental Model Based on Task Consistency
A Reddit user clarifies the distinction between Claude Code skills and custom agents: skills execute the same steps every time, while custom agents require reasoning and adaptation. The post also covers parallel subagents, delegation, hooks, and building blocks.