Homelab Developer Benchmarks 19 Local LLMs with 45 Practical Tests on AMD Strix Halo

Practical Benchmarking for Real-World LLM Use Cases
A developer with a homelab tested local LLMs extensively using a custom 45-test benchmark suite built around actual use cases rather than generic academic benchmarks. The tests ran on an AMD Strix Halo system with a Ryzen AI MAX+ 395, 128GB RAM, and 96GB shared VRAM, using Vulkan/RADV with llama-server (kyuz0 Docker image).
Why Custom Benchmarks Matter
The developer uses Claude Opus for interactive coding but needs local models for 24/7 services including:
- Email classification running every 15 minutes to sort 50+ emails
- Camera notifications using vision models to describe motion alerts
- Meal planning with dietary constraints
- Finance analysis for tax scenarios and portfolio projections
- Home Assistant automation generation and validation
These tasks require fast, reliable models with good structured-output capabilities, which generic benchmarks such as MMLU don't adequately measure.
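To illustrate why structured output matters for an unattended service such as the email classifier, here is a minimal sketch of a validator that rejects any model reply that is not usable JSON. The category labels and the `category`/`reason` field names are assumptions made for illustration; the source does not describe the developer's actual schema.

```python
import json
from typing import Optional

# Hypothetical label set; the source does not list the real categories.
ALLOWED_CATEGORIES = {"action_required", "newsletter", "receipt", "spam", "personal"}

def parse_classification(raw: str) -> Optional[dict]:
    """Return a cleaned classification dict, or None if the reply is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # model wrapped the JSON in prose, or emitted invalid JSON
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object
    category = data.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        return None  # missing or invented label
    return {"category": category, "reason": str(data.get("reason", ""))}

print(parse_classification('{"category": "spam", "reason": "bulk sender"}'))
print(parse_classification("Sure! Here is the classification you asked for."))
```

A service that runs every 15 minutes can retry or fall back whenever this returns `None`, which is why a model's format reliability counts as much as its raw accuracy here.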
The 45-Test Suite
The benchmark includes tests across 12 categories, each scored 0-10 by Claude Opus 4.6 against specific rubrics:
- Coding (4 tests): Docker Compose, systemd services, Python scripts, code review
- Homelab ops (6 tests): Memory analysis, OOM debugging, disk triage, network debug, log parsing
- Tool calling (5 tests): Proxmox pct/qm commands, SSH chains, Docker ops, git workflows
- Food/meal planning (6 tests): JSON meal plans, prep schedules, recipe scaling, shopping lists, nutrition
- Finance (5 tests): Tax calculations, portfolio analysis, FIRE projections, tax-loss harvesting
- Email classification (3 tests): Category assignment, ambiguous cases, unsubscribe decisions
- Home Assistant (3 tests): Automation YAML, template sensors, conditions
- Math (4 tests): Mortgage payoff, probability, number theory, tax optimization
- Reasoning (3 tests): Energy bills, statistics, logic constraints
- Instruction following (3 tests): Format compliance, JSON output, negative constraints
- Long context (1 test): Extract facts from 8K-token infrastructure doc
- Speed (2 tests): Time-to-first-token, sustained generation
Nine tests covering the developer's most common use cases are marked "critical" and weighted 2x, so the maximum possible score is 540 (36 tests × 10 points plus 9 tests × 20 points).
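The arithmetic behind the 540-point ceiling can be checked with a short sketch. The per-category counts come from the list above; the function names and the example test IDs are hypothetical, since the source does not say which nine tests are critical.

```python
# Test counts per category, as listed in the article.
CATEGORY_COUNTS = {
    "coding": 4, "homelab_ops": 6, "tool_calling": 5, "food": 6,
    "finance": 5, "email": 3, "home_assistant": 3, "math": 4,
    "reasoning": 3, "instruction_following": 3, "long_context": 1, "speed": 2,
}

def max_score(total_tests: int, critical_tests: int,
              per_test: int = 10, critical_weight: int = 2) -> int:
    """Highest achievable score when critical tests count critical_weight times."""
    normal = total_tests - critical_tests
    return normal * per_test + critical_tests * per_test * critical_weight

def weighted_total(scores: dict, critical_ids: set, critical_weight: int = 2) -> int:
    """Sum 0-10 scores, multiplying those whose test ID is in critical_ids."""
    return sum(s * (critical_weight if tid in critical_ids else 1)
               for tid, s in scores.items())

total = sum(CATEGORY_COUNTS.values())
print(total, max_score(total, 9))

# Hypothetical test IDs, purely to show the weighting.
print(weighted_total({"email_1": 8, "math_1": 6}, critical_ids={"email_1"}))
```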
Testing Methodology
Each test has a specific rubric defining what constitutes a good answer. For example, the memory analysis test requires correctly identifying that "available" memory (22G) is the real free metric, not "free" (5.7G), and that swap usage is non-critical. The tax calculation test checks for correct AGI, taxable income, and bracket math. All raw responses and rubrics are saved for cross-checking.
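One way to picture a rubric is as a small record of criteria that is rendered into a grading prompt for the judge model. The sketch below is an assumption about how such a rubric could be represented, not the developer's actual format; the `Rubric` class, `build_judge_prompt`, and the test ID are hypothetical, while the criteria text paraphrases the memory analysis example above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Rubric:
    test_id: str          # hypothetical identifier
    criteria: List[str]   # what a good answer must contain
    critical: bool = False

def build_judge_prompt(rubric: Rubric, response: str) -> str:
    """Render a rubric and a model response into a plain-text grading prompt."""
    lines = [f"Score the response to test '{rubric.test_id}' from 0 to 10.",
             "Criteria:"]
    lines += [f"- {c}" for c in rubric.criteria]
    lines += ["", "Response:", response, "", "Reply with the integer score only."]
    return "\n".join(lines)

MEMORY_RUBRIC = Rubric(
    test_id="homelab_memory_analysis",
    criteria=[
        'Identifies "available" (22G) as the real free-memory metric',
        'Does not treat "free" (5.7G) as the usable amount',
        "States that the swap usage is non-critical",
    ],
)

print(build_judge_prompt(MEMORY_RUBRIC, "The box has 22G available, so it is fine."))
```

Keeping the rubric as data next to the raw response makes the cross-checking mentioned above straightforward: the same record can be re-scored by a different judge or by hand.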
Models Tested
The developer tested 19 model configurations across 6 families on Vulkan with llama-server, including:
- Qwen family: Qwen3.5-122B-A10B (MoE, 10B active), which the developer previously used in production, and Qwen3-Coder-Next 80B-A3B (3B active)
- Gemma 4 26B-A4B: finished on top once two separate bugs that initially made it appear broken were fixed
The developer notes that this is not rigorous academic methodology but practical testing to determine which models work best for specific homelab tasks.
📖 Read the full source: r/LocalLLaMA