Benchmarking 88 Small GGUF Models on a 16GB Mac Mini M4

An automated pipeline was developed to download, benchmark, upload, and delete GGUF models in waves on a Mac Mini M4 with 16GB unified memory. The pipeline tested 88 models to find suitable local LLMs for this hardware configuration.
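The download, benchmark, upload, delete cycle can be sketched roughly as below. This is a minimal illustration, not the author's pipeline: the function name `run_in_waves`, the `wave_size` parameter, and the four callables are all hypothetical placeholders for whatever the real scripts do.

```python
from typing import Callable, Iterable, List


def run_in_waves(
    models: Iterable[str],
    wave_size: int,
    download: Callable[[str], str],     # hypothetical: model name -> local path
    benchmark: Callable[[str], dict],   # hypothetical: local path -> metrics
    upload: Callable[[dict], None],     # hypothetical: persist metrics somewhere
    delete: Callable[[str], None],      # hypothetical: remove the local file
) -> List[dict]:
    """Process models in fixed-size waves so disk usage stays bounded."""
    models = list(models)
    results = []
    for start in range(0, len(models), wave_size):
        wave = models[start:start + wave_size]
        # Fetch every model in this wave before benchmarking any of them.
        paths = {name: download(name) for name in wave}
        for name, path in paths.items():
            try:
                result = benchmark(path)
                result["model"] = name
                upload(result)
                results.append(result)
            finally:
                # Free disk space for this model even if its benchmark raised.
                delete(path)
    return results
```

Passing the steps in as callables keeps the loop testable without touching the network or the disk; a real run would plug in actual download and llama-server calls.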
Key Findings
- 9 of the 88 models are unusable on 16GB RAM - Any model whose weights plus KV cache exceed approximately 14GB causes memory thrashing, with time to first token (TTFT) above 10 seconds or throughput below 0.1 tokens/second. This includes all dense models of 27B parameters or more.
- Only 4 models sit on the Pareto frontier of throughput vs quality - All four are quantizations of LFM2-8B-A1B, LiquidAI's MoE model. Because only about 1B parameters are active per token, it reaches 12-20 tokens/second where dense 8B models top out at 5-7 tokens/second.
- Context scaling from 1k to 4k is flat - Most models show zero throughput degradation, with some LFM2 variants actually speeding up at 4k context.
- Concurrency scaling is poor (0.57x at concurrency 2 vs ideal 2.0x) - The Mac Mini is memory-bandwidth limited, so running one request at a time is recommended.
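The "weights plus KV cache" budget from the first finding can be approximated ahead of time. The sketch below uses the standard KV cache size estimate for attention-based transformers; the layer count, head count, and head dimension in the example are made-up values, and treating the approximate 14GB threshold as 14 GiB is an assumption. Architectures that mix attention with other layer types will differ.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV cache size for a plain attention transformer.

    The leading factor 2 accounts for storing both keys and values;
    bytes_per_elem=2 assumes an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def fits_in_budget(weights_bytes, kv_bytes, budget_bytes=14 * 1024**3):
    """Return True if weights plus KV cache stay within the memory budget."""
    return weights_bytes + kv_bytes <= budget_bytes


# Hypothetical dense model: 32 layers, 8 KV heads, head_dim 128, 4k context.
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=4096)
print(kv / 1024**2)                       # KV cache size in MiB
print(fits_in_budget(5 * 1024**3, kv))    # 5 GiB of weights
print(fits_in_budget(16 * 1024**3, kv))   # 16 GiB of weights
```

The weights term is simply the GGUF file size, so a check like this can filter out models before downloading them.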
Pareto Frontier Models
No other tested model beats any of these four on both speed and quality at once:
- LFM2-8B-A1B-Q5_K_M (unsloth): 14.24 TPS average, 44.6 quality score
- LFM2-8B-A1B-Q8_0 (unsloth): 12.37 TPS average, 46.2 quality score
- LFM2-8B-A1B-UD-Q8_K_XL (unsloth): 12.18 TPS average, 47.9 quality score
- LFM2-8B-A1B-Q8_0 (LiquidAI): 12.18 TPS average, 51.2 quality score
Quality evaluation used compact subsets (20 GSM8K + 60 MMLU questions) - directionally useful for ranking but not publication-grade absolute numbers.
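For readers unfamiliar with the term, a model is on the Pareto frontier when no other model is at least as good on both axes and strictly better on one. A minimal sketch of that filter follows; the model names and numbers in the example are invented for illustration and are not taken from the benchmark CSV.

```python
def pareto_frontier(points):
    """Return the non-dominated entries from a list of (name, tps, quality)."""
    frontier = []
    for name, tps, quality in points:
        # A point is dominated if some other point is >= on both axes
        # and strictly > on at least one of them.
        dominated = any(
            o_tps >= tps and o_q >= quality and (o_tps > tps or o_q > quality)
            for _, o_tps, o_q in points
        )
        if not dominated:
            frontier.append((name, tps, quality))
    return frontier


# Hypothetical data, for illustration only.
models = [
    ("fast-small", 20.0, 40.0),
    ("balanced", 14.0, 45.0),
    ("slow-good", 6.0, 50.0),
    ("slow-bad", 5.0, 42.0),   # dominated by "balanced"
]
print([name for name, _, _ in pareto_frontier(models)])
```

The quadratic scan is fine for 88 models; a sort-based sweep would only matter at much larger scale.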
Recommendations
For best quality: LFM2-8B-A1B-Q8_0 (the LiquidAI build has the highest quality score listed above, 51.2). For speed: Q5_K_M. For balance: UD-Q6_K_XL.
Technical Details
- Hardware: Mac Mini M4, 16GB unified memory, macOS 15.x
- Software: llama-server (llama.cpp)
- Methodology: Throughput numbers are the p50 (median) over multiple requests
- Data: All data is reproducible from artifacts in the repository
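As a rough illustration of the p50 methodology, the median throughput can be computed from per-request token counts and timings. The function name `p50_tps` and the sample values are hypothetical; the repository's own scripts may aggregate differently.

```python
import statistics


def p50_tps(samples):
    """Median tokens/second over a list of (generated_tokens, elapsed_seconds)."""
    rates = [tokens / seconds for tokens, seconds in samples if seconds > 0]
    # statistics.median raises StatisticsError if no valid samples remain.
    return statistics.median(rates)


# Hypothetical measurements from three requests.
print(p50_tps([(100, 10.0), (120, 10.0), (300, 10.0)]))
```

Using the median rather than the mean keeps a single unusually fast or slow request from skewing the reported number.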
The full pipeline is automated and open source. CSV data with all 88 models and benchmark scripts are available in the repository.
📖 Read the full source: r/LocalLLaMA