RTX 5060 Ti 16GB Local LLM Benchmarks: 30B Models Still Lead for Coding

RTX 5060 Ti 16GB Local LLM Performance Findings
These benchmarks were run on an RTX 5060 Ti 16GB with 32GB DDR4 RAM using llama-server b8373 (46dba9fce), with a focus on local LLM coding workflows. llama.cpp was launched on its fast path with fa=on, ngl=auto, threads=8, and a quantized KV cache (-ctk q8_0 -ctv q8_0).
Model Performance Results
The benchmark compared multiple quantized models with these key findings:
- Best default coding model: Unsloth Qwen3-Coder-30B UD-Q3_K_XL
- Best higher-context coding option: Same Unsloth 30B model at 96k context
- Best fast 35B coding option: Unsloth Qwen3.5-35B UD-Q2_K_XL
Performance Metrics
Token generation speeds from local testing:
- Jackrong Qwen 3.5 4B Q5_K_M: 88 tok/s
- LuffyTheFox Qwen 3.5 9B Q4_K_M: 64 tok/s
- Jackrong Qwen 3.5 27B Q3_K_S: ~20 tok/s
- Unsloth Qwen3-Coder-30B UD-Q3_K_XL: 76.3 tok/s
- Unsloth Qwen 3.5 35B UD-Q2_K_XL: 80.1 tok/s
Cross-Platform Comparison
Matched tests with 20 questions, 32k context, and max_tokens=800 showed:
- Unsloth Qwen3-Coder-30B UD-Q3_K_XL: Windows: 79.5 tok/s, quality 7.94 | Ubuntu: 76.3 tok/s, quality 8.14
- Unsloth Qwen3.5-35B UD-Q2_K_XL: Windows: 72.3 tok/s, quality 7.40 | Ubuntu: 80.1 tok/s, quality 7.39
- Jackrong Qwen3.5-27B Claude-Opus Distilled Q3_K_S: Windows: 19.9 tok/s, quality 8.85 | Ubuntu: ~20.0 tok/s, quality 8.21
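The Windows and Ubuntu figures above are easier to compare as relative changes. The following is a minimal sketch, not part of the original benchmark, that computes the Ubuntu speed change in percent and the quality change in points, using only the numbers listed above. The dictionary layout and the function name are illustrative choices, and the 27B Ubuntu speed is approximate in the source (~20.0).

```python
# Throughput (tok/s) and quality scores copied from the matched-test results above.
# The 27B Ubuntu speed is reported as "~20.0" in the source, so treat it as approximate.
RESULTS = {
    "Qwen3-Coder-30B UD-Q3_K_XL": {"windows": (79.5, 7.94), "ubuntu": (76.3, 8.14)},
    "Qwen3.5-35B UD-Q2_K_XL": {"windows": (72.3, 7.40), "ubuntu": (80.1, 7.39)},
    "Qwen3.5-27B Distilled Q3_K_S": {"windows": (19.9, 8.85), "ubuntu": (20.0, 8.21)},
}


def ubuntu_vs_windows(results):
    """Return {model: (speed change in %, quality change in points)}, Ubuntu relative to Windows."""
    deltas = {}
    for model, per_os in results.items():
        win_speed, win_quality = per_os["windows"]
        ubu_speed, ubu_quality = per_os["ubuntu"]
        speed_pct = (ubu_speed - win_speed) / win_speed * 100.0
        deltas[model] = (round(speed_pct, 1), round(ubu_quality - win_quality, 2))
    return deltas


if __name__ == "__main__":
    for model, (speed_pct, quality_delta) in ubuntu_vs_windows(RESULTS).items():
        print(f"{model}: speed {speed_pct:+.1f}%, quality {quality_delta:+.2f}")
```

On these numbers the 30B coder model is about 4% slower on Ubuntu, the 35B model about 11% faster, and the 27B model essentially unchanged in speed, so neither operating system wins across the board.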
Configuration Notes
The 30B coder path used: jinja, reasoning-budget 0, reasoning-format none. The 35B UD path used: c=262144, n-cpu-moe=8. For the 35B Q4_K_M stable tune, settings were: -ngl 26 -c 131072 --fit on --fit-ctx 131072 --fit-target 512M.
Notably, the 35B Q4_K_M model required specific tuning to run stably on this card, yet it still did not outperform the older UD-Q2_K_XL path in practical use. The author found that neither the smaller 9B route nor the heavier 35B Q4_K_M experiment turned out to be the strongest real-world pick, despite expectations.
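The launch settings described above can be collected into one place so each configuration is reproducible. The sketch below only assembles an argument list for llama-server and does not run anything. Several things here are assumptions rather than details from the source: the model path is a hypothetical placeholder, the shorthand settings (fa=on, ngl=auto, threads=8, jinja, n-cpu-moe=8) are mapped to the flag spellings `-fa`, `-ngl`, `--threads`, `--jinja`, and `--n-cpu-moe`, and the shared base settings are assumed to apply to every profile unless a profile overrides them. Flag values are copied from the text as written and have not been checked against build b8373.

```python
import shlex

# Hypothetical placeholder: the source does not give model file paths.
MODEL_PATH = "models/model.gguf"

# Shared settings from the setup description (assumed to apply to every profile).
BASE = {"-fa": "on", "-ngl": "auto", "--threads": "8", "-ctk": "q8_0", "-ctv": "q8_0"}

# Per-model settings from the configuration notes; None marks a value-less switch.
PROFILES = {
    "30b-coder": {"--jinja": None, "--reasoning-budget": "0", "--reasoning-format": "none"},
    "35b-ud": {"-c": "262144", "--n-cpu-moe": "8"},
    "35b-q4km": {"-ngl": "26", "-c": "131072", "--fit": "on",
                 "--fit-ctx": "131072", "--fit-target": "512M"},
}


def build_command(profile, model_path=MODEL_PATH):
    """Merge the base settings with one profile; profile values override the base."""
    options = {**BASE, **PROFILES[profile]}
    argv = ["llama-server", "-m", model_path]
    for flag, value in options.items():
        argv.append(flag)
        if value is not None:
            argv.append(value)
    return argv


if __name__ == "__main__":
    for name in PROFILES:
        print(f"{name}: {shlex.join(build_command(name))}")
```

Merging dictionaries means the Q4_K_M profile's `-ngl 26` replaces the base `-ngl auto` instead of appearing twice on the command line.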
Ubuntu Performance Testing
Additional focused testing on Ubuntu with the Jackrong 27B model showed minimal variation:
- -fa on, auto parallel: 19.95 tok/s
- -fa auto, auto parallel: 19.56 tok/s
- -fa on, --parallel 1: 19.26 tok/s
Flash-attention settings and parallel processing parameters had negligible impact on this particular model's performance.
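To put a number on that variation, the three Ubuntu runs can be reduced to a single relative spread. This is a small illustrative calculation using only the figures above; the function name and the choice of (max - min) / mean as the measure are assumptions made here, not something the source reports.

```python
from statistics import mean

# tok/s for the Jackrong 27B model on Ubuntu, copied from the three runs above.
RUNS = {
    "-fa on, auto parallel": 19.95,
    "-fa auto, auto parallel": 19.56,
    "-fa on, --parallel 1": 19.26,
}


def relative_spread(values):
    """Return (max - min) as a percentage of the mean."""
    vals = list(values)
    return (max(vals) - min(vals)) / mean(vals) * 100.0


if __name__ == "__main__":
    print(f"spread: {relative_spread(RUNS.values()):.1f}% of mean throughput")
```

The spread comes to roughly 3.5% of the mean. Whether that is within noise depends on run-to-run variance, which the source does not report.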
📖 Read the full source: r/LocalLLaMA