RTX 5060 Ti 16GB Local LLM Benchmarks: 30B Models Still Lead for Coding

By OpenClawRadar | Published: April 19, 2026

RTX 5060 Ti 16GB Local LLM Performance Findings

Tests ran on an RTX 5060 Ti 16GB with 32GB of DDR4 RAM, using llama-server build b8373 (46dba9fce) from llama.cpp, and show how local LLMs behave in practical coding workflows. Launch settings for the fast path were flash attention on (fa=on), ngl=auto, threads=8, and a q8_0-quantized KV cache (-ctk q8_0 -ctv q8_0).
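As a rough illustration, the launch settings above can be assembled into a llama-server command line. This is a minimal sketch, not the author's actual script: the function name and the model path are hypothetical, and the flag spellings (-fa, -ngl, -t) are an assumed reading of the shorthand fa=on, ngl=auto, threads=8.

```python
import shlex


def build_llama_server_args(model_path, threads=8, kv_type="q8_0"):
    """Return an argv list for llama-server with the reported fast-path settings.

    model_path is a placeholder; the source does not give file locations.
    """
    return [
        "llama-server",
        "-m", model_path,      # hypothetical path, not from the source
        "-fa", "on",           # flash attention on (fa=on)
        "-ngl", "auto",        # let llama.cpp choose the GPU layer count (ngl=auto)
        "-t", str(threads),    # threads=8
        "-ctk", kv_type,       # K cache type, q8_0 in the source
        "-ctv", kv_type,       # V cache type, q8_0 in the source
    ]


if __name__ == "__main__":
    # Print a shell-safe command string for inspection before running it.
    print(shlex.join(build_llama_server_args("models/model.gguf")))
```

Building the command as a list keeps each flag and value separate, so it can be passed directly to subprocess.run without shell quoting issues.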

Model Performance Results

The benchmark compared multiple quantized models with these key findings:

Best default coding model: Unsloth Qwen3-Coder-30B UD-Q3_K_XL
Best higher-context coding option: Same Unsloth 30B model at 96k context
Best fast 35B coding option: Unsloth Qwen3.5-35B UD-Q2_K_XL

Performance Metrics

Token generation speeds from local testing (the 27B, 30B, and 35B figures match the Ubuntu results in the cross-platform comparison):

Jackrong Qwen 3.5 4B Q5_K_M: 88 tok/s
LuffyTheFox Qwen 3.5 9B Q4_K_M: 64 tok/s
Jackrong Qwen3.5-27B Claude-Opus Distilled Q3_K_S: ~20 tok/s
Unsloth Qwen3-Coder-30B UD-Q3_K_XL: 76.3 tok/s
Unsloth Qwen3.5-35B UD-Q2_K_XL: 80.1 tok/s
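To put these speeds in practical terms, the sketch below converts each tok/s figure into the approximate time needed to generate an 800-token reply (the max_tokens value used in the matched tests). It assumes a constant generation rate and ignores prompt processing time, so the results are rough estimates only. The dictionary keys are shortened labels chosen for illustration.

```python
# Generation speeds in tok/s as reported above (the 27B figure is approximate).
SPEEDS_TOK_S = {
    "4B Q5_K_M": 88.0,
    "9B Q4_K_M": 64.0,
    "27B Q3_K_S": 20.0,
    "30B UD-Q3_K_XL": 76.3,
    "35B UD-Q2_K_XL": 80.1,
}


def seconds_to_generate(tokens, tok_per_s):
    """Time to emit `tokens` tokens at a steady rate of `tok_per_s`."""
    return tokens / tok_per_s


def generation_times(tokens=800):
    """Map each model label to its estimated generation time in seconds."""
    return {
        name: round(seconds_to_generate(tokens, speed), 1)
        for name, speed in SPEEDS_TOK_S.items()
    }


if __name__ == "__main__":
    for name, secs in generation_times().items():
        print(f"{name}: about {secs} s for 800 tokens")
```

By this estimate the 27B model takes about 40 seconds for a full-length reply, while the 4B, 30B, and 35B models all finish in roughly 9 to 10.5 seconds.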

Cross-Platform Comparison

Matched tests with 20 questions, 32k context, and max_tokens=800 showed:

Unsloth Qwen3-Coder-30B UD-Q3_K_XL: Windows: 79.5 tok/s, quality 7.94 | Ubuntu: 76.3 tok/s, quality 8.14
Unsloth Qwen3.5-35B UD-Q2_K_XL: Windows: 72.3 tok/s, quality 7.40 | Ubuntu: 80.1 tok/s, quality 7.39
Jackrong Qwen3.5-27B Claude-Opus Distilled Q3_K_S: Windows: 19.9 tok/s, quality 8.85 | Ubuntu: ~20.0 tok/s, quality 8.21
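The platform differences are easier to compare as relative changes. The following sketch computes, for each model, the percentage change in speed and the absolute change in quality score when moving from Windows to Ubuntu. The numbers are copied from the list above (the Ubuntu 27B speed is approximate); the labels and function name are illustrative.

```python
# (tok/s, quality) per platform, taken from the matched tests above.
RESULTS = {
    "30B UD-Q3_K_XL": {"windows": (79.5, 7.94), "ubuntu": (76.3, 8.14)},
    "35B UD-Q2_K_XL": {"windows": (72.3, 7.40), "ubuntu": (80.1, 7.39)},
    "27B Q3_K_S": {"windows": (19.9, 8.85), "ubuntu": (20.0, 8.21)},
}


def ubuntu_vs_windows(results):
    """Return speed change (%) and quality change, Ubuntu relative to Windows."""
    deltas = {}
    for model, platforms in results.items():
        win_speed, win_quality = platforms["windows"]
        ubu_speed, ubu_quality = platforms["ubuntu"]
        deltas[model] = {
            "speed_change_pct": round((ubu_speed - win_speed) / win_speed * 100, 1),
            "quality_change": round(ubu_quality - win_quality, 2),
        }
    return deltas


if __name__ == "__main__":
    for model, delta in ubuntu_vs_windows(RESULTS).items():
        print(model, delta)
```

Neither platform wins across the board: Ubuntu is about 4% slower on the 30B model, about 11% faster on the 35B model, and essentially tied on the 27B model, where the quality score drops by 0.64.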

Configuration Notes

The 30B coder path ran with jinja, reasoning-budget 0, and reasoning-format none. The 35B UD path used c=262144 and n-cpu-moe=8. The stable tune for the 35B Q4_K_M model was: -ngl 26 -c 131072 --fit on --fit-ctx 131072 --fit-target 512M.

Notably, the 35B Q4_K_M model needed this specific tuning to run stably on the card, yet it still did not outperform the older UD-Q2_K_XL path in practical use. The author concluded that neither the smaller 9B route nor the heavier 35B Q4_K_M experiment was the strongest real-world pick, despite expectations.
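The three configurations described in this section can be kept as named flag sets and appended to a base llama-server command. This is a sketch under assumptions: the profile names are hypothetical, and the double-dash spellings for the 30B and 35B UD paths are an assumed reading of the shorthand in the text. The Q4_K_M flags are copied as written.

```python
# Extra llama-server arguments per configuration; profile names are illustrative.
PROFILES = {
    # 30B coder path: jinja, reasoning-budget 0, reasoning-format none
    "30b-coder": ["--jinja", "--reasoning-budget", "0", "--reasoning-format", "none"],
    # 35B UD path: c=262144, n-cpu-moe=8
    "35b-ud": ["-c", "262144", "--n-cpu-moe", "8"],
    # 35B Q4_K_M stable tune, flags as quoted in the text
    "35b-q4km": [
        "-ngl", "26", "-c", "131072",
        "--fit", "on", "--fit-ctx", "131072", "--fit-target", "512M",
    ],
}


def extra_args(profile):
    """Return a copy of the extra arguments for a named profile."""
    try:
        return list(PROFILES[profile])
    except KeyError:
        raise ValueError(f"unknown profile: {profile!r}") from None


if __name__ == "__main__":
    for name in PROFILES:
        print(name, " ".join(extra_args(name)))
```

Returning a copy means a caller can extend the list with a model path or other flags without altering the stored profile.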

Ubuntu Performance Testing

Additional focused testing on Ubuntu with the Jackrong 27B model showed minimal variation:

-fa on, auto parallel: 19.95 tok/s
-fa auto, auto parallel: 19.56 tok/s
-fa on, --parallel 1: 19.26 tok/s

Flash-attention mode and the --parallel setting made little difference for this model: the three runs span 19.26 to 19.95 tok/s, a spread of about 3.5%.
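The size of that variation can be checked with a few lines of arithmetic. The sketch below takes the three Ubuntu runs listed above and reports the gap between the fastest and slowest run, both in tok/s and as a percentage of the fastest run. The function name is illustrative.

```python
# Ubuntu runs of the Jackrong 27B model, tok/s as reported above.
RUNS = {
    "-fa on, auto parallel": 19.95,
    "-fa auto, auto parallel": 19.56,
    "-fa on, --parallel 1": 19.26,
}


def spread(runs):
    """Return (absolute gap in tok/s, gap as a percentage of the fastest run)."""
    fastest = max(runs.values())
    slowest = min(runs.values())
    gap = fastest - slowest
    return gap, gap / fastest * 100


if __name__ == "__main__":
    gap, pct = spread(RUNS)
    print(f"gap: {gap:.2f} tok/s ({pct:.1f}% of the fastest run)")
```

With single runs per setting, a gap of this size could plausibly be run-to-run noise rather than a real effect of the flags.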

Source: r/LocalLLaMA
