Steelman R5: Fine-tuned 14B Model Outperforms Claude Opus on Ada Code Generation

Model and Training Details
The Steelman R5 model is a fine-tuned version of Qwen2.5-Coder-14B-Instruct specifically optimized for Ada code generation. Training used QLoRA 4-bit via Unsloth with TRL SFTTrainer on a dataset of 3,430 Ada/SPARK instruction pairs where every training example passes gnatmake -gnat2022 -gnatwa compilation.
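The compile filter described above could look roughly like the sketch below. It is not the author's pipeline: the `response` and `unit` field names are hypothetical, each example is assumed to be a single-file main program, and only the gnatmake exit status is checked (the post does not say whether warnings raised by -gnatwa also disqualify an example).

```python
import subprocess
import tempfile
from pathlib import Path

GNAT_FLAGS = ["-gnat2022", "-gnatwa"]  # flags quoted in the post


def compiles(source: str, unit_name: str, run=subprocess.run) -> bool:
    """Return True if gnatmake exits with status 0 for this source.

    GNAT expects the file to be named after the unit in lowercase
    (e.g. procedure Hello -> hello.adb), so the caller supplies the name.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"{unit_name.lower()}.adb"
        path.write_text(source)
        result = run(
            ["gnatmake", *GNAT_FLAGS, path.name],
            cwd=tmp,
            capture_output=True,
            text=True,
        )
    return result.returncode == 0


def filter_pairs(pairs, run=subprocess.run):
    """Keep only pairs whose 'response' compiles (field names are hypothetical)."""
    return [p for p in pairs if compiles(p["response"], p["unit"], run=run)]
```

The `run` parameter exists only so the filter can be exercised without a GNAT toolchain installed; in normal use it defaults to `subprocess.run`.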
Training configuration: LoRA rank 32, alpha 64, targeting the q/k/v/o/gate/up/down projections. Each round retrained the model from the base checkpoint on the accumulated dataset, because continuing training from the previous round's adapter caused catastrophic forgetting at R2. Training ran for 1 epoch with a learning rate of 2e-5 and a constant schedule, taking about 49 minutes per round on a rented H100. There were five rounds in total (R1–R5), with R2 discarded.
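For reference, the reported hyperparameters can be collected in one place. This is a plain-Python sketch, not the author's training script: the `LoraSettings` class is hypothetical, and the `*_proj` module names are an assumption based on the usual naming of Qwen2-style layers. The `scaling` property shows the standard LoRA factor of alpha divided by rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hyperparameters as reported in the post (class itself is hypothetical)."""
    rank: int = 32
    alpha: int = 64
    # Assumed module names for the q/k/v/o/gate/up/down projections
    target_modules: tuple = (
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    )
    learning_rate: float = 2e-5
    lr_schedule: str = "constant"
    epochs: int = 1

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank
        return self.alpha / self.rank


settings = LoraSettings()
print(settings.scaling)              # 2.0
print(len(settings.target_modules))  # 7
```

With alpha set to twice the rank, the adapter update is scaled by 2, a common choice in LoRA setups.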
Benchmark Results
Custom Ada Compilation Benchmark (1,000 prompts, first-attempt clean compile):
- Steelman R5 (14B): 68.6% compile rate
- Claude Opus 4.6: 42.1% compile rate
- Claude Sonnet 4.6: 37.2% compile rate
- Qwen2.5-Coder-14B (base, untuned): ~35% compile rate
- Claude Sonnet 4: 27.5% compile rate
MultiPL-E HumanEval-Ada (157 problems, pass@1):
- Steelman R5: 47.1% pass@1, 74.5% compile rate
- Qwen2.5-Coder-14B (base): 34.4% pass@1, 51.0% compile rate
According to the author, these are the first published Ada pass@1 results on HumanEval for any open model.
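To make the HumanEval-Ada percentages concrete, the sketch below converts them back into problem counts. The counts are not from the post: they are back-calculated as the only integers out of 157 that round to the reported figures, assuming one sample per problem (so pass@1 is simply the fraction of problems solved).

```python
TOTAL = 157  # HumanEval-Ada problems


def rate(successes: int, total: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * successes / total, 1)


# Back-calculated counts (assumption, see lead-in)
r5_pass, r5_compile = 74, 117
base_pass, base_compile = 54, 80

print(rate(r5_pass, TOTAL))       # 47.1
print(rate(r5_compile, TOTAL))    # 74.5
print(rate(base_pass, TOTAL))     # 34.4
print(rate(base_compile, TOTAL))  # 51.0

# Share of Steelman R5's compiling solutions that also pass the tests
print(rate(r5_pass, r5_compile))  # 63.2
```

Under these assumptions, roughly 63% of Steelman R5's compiling HumanEval-Ada solutions also pass the tests, which is why compile rate alone overstates correctness.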
Usage and Availability
Run the model with: ollama run hf.co/the-clanker-lover/steelman-14b-ada-v0.1-GGUF
The GGUF version fits in 12GB VRAM with Q4_K_M quantization.
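Once the model has been pulled with the command above, it can also be called from a script through Ollama's local HTTP API. The following is a minimal sketch using only the standard library; it assumes Ollama is listening on its default port 11434 and that the model is registered under the same `hf.co/...` name used in the `ollama run` command. The helper names are hypothetical.

```python
import json
import urllib.request

MODEL = "hf.co/the-clanker-lover/steelman-14b-ada-v0.1-GGUF"
OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default endpoint


def build_request(prompt: str, model: str = MODEL, url: str = OLLAMA_URL):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]


# Example call, with Ollama running locally:
# print(generate("Write an Ada procedure that prints the first 10 squares."))
```

Setting `stream` to false returns the whole completion as a single JSON object, which is easier to feed into a compile check than a token stream.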
Limitations
- Compilation ≠ correctness: 68.6% of outputs compile on the custom benchmark, but on HumanEval-Ada only 47.1% pass the tests even though 74.5% compile
- Error-fix capability is weak (5.1%) - don't expect it to debug Ada code
- SPARK contracts compile but aren't verified with gnatprove
- Synthetically generated training data - no human Ada developers wrote these examples
- 14B model size means it may miss things a larger model would catch
Resources
- Model: https://huggingface.co/the-clanker-lover/steelman-14b-ada-v0.1
- GGUF: https://huggingface.co/the-clanker-lover/steelman-14b-ada-v0.1-GGUF
- Dataset: https://huggingface.co/datasets/the-clanker-lover/steelman-sft-ada
📖 Read the full source: r/LocalLLaMA