Steelman R5: Fine-tuned 14B Model Outperforms Claude Opus on Ada Code Generation

OpenClawRadar · Published: March 13, 2026

Model and Training Details

The Steelman R5 model is a fine-tuned version of Qwen2.5-Coder-14B-Instruct optimized for Ada code generation. It was trained with 4-bit QLoRA via Unsloth and the TRL SFTTrainer on a dataset of 3,430 Ada/SPARK instruction pairs, every one of which compiles with gnatmake -gnat2022 -gnatwa.
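A compile gate of the kind described above could be sketched as follows. This is a minimal illustration, not the author's pipeline: the function names, the "response" field, and the unit_name parameter are hypothetical, and "passes" is assumed to mean that gnatmake exits with status 0. The runner is injectable so the logic can be exercised without a GNAT toolchain installed.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable

GNAT_FLAGS = ["-gnat2022", "-gnatwa"]  # flags quoted in the article


def gnatmake_command(source: Path) -> list[str]:
    """Build the gnatmake invocation for one Ada source file."""
    return ["gnatmake", *GNAT_FLAGS, source.name]


def compiles_cleanly(
    code: str,
    unit_name: str = "main",  # hypothetical: file name must match the Ada unit
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Assumption: an example 'passes' when gnatmake exits with status 0."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / f"{unit_name}.adb"
        source.write_text(code)
        result = run(gnatmake_command(source), cwd=tmp,
                     capture_output=True, text=True)
    return result.returncode == 0


def keep_compiling_pairs(pairs, check=compiles_cleanly):
    """Keep only instruction pairs whose 'response' (hypothetical key) compiles."""
    return [pair for pair in pairs if check(pair["response"])]
```

With GNAT on the PATH, keep_compiling_pairs(dataset) would drop every pair whose answer fails to build.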

Training configuration: LoRA rank 32, alpha 64, targeting the q/k/v/o/gate/up/down projections. Each round retrained the model from the base checkpoint on the accumulated dataset, because continuing from the previous adapter caused catastrophic forgetting at R2. Training ran for 1 epoch at a learning rate of 2e-5 with a constant schedule, taking about 49 minutes per round on a rented H100. There were five rounds in total (R1–R5), with R2 discarded.
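For reference, the reported hyperparameters can be collected in one place. The sketch below is illustrative only: LoraSettings is a hypothetical container, and the module names (q_proj and so on) are assumed from the usual Hugging Face naming for Qwen2-style models rather than taken from the source. The scaling property is the standard LoRA factor, alpha divided by rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hyperparameters as reported in the article (container name is hypothetical)."""
    rank: int = 32
    alpha: int = 64
    # Assumed module names for the q/k/v/o/gate/up/down projections.
    target_modules: tuple[str, ...] = (
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    )
    learning_rate: float = 2e-5
    epochs: int = 1
    lr_schedule: str = "constant"

    @property
    def scaling(self) -> float:
        """Standard LoRA scaling applied to the adapter update: alpha / rank."""
        return self.alpha / self.rank


settings = LoraSettings()
print(settings.scaling)  # 2.0
```

With rank 32 and alpha 64, the adapter update is scaled by a factor of 2.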

Benchmark Results

Custom Ada Compilation Benchmark (1,000 prompts, first-attempt clean compile):

  • Steelman R5 (14B): 68.6% compile rate
  • Claude Opus 4.6: 42.1% compile rate
  • Claude Sonnet 4.6: 37.2% compile rate
  • Qwen2.5-Coder-14B (base, untuned): ~35% compile rate
  • Claude Sonnet 4: 27.5% compile rate
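To put the percentages in concrete terms, the short sketch below converts each reported rate into the implied number of clean compiles out of 1,000 prompts and computes the gap in percentage points. The counts are derived here from the published rates, not reported by the source, and the untuned base model is left out because its figure is only approximate (~35%).

```python
PROMPTS = 1000

# Compile rates (percent) as reported in the article.
COMPILE_RATES = {
    "Steelman R5 (14B)": 68.6,
    "Claude Opus 4.6": 42.1,
    "Claude Sonnet 4.6": 37.2,
    "Claude Sonnet 4": 27.5,
}


def implied_clean_compiles(rate_percent: float, prompts: int = PROMPTS) -> int:
    """Number of prompts that compiled on the first attempt, inferred from the rate."""
    return round(rate_percent * prompts / 100)


def gap_in_points(model: str, baseline: str, rates=COMPILE_RATES) -> float:
    """Difference between two compile rates, in percentage points."""
    return round(rates[model] - rates[baseline], 1)


print(implied_clean_compiles(COMPILE_RATES["Steelman R5 (14B)"]))  # 686
print(gap_in_points("Steelman R5 (14B)", "Claude Opus 4.6"))       # 26.5
```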

MultiPL-E HumanEval-Ada (157 problems, pass@1):

  • Steelman R5: 47.1% pass@1, 74.5% compile rate
  • Qwen2.5-Coder-14B (base): 34.4% pass@1, 51.0% compile rate
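The HumanEval-Ada percentages map onto whole problem counts out of 157, which makes it possible to ask how often a solution that compiles also passes. The sketch below infers those counts by rounding; the counts and the conditional rate are derived here and are not figures reported by the source.

```python
PROBLEMS = 157


def implied_count(percent: float, total: int = PROBLEMS) -> int:
    """Whole number of problems implied by a reported percentage."""
    return round(percent * total / 100)


def pass_rate_given_compile(pass_percent: float, compile_percent: float) -> float:
    """Share of compiling solutions that also pass the tests, in percent."""
    passed = implied_count(pass_percent)
    compiled = implied_count(compile_percent)
    return round(100 * passed / compiled, 1)


# Steelman R5: 47.1% pass@1, 74.5% compile rate
print(implied_count(47.1), implied_count(74.5))   # 74 117
print(pass_rate_given_compile(47.1, 74.5))        # 63.2

# Qwen2.5-Coder-14B (base): 34.4% pass@1, 51.0% compile rate
print(implied_count(34.4), implied_count(51.0))   # 54 80
print(pass_rate_given_compile(34.4, 51.0))        # 67.5
```

Under this reading, the gain from fine-tuning comes mainly from more solutions compiling, not from a higher pass rate among the solutions that do compile.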

According to the author, these are the first published Ada pass@1 results on HumanEval for any open model.

Usage and Availability

Run the model with: ollama run hf.co/the-clanker-lover/steelman-14b-ada-v0.1-GGUF
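Once the model has been pulled with the command above, it can also be called from a script through Ollama's local HTTP API. The sketch below is a minimal example using only the standard library. It assumes an Ollama server on the default port 11434 and that the model is addressable under the same name used with ollama run. The helper names are hypothetical.

```python
import json
import urllib.request

MODEL = "hf.co/the-clanker-lover/steelman-14b-ada-v0.1-GGUF"
OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_request(prompt: str, model: str = MODEL,
                  url: str = OLLAMA_URL) -> urllib.request.Request:
    """Prepare a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

For example, generate("Write an Ada procedure that prints the first ten squares.") would return the model's answer as a string, which could then be passed to a compile check.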

The GGUF version fits in 12GB VRAM with Q4_K_M quantization.

Limitations

  • Compilation ≠ correctness: 68.6% of outputs compile on the custom benchmark, but only 47.1% of HumanEval-Ada problems are solved correctly (74.5% compile there)
  • Error-fix capability is weak (5.1%), so don't expect it to debug Ada code
  • SPARK contracts compile but aren't verified with gnatprove
  • Training data is synthetically generated; no human Ada developers wrote these examples
  • 14B model size means it may miss things a larger model would catch

Resources

  • Model: https://huggingface.co/the-clanker-lover/steelman-14b-ada-v0.1
  • GGUF: https://huggingface.co/the-clanker-lover/steelman-14b-ada-v0.1-GGUF
  • Dataset: https://huggingface.co/datasets/the-clanker-lover/steelman-sft-ada

📖 Read the full source: r/LocalLLaMA
