Steelman R5 14B Beats Claude Opus 4.6 in Ada Code Generation

Model and Training Details

Steelman R5 is a fine-tune of Qwen2.5-Coder-14B-Instruct optimized for Ada code generation. It was trained with 4-bit QLoRA via Unsloth and the TRL SFTTrainer on a dataset of 3,430 Ada/SPARK instruction pairs, every one of which compiles under gnatmake -gnat2022 -gnatwa.
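A compile gate like the one described above could be sketched as follows. This is a minimal illustration, not the actual pipeline: only the gnatmake flags come from the post, while the function names, the "output" and "unit_name" record fields, and the use of the exit code as the pass criterion are assumptions. In particular, -gnatwa only turns warnings on, and the post does not say whether warnings counted as failures.

```python
import subprocess
import tempfile
from pathlib import Path

# Flags quoted in the post; everything else here is an illustrative assumption.
GNAT_FLAGS = ["-gnat2022", "-gnatwa"]


def build_command(source_file: str) -> list[str]:
    """Return the gnatmake invocation for one Ada source file."""
    return ["gnatmake", *GNAT_FLAGS, source_file]


def compiles(ada_source: str, unit_name: str, timeout_s: int = 60) -> bool:
    """Write the source to a scratch directory and report whether gnatmake exits 0.

    unit_name is hypothetical bookkeeping: GNAT's default naming scheme expects
    the file name to match the unit (Hello -> hello.adb, Foo.Bar -> foo-bar.adb).
    """
    file_name = unit_name.lower().replace(".", "-") + ".adb"
    with tempfile.TemporaryDirectory() as workdir:
        (Path(workdir) / file_name).write_text(ada_source, encoding="utf-8")
        try:
            result = subprocess.run(
                build_command(file_name),
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except (FileNotFoundError, subprocess.TimeoutExpired):
            # gnatmake is missing or hung: treat the example as not passing.
            return False
        return result.returncode == 0


def keep_compiling(pairs: list[dict]) -> list[dict]:
    """Keep only instruction pairs whose 'output' field compiles."""
    return [p for p in pairs if compiles(p["output"], p["unit_name"])]
```

Running each example in its own temporary directory keeps the object and ALI files that gnatmake leaves behind from leaking between examples.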

Training configuration: LoRA rank 32, alpha 64, targeting the q/k/v/o/gate/up/down projections. Each round retrains from the base model on the accumulated dataset, because continuing from the previous adapter caused catastrophic forgetting at R2. Each round ran for 1 epoch at learning rate 2e-5 on a constant schedule and took about 49 minutes on a rented H100. There were five rounds in total (R1–R5), with R2 discarded.
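For reference, the stated hyperparameters can be written down as plain Python settings. This is only a sketch: the field names mirror those commonly used by peft's LoraConfig and transformers' TrainingArguments, the expanded module names are the usual Qwen2.5 projection names, and the Hugging Face model id is an assumption. Batch size, sequence length, and dropout are not given in the post and are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    r: int = 32            # LoRA rank, from the post
    lora_alpha: int = 64   # LoRA alpha, from the post
    # q/k/v/o/gate/up/down projections, spelled out (assumed module names).
    target_modules: tuple[str, ...] = (
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    )

    @property
    def scaling(self) -> float:
        """Standard LoRA scaling factor applied to the adapter update: alpha / r."""
        return self.lora_alpha / self.r


@dataclass(frozen=True)
class TrainSettings:
    base_model: str = "Qwen/Qwen2.5-Coder-14B-Instruct"  # assumed hub id
    load_in_4bit: bool = True                            # QLoRA 4-bit
    num_train_epochs: int = 1
    learning_rate: float = 2e-5
    lr_scheduler_type: str = "constant"


print(LoraSettings().scaling)  # alpha / r with the values from the post
```

With rank 32 and alpha 64 the adapter update is scaled by 2, assuming the standard (non rank-stabilized) LoRA scaling.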

Benchmark Results

Custom Ada Compilation Benchmark (1,000 prompts, first-attempt clean compile):

Steelman R5 (14B): 68.6% compile rate
Claude Opus 4.6: 42.1% compile rate
Claude Sonnet 4.6: 37.2% compile rate
Qwen2.5-Coder-14B (base, untuned): ~35% compile rate
Claude Sonnet 4: 27.5% compile rate

MultiPL-E HumanEval-Ada (157 problems, pass@1):

Steelman R5: 47.1% pass@1, 74.5% compile rate
Qwen2.5-Coder-14B (base): 34.4% pass@1, 51.0% compile rate

These are the first published Ada pass@1 results on HumanEval for any open model.
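The HumanEval-Ada percentages can be turned back into approximate problem counts, which also gives the share of compiling solutions that pass the tests. The sketch below assumes one sample per problem and that the quoted percentages are rounded from whole counts out of 157; the counts are inferred, not reported.

```python
PROBLEMS = 157  # size of MultiPL-E HumanEval-Ada as quoted in the post

# Percentages quoted in the post: (pass@1, compile rate).
REPORTED = {
    "Steelman R5": (47.1, 74.5),
    "Qwen2.5-Coder-14B (base)": (34.4, 51.0),
}


def implied_count(percent: float, total: int = PROBLEMS) -> int:
    """Nearest whole number of problems consistent with a reported percentage."""
    return round(percent / 100 * total)


def summarize(pass_pct: float, compile_pct: float) -> dict:
    """Inferred counts plus the pass rate among solutions that compile."""
    passed = implied_count(pass_pct)
    compiled = implied_count(compile_pct)
    return {
        "passed": passed,
        "compiled": compiled,
        "pass_given_compile": round(100 * passed / compiled, 1),
    }


for model, (pass_pct, compile_pct) in REPORTED.items():
    print(model, summarize(pass_pct, compile_pct))
```

Under these assumptions, Steelman R5 passes about 74 of 157 problems and compiles about 117, so roughly 63% of its compiling solutions are also correct.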

Usage and Availability

Run the model with: ollama run hf.co/the-clanker-lover/steelman-14b-ada-v0.1-GGUF

The GGUF version fits in 12GB VRAM with Q4_K_M quantization.
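Once the model has been pulled with the command above, it can also be called from a script through Ollama's local HTTP API. This is a minimal sketch that assumes Ollama is serving on its default port 11434; the helper names are illustrative.

```python
import json
import urllib.request

MODEL = "hf.co/the-clanker-lover/steelman-14b-ada-v0.1-GGUF"
OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_payload(prompt: str, model: str = MODEL) -> dict:
    """Request body for a single non-streaming completion."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt: str, url: str = OLLAMA_URL, timeout_s: int = 300) -> str:
    """POST the prompt to the local Ollama server and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

With the server running, generate("Write an Ada procedure that prints the first ten squares.") returns the model's reply as a string. Setting "stream" to False makes Ollama return one JSON object instead of a stream of chunks.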

Limitations

Compilation ≠ correctness: on HumanEval-Ada, 74.5% of outputs compile but only 47.1% pass the tests (the 68.6% compile rate comes from the custom benchmark, which does not check output)
Error-fix capability is weak (5.1%) - don't expect it to debug Ada code
SPARK contracts compile but aren't verified with gnatprove
Synthetically generated training data - no human Ada developers wrote these examples
At 14B parameters, the model may miss things a larger model would catch