When RLVR Helps Small Fine-Tuned Models: A 12-Dataset Analysis

✍️ OpenClawRadar | 📅 Published: February 27, 2026

A recent experiment tested whether adding a reinforcement learning with verifiable rewards (RLVR) stage on top of supervised fine-tuning (SFT) measurably improves small language models (1.7B parameters). The team ran controlled comparisons across 12 datasets to determine when the extra stage helps and when it doesn't.

Key Findings

The results split cleanly by task type:

Text generation tasks (QA, documentation, PII redaction): +2.0 percentage points on average. Every dataset in this category improved.
Structured tasks (classification, function calling): -0.7 percentage points on average. Two datasets in this category regressed.

Why This Pattern Emerges

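The researchers explain that once a fine-tuned model already gets most structured outputs correct, GRPO (Group Relative Policy Optimization) produces near-zero gradients. GRPO scores each sampled completion relative to the other completions drawn for the same prompt, so when nearly all of them earn the same reward, the relative advantages shrink toward zero and the reinforcement learning stage has no learning signal left to work with.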
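The vanishing signal can be illustrated with the group-relative advantage that GRPO is commonly described as using: each reward is normalized by the mean and standard deviation of its own sampling group. This is a minimal sketch, not the team's training code; the function name `group_relative_advantages`, the `eps` constant, and the example rewards are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar reward per completion sampled for a
    single prompt. `eps` (an illustrative choice) avoids dividing by
    zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group where the SFT model already gets every sample right:
# all advantages are zero, so the policy gradient contribution vanishes.
saturated = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Hypothetical group with mixed outcomes: correct samples receive positive
# advantages and incorrect ones negative, so there is something to learn.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

In the saturated group every advantage is exactly zero, which matches the explanation above: a model that is already mostly correct on structured outputs gives GRPO little to push against.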

For generative tasks, the output space is large enough that RL continues to find improvements that SFT misses, particularly when the reward measures semantic correctness rather than exact string matching.
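To make the contrast between reward styles concrete, here is a small sketch comparing an exact-match reward with a graded one. The source does not say how semantic correctness was scored, so token-overlap F1 is used purely as a simple stand-in (it measures word overlap, not meaning); both function names and the example strings are hypothetical.

```python
from collections import Counter


def exact_match_reward(prediction, reference):
    """Return 1.0 only when the two strings are identical after trimming."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def token_f1_reward(prediction, reference):
    """Graded reward based on token overlap (a rough proxy, not true semantics)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "the cache is cleared on restart"
reworded = "on restart the cache is cleared"

strict = exact_match_reward(reworded, reference)
graded = token_f1_reward(reworded, reference)
```

Here the reworded answer earns nothing under exact matching but full credit under the graded reward. A graded reward of this kind also tends to vary across sampled completions, which is what keeps the group-relative signal from collapsing.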

Practical Decision Rule

The study provides a simple guideline for developers:

Classification or strict function calling → Use SFT only
QA, documentation, extraction tasks → Add RLVR on top of SFT
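The guideline above can be written down as a small lookup. This is only an illustrative encoding of the article's rule: the task labels, the set names, and `recommend_training` are assumptions, and any task type not listed in the article is rejected rather than guessed at.

```python
# Task categories as named in the article; the string labels are illustrative.
STRUCTURED_TASKS = {"classification", "function_calling"}
GENERATIVE_TASKS = {"qa", "documentation", "extraction", "pii_redaction"}


def recommend_training(task_type):
    """Map a task type to the training recipe suggested by the guideline."""
    key = task_type.strip().lower().replace(" ", "_")
    if key in STRUCTURED_TASKS:
        return "SFT only"
    if key in GENERATIVE_TASKS:
        return "SFT + RLVR"
    # The article gives no rule for other task types, so do not guess.
    raise ValueError(f"no guideline for task type: {task_type!r}")


recipe_for_classification = recommend_training("classification")
recipe_for_qa = recommend_training("QA")
```

Raising on unknown task types is a deliberate choice: the reported results only cover the categories listed, so extrapolating to anything else would go beyond what the study supports.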

The methodology, all 12 datasets tested, and raw numbers are available in the full analysis.

📖 Read the full source: r/LocalLLaMA
