Fine-tuning Phi-4-mini by training only LayerNorm parameters fails to improve performance

Experimental setup and methodology
The experiment tested fine-tuning Phi-4-mini-instruct (3.8B parameters, 32 layers) by training only the LayerNorm parameters, an approach the author calls BALLAST. Training ran on a Mac Studio (M3 Ultra, 256GB) using MLX via mlx_lm's built-in train() function, at 97% GPU utilization. A self-hosted W&B instance tracked the runs.
Important note: Phi-4-mini uses RMSNorm rather than full LayerNorm, so each norm layer has only a scale vector γ and no bias β. The author acknowledges that the published papers reporting positive results used models with both γ and β parameters, and that this difference likely matters more than initially realized.
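To make the γ/β distinction concrete, here is a minimal pure-Python sketch of the two normalizations. It is not the author's code or MLX's implementation; the function names, the eps value, and the toy input are illustrative assumptions.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Full LayerNorm: center, scale to unit variance, then apply gamma and beta."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: divide by the root mean square, then apply gamma only (no beta)."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]

# Toy activation vector (illustrative values, not taken from the model).
x = [1.0, 2.0, 3.0, 4.0]
gamma = [1.0] * 4
beta = [0.5] * 4

ln_out = layer_norm(x, gamma, beta)   # beta can shift every feature
rms_out = rms_norm(x, gamma)          # gamma can only rescale each feature
```

With RMSNorm, the trainable vector can rescale features but cannot shift them, which is one way to read the author's point that the missing β may matter.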
Benchmark results
Baseline scores for vanilla Phi-4-mini (no training):
- HumanEval pass@1: 0.646
- MBPP pass@1: 0.558
- MMLU acc: 0.667
- ARC-Challenge acc_norm: 0.595
- HellaSwag acc_norm: 0.728
- MedQA acc: 0.545
- GSM8K exact_match: 0.813
Experiment 1: Python domain
Trained on 10K files from The Stack with LR=5e-5 for 3 epochs:
- BALLAST (196K params): Loss 1.39, HumanEval 0.616 (-0.030), MBPP 0.526 (-0.032)
- LoRA-Match (180K params): Loss 1.30, HumanEval 0.634 (-0.012), MBPP 0.536 (-0.022)
- LoRA-Std (11.5M params): Loss 1.07, HumanEval 0.439 (-0.207), MBPP 0.372 (-0.186)
LoRA-Std showed classic overfitting: it reached the lowest training loss (1.07) but the worst benchmark scores, because its 11.5M parameters memorized the 10K files instead of learning generalizable patterns. A further BALLAST run at LR=1e-4 saw the loss drop to 1.31 and then climb back above 1.44 by iteration 2300.
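The deltas in parentheses are each run's score minus the vanilla baseline. A small sketch of that arithmetic, using only the numbers reported above (the dictionary layout and the function name are assumptions made for illustration):

```python
# Baseline and per-run scores as reported for the Python-domain experiment.
BASELINE = {"HumanEval": 0.646, "MBPP": 0.558}

RUNS = {
    "BALLAST": {"HumanEval": 0.616, "MBPP": 0.526},
    "LoRA-Match": {"HumanEval": 0.634, "MBPP": 0.536},
    "LoRA-Std": {"HumanEval": 0.439, "MBPP": 0.372},
}

def deltas(scores, baseline):
    """Return score minus baseline per task, rounded to three decimals."""
    return {task: round(score - baseline[task], 3) for task, score in scores.items()}

for name, scores in RUNS.items():
    print(name, deltas(scores, BASELINE))
```

Every delta is negative, so none of the three configurations beat the untrained model on either coding benchmark.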
Experiment 2: Medical raw text
Trained on 10K PubMed abstracts with LR=5e-5 for 3 epochs:
- BALLAST: MedQA 0.528 (-0.017)
- LoRA-Match: MedQA 0.546 (+0.001)
- LoRA-Std: MedQA 0.465 (-0.080)
The author notes a rookie mistake: training on raw PubMed abstracts with a next-token prediction objective doesn't help with MedQA, which tests clinical reasoning through multiple-choice vignettes.
Experiment 3: Medical instruction QA
The data format was fixed by switching to 10K MedMCQA questions, trained at LR=1e-5 for 3 epochs. Format: "Question: ... A) X B) Y C) Z D) W Answer: B"
- BALLAST: MedQA 0.538 (-0.007)
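A hedged sketch of how one question could be rendered into the format quoted above. The function name, its arguments, and the sample question are hypothetical; the author's actual preprocessing and the dataset's real field names are not given in the source.

```python
LETTERS = "ABCD"

def format_example(question, options, answer_index):
    """Render one multiple-choice item as a single training string."""
    if len(options) != len(LETTERS):
        raise ValueError("expected exactly four options")
    choices = " ".join(f"{letter}) {text}" for letter, text in zip(LETTERS, options))
    return f"Question: {question} {choices} Answer: {LETTERS[answer_index]}"

# Made-up item for illustration only; it is not taken from MedMCQA.
sample = format_example(
    "Which vitamin deficiency causes scurvy?",
    ["Vitamin A", "Vitamin C", "Vitamin D", "Vitamin K"],
    1,
)
print(sample)
```

Ending each string with the answer letter means the next-token objective is at least trained on the same kind of output that MedQA scores.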
Learning rate testing summary
- LR=1e-4 on Python: Overshot, loss diverged by iteration 2300
- LR=5e-5 on Python: Flat, slight degradation on benchmarks
- LR=5e-5 on Medical (raw text): Flat, slight degradation on MedQA
- LR=1e-5 on Medical (instruction QA): Flat, slight degradation on MedQA
Key findings
Training only the LayerNorm γ values didn't improve performance on any benchmark tested: not on Python, not on medical QA, and not at any of the learning rates tried. The author concludes that transformers already route information dynamically through attention, so there's no point in trying to use LayerNorm as an additional relational directionality layer. BALLAST trained only 196K parameters (about 0.005% of the model), compared with 11.5M for standard LoRA on the same model.
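The 196K figure can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes a hidden size of 3072 and two norm layers per transformer block, and it leaves out any final norm; none of these details are stated in the source, and the parameter names are hypothetical placeholders rather than names read from the checkpoint.

```python
HIDDEN_SIZE = 3072      # assumed hidden width of Phi-4-mini
NUM_LAYERS = 32         # layer count given in the setup
TOTAL_PARAMS = 3.8e9    # model size given in the setup

def norm_param_names(num_layers):
    """List hypothetical names for the two per-block norm weight vectors."""
    names = []
    for i in range(num_layers):
        names.append(f"layers.{i}.input_layernorm.weight")
        names.append(f"layers.{i}.post_attention_layernorm.weight")
    return names

# Each norm weight is one gamma vector of length HIDDEN_SIZE.
trainable = len(norm_param_names(NUM_LAYERS)) * HIDDEN_SIZE
fraction = trainable / TOTAL_PARAMS

print(trainable)            # 64 vectors * 3072 entries = 196608
print(f"{fraction:.4%}")    # share of the full model
```

Under these assumptions the count is 64 × 3072 = 196,608, which matches the reported 196K and comes to roughly 0.005% of 3.8B parameters.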
📖 Read the full source: r/LocalLLaMA