Phi-4-mini Fine-tuning: LayerNorm-Only Training Fails to Improve Benchmarks

Experimental setup and methodology

The experiment fine-tuned Phi-4-mini-instruct (3.8B parameters, 32 layers) by training only its normalization-layer parameters, an approach the author calls BALLAST. Training ran on a Mac Studio (M3 Ultra, 256GB) using MLX via mlx_lm's built-in train() function at 97% GPU utilization, with a self-hosted W&B instance for tracking.
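As a rough illustration of what "training only the norm parameters" means, the sketch below splits a list of parameter names into trainable and frozen groups. The parameter names are assumptions modelled on common Phi-style checkpoints, not names taken from the post, and the helper functions are hypothetical; the author's actual selection code is not shown in the source.

```python
# Hypothetical helper: decide which parameters stay trainable.
# The name suffixes below are assumed, not confirmed by the source.
NORM_SUFFIXES = (
    "input_layernorm.weight",
    "post_attention_layernorm.weight",
)


def is_norm_gain(name: str) -> bool:
    """Return True if the name looks like a per-layer norm gain vector."""
    return name.endswith(NORM_SUFFIXES)


def split_parameters(names):
    """Split parameter names into (trainable, frozen) lists."""
    trainable = [n for n in names if is_norm_gain(n)]
    frozen = [n for n in names if not is_norm_gain(n)]
    return trainable, frozen


# Illustrative names for a single transformer block (assumed layout).
names = [
    "model.layers.0.input_layernorm.weight",
    "model.layers.0.self_attn.qkv_proj.weight",
    "model.layers.0.post_attention_layernorm.weight",
    "model.layers.0.mlp.down_proj.weight",
]
trainable, frozen = split_parameters(names)
print("trainable:", trainable)
print("frozen:", frozen)
```

In a real run the same predicate would be applied to the framework's own parameter tree before building the optimizer, so that everything outside the norm gains stays frozen.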

Important note: Phi-4-mini uses RMSNorm rather than full LayerNorm, so each norm has only a gain vector γ and no bias β. The author acknowledges that the published papers reporting positive results used models with both γ and β, a difference that likely matters more than initially realized.
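To make the γ-versus-β distinction concrete, here is a minimal pure-Python sketch of the two normalizations using their standard textbook definitions. The eps value and the example vector are illustrative assumptions; this is not the model's actual implementation.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Full LayerNorm: centre, scale to unit variance, then apply gain and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: no mean subtraction and no bias, only a per-feature gain."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]


x = [1.0, 2.0, 3.0, 4.0]      # illustrative activation vector
ones = [1.0] * len(x)
zeros = [0.0] * len(x)

ln_out = layer_norm(x, ones, zeros)   # zero-mean output; beta could shift it
rms_out = rms_norm(x, ones)           # can only be rescaled per feature
print(ln_out)
print(rms_out)
```

With RMSNorm, the only trainable knob is a per-feature rescaling; there is no additive term that could shift activations, which is the capacity the β vector provides in full LayerNorm.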

Benchmark results

Baseline scores for vanilla Phi-4-mini (no training):

HumanEval pass@1: 0.646
MBPP pass@1: 0.558
MMLU acc: 0.667
ARC-Challenge acc_norm: 0.595
HellaSwag acc_norm: 0.728
MedQA acc: 0.545
GSM8K exact_match: 0.813

Experiment 1: Python domain

Trained on 10K files from The Stack with LR=5e-5 for 3 epochs:

BALLAST (196K params): Loss 1.39, HumanEval 0.616 (-0.030), MBPP 0.526 (-0.032)
LoRA-Match (180K params): Loss 1.30, HumanEval 0.634 (-0.012), MBPP 0.536 (-0.022)
LoRA-Std (11.5M params): Loss 1.07, HumanEval 0.439 (-0.207), MBPP 0.372 (-0.186)
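The deltas in parentheses are differences against the vanilla baseline scores. A small sketch that recomputes them from the numbers reported above (the dictionary layout and function name are illustrative choices, not the author's tooling):

```python
# Baseline and post-training scores as reported in the post.
BASELINE = {"HumanEval": 0.646, "MBPP": 0.558}

PYTHON_RUNS = {
    "BALLAST": {"HumanEval": 0.616, "MBPP": 0.526},
    "LoRA-Match": {"HumanEval": 0.634, "MBPP": 0.536},
    "LoRA-Std": {"HumanEval": 0.439, "MBPP": 0.372},
}


def deltas(scores, baseline):
    """Return score minus baseline per benchmark, rounded to 3 decimals."""
    return {name: round(scores[name] - baseline[name], 3) for name in scores}


for run, scores in PYTHON_RUNS.items():
    print(run, deltas(scores, BASELINE))
```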

LoRA-Std showed classic overfitting: it reached the lowest training loss (1.07) but the worst benchmark scores, which suggests its 11.5M parameters memorized the 10K files instead of learning generalizable patterns. An additional BALLAST run at LR=1e-4 saw loss drop to 1.31, then climb back above 1.44 by iteration 2300.

Experiment 2: Medical raw text

Trained on 10K PubMed abstracts with LR=5e-5 for 3 epochs:

BALLAST: MedQA 0.528 (-0.017)
LoRA-Match: MedQA 0.546 (+0.001)
LoRA-Std: MedQA 0.465 (-0.080)

The author calls this a rookie mistake: next-token training on raw PubMed abstracts does not help with MedQA, which tests clinical reasoning through multiple-choice vignettes.

Experiment 3: Medical instruction QA

The data format was fixed by switching to 10K MedMCQA questions, trained with LR=1e-5 for 3 epochs. Format: "Question: ... A) X B) Y C) Z D) W Answer: B"

BALLAST: MedQA 0.538 (-0.007)
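The instruction format quoted above could be produced by a formatter along these lines. This is a sketch only: the function name and the sample question are hypothetical, and the exact whitespace between fields is an assumption, since the post gives just the one-line template.

```python
def format_mcq(question, options, answer_index):
    """Render one multiple-choice item as a single training string.

    Assumes exactly four options and single spaces between fields.
    """
    letters = "ABCD"
    if len(options) != len(letters):
        raise ValueError("expected exactly four options")
    if not 0 <= answer_index < len(letters):
        raise ValueError("answer_index must be between 0 and 3")
    choices = " ".join(f"{letter}) {text}" for letter, text in zip(letters, options))
    return f"Question: {question} {choices} Answer: {letters[answer_index]}"


# Hypothetical item, for illustration only (not taken from MedMCQA).
example = format_mcq(
    "Which vitamin deficiency causes scurvy?",
    ["Vitamin A", "Vitamin C", "Vitamin D", "Vitamin K"],
    1,
)
print(example)
```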

Learning rate testing summary

LR=1e-4 on Python: Overshot; loss fell to 1.31, then rose above 1.44 by iteration 2300
LR=5e-5 on Python: Flat, slight degradation on benchmarks
LR=5e-5 on Medical (raw text): Flat, slight degradation on MedQA
LR=1e-5 on Medical (instruction QA): Flat, slight degradation on MedQA

Key findings

Training only the norm γ values did not improve performance on any benchmark tested: not on Python, not on medical QA, and not at any of the learning rates tried. The author concludes that transformers already route information dynamically through attention, so there is no point in using LayerNorm as an additional relational directionality layer. BALLAST trained only 196K parameters (about 0.005% of the model), compared with 11.5M for LoRA-Std.
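The 196K figure is consistent with a back-of-the-envelope count of per-layer norm gains. The sketch below assumes a hidden size of 3072 and two norms per transformer block; neither number appears in the post, so treat both as assumptions.

```python
# Assumed architecture details (not stated in the post).
HIDDEN_SIZE = 3072        # assumption: width of each gain vector
NORMS_PER_LAYER = 2       # assumption: one norm before attention, one before the MLP

# Figures stated in the post.
NUM_LAYERS = 32
TOTAL_PARAMS = 3.8e9

# One gain value per hidden dimension, per norm, per layer.
per_layer_gains = NUM_LAYERS * NORMS_PER_LAYER * HIDDEN_SIZE
fraction_pct = 100 * per_layer_gains / TOTAL_PARAMS

print(f"trainable gains: {per_layer_gains:,}")
print(f"share of model:  {fraction_pct:.4f}%")
```

Under these assumptions the count is 196,608, about 0.005% of 3.8B parameters. Including a final norm before the output head would add another 3072 values, so the reported figure suggests only the per-layer norms were counted.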

📖 Read the full source: r/LocalLLaMA