Fine-tuning Phi-4-mini by training only LayerNorm parameters fails to improve performance

✍️ OpenClawRadar | 📅 Published: April 21, 2026 | 🔗 Source

Experimental setup and methodology

The experiment tested whether Phi-4-mini-instruct (3.8B parameters, 32 layers) could be fine-tuned by training only its normalization parameters, an approach the author calls BALLAST. Training ran on a Mac Studio M3 Ultra with 256GB of memory, using MLX through mlx_lm's built-in train() function at a reported 97% GPU utilization. Runs were tracked with a self-hosted W&B instance.

Important note: Phi-4-mini uses RMSNorm rather than full LayerNorm, so each norm layer has only a scale vector γ and no bias β. The author acknowledges that the published papers reporting positive results used models with both γ and β, and that this difference likely matters more than initially realized.
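To make the γ-versus-β distinction concrete, here is a minimal pure-Python sketch of the two normalizations. It follows the standard textbook definitions, not Phi-4-mini's actual implementation; the function names and the eps value are illustrative assumptions.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Full LayerNorm: center, scale to unit variance, then apply gamma and beta."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: divide by the root mean square, then apply gamma. No centering, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
ones = [1.0] * len(x)
shifted = layer_norm(x, ones, beta=[0.5] * len(x))  # beta can shift every feature
scaled = rms_norm(x, gamma=[2.0] * len(x))          # gamma can only rescale features
```

With RMSNorm, the only trainable knob per feature is a multiplicative scale, so a zero input always maps to zero. LayerNorm's β can additionally shift each feature regardless of the input.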

Benchmark results

Baseline scores for vanilla Phi-4-mini (no training):

  • HumanEval pass@1: 0.646
  • MBPP pass@1: 0.558
  • MMLU acc: 0.667
  • ARC-Challenge acc_norm: 0.595
  • HellaSwag acc_norm: 0.728
  • MedQA acc: 0.545
  • GSM8K exact_match: 0.813

Experiment 1: Python domain

Trained on 10K files from The Stack with LR=5e-5 for 3 epochs:

  • BALLAST (196K params): Loss 1.39, HumanEval 0.616 (-0.030), MBPP 0.526 (-0.032)
  • LoRA-Match (180K params): Loss 1.30, HumanEval 0.634 (-0.012), MBPP 0.536 (-0.022)
  • LoRA-Std (11.5M params): Loss 1.07, HumanEval 0.439 (-0.207), MBPP 0.372 (-0.186)

LoRA-Std showed classic overfitting: it reached the lowest loss (1.07) but the largest benchmark drops, suggesting its 11.5M parameters memorized the 10K files instead of learning generalizable patterns. An additional BALLAST run at LR=1e-4 saw the loss drop to 1.31 and then climb back above 1.44 by iteration 2300.
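The parenthesized deltas in the results are each run's score minus the vanilla baseline. A small sketch that recomputes them from the numbers reported in this article (the `deltas` helper is a hypothetical name, not part of any evaluation harness):

```python
# Scores copied from the baseline list and the Experiment 1 results in this article.
baseline = {"HumanEval": 0.646, "MBPP": 0.558}
runs = {
    "BALLAST": {"HumanEval": 0.616, "MBPP": 0.526},
    "LoRA-Match": {"HumanEval": 0.634, "MBPP": 0.536},
    "LoRA-Std": {"HumanEval": 0.439, "MBPP": 0.372},
}

def deltas(scores, reference):
    """Return score minus reference for each benchmark, rounded to 3 decimals."""
    return {name: round(scores[name] - reference[name], 3) for name in scores}

for run, scores in runs.items():
    print(run, deltas(scores, baseline))
```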

Experiment 2: Medical raw text

Trained on 10K PubMed abstracts with LR=5e-5 for 3 epochs:

  • BALLAST: MedQA 0.528 (-0.017)
  • LoRA-Match: MedQA 0.546 (+0.001)
  • LoRA-Std: MedQA 0.465 (-0.080)

The author calls this a rookie mistake: next-token prediction on raw PubMed abstracts doesn't help with MedQA, which tests clinical reasoning through multiple-choice vignettes.

Experiment 3: Medical instruction QA

The data format was fixed by training on 10K MedMCQA questions with LR=1e-5 for 3 epochs. Format: "Question: ... A) X B) Y C) Z D) W Answer: B"

  • BALLAST: MedQA 0.538 (-0.007)
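As an illustration of the instruction format quoted above, the sketch below renders one multiple-choice item into that layout. The `format_mcq` helper and the example question are hypothetical. The source does not show the author's preprocessing code or which dataset fields were used.

```python
def format_mcq(question, options, answer_index):
    """Render one multiple-choice item in the 'Question: ... Answer: X' layout."""
    letters = "ABCD"
    if len(options) != len(letters):
        raise ValueError("expected exactly four options")
    choices = " ".join(f"{letter}) {text}" for letter, text in zip(letters, options))
    return f"Question: {question} {choices} Answer: {letters[answer_index]}"

# Hypothetical example item, not taken from MedMCQA.
example = format_mcq(
    "Which vitamin deficiency causes scurvy?",
    ["Vitamin A", "Vitamin C", "Vitamin D", "Vitamin K"],
    answer_index=1,
)
print(example)
```

Ending each training string with the answer letter puts the supervised target in the same position MedQA-style evaluation scores, which is the mismatch the raw-abstract run in Experiment 2 did not address.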

Learning rate testing summary

  • LR=1e-4 on Python: Overshot; loss fell to 1.31, then rose above 1.44 by iteration 2300
  • LR=5e-5 on Python: Flat, slight degradation on benchmarks
  • LR=5e-5 on Medical (raw text): Flat, slight degradation on MedQA
  • LR=1e-5 on Medical (instruction QA): Flat, slight degradation on MedQA

Key findings

Training only the norm γ values (RMSNorm in Phi-4-mini) did not improve performance on any benchmark tested: not on Python, not on medical QA, and not at any of the three learning rates tried. The author concludes that transformers already route information dynamically through attention, so there's no point in trying to use LayerNorm as an additional relational directionality layer. BALLAST trained only 196K parameters (about 0.005% of the model), compared with 11.5M for the standard LoRA configuration.
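The 196K figure can be sanity-checked with simple arithmetic. The sketch below assumes a hidden size of 3072 and two norm layers per transformer block, with the final output norm excluded. None of these details are stated in the source, so treat them as assumptions that happen to reproduce the reported count.

```python
NUM_LAYERS = 32        # from the article
NORMS_PER_LAYER = 2    # assumption: one norm before attention, one before the MLP
HIDDEN_SIZE = 3072     # assumption: not stated in the article
TOTAL_PARAMS = 3.8e9   # from the article (3.8B)

# Each RMSNorm holds one gamma value per hidden dimension and no bias.
norm_params = NUM_LAYERS * NORMS_PER_LAYER * HIDDEN_SIZE
fraction_percent = 100 * norm_params / TOTAL_PARAMS

print(norm_params)                 # 196608
print(f"{fraction_percent:.4f}%")  # 0.0052%
```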

📖 Read the full source: r/LocalLLaMA
