Unsloth and NVIDIA Collaborate to Speed Up LLM Training by ~25%

✍️ OpenClawRadar | 📅 Published: May 7, 2026

Unsloth's collaboration with NVIDIA yields a ~25% training speedup with no accuracy loss through three optimizations: caching packed-sequence metadata, double-buffered async gradient checkpointing, and MoE routing improvements. After an Unsloth update, these are enabled automatically on RTX laptops, data center GPUs, and DGX Spark.

Caching Packed-Sequence Metadata

Packed training concatenates short examples to avoid padding waste. Each transformer layer previously rebuilt the same sequence metadata (lengths, cu_seqlens, max_seqlen, mask structure) from scratch, causing device-host synchronization overhead. By caching the metadata once per batch and reusing it across layers, Unsloth reduces repeated work.
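
The idea of building the metadata once per batch can be illustrated with a small framework-free sketch. This is not Unsloth's implementation: `PackedMeta`, `MetaBuilder`, and `run_layers` are hypothetical names, plain tuples stand in for GPU tensors, and the mask structure is left out. It only shows that reusing one metadata object across layers replaces one build per layer with one build per batch.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass(frozen=True)
class PackedMeta:
    """Hypothetical container for per-batch packed-sequence metadata."""
    seq_lens: tuple     # length of each example packed into the batch
    cu_seqlens: tuple   # cumulative sequence boundaries, starting at 0
    max_seqlen: int     # longest example in the pack


class MetaBuilder:
    """Builds metadata and counts how many times it had to do so."""

    def __init__(self):
        self.builds = 0

    def build(self, seq_lens):
        self.builds += 1
        lens = tuple(seq_lens)
        return PackedMeta(lens, (0, *accumulate(lens)), max(lens))


def run_layers(builder, seq_lens, num_layers, reuse):
    """Walk through num_layers (>= 1) layers, optionally reusing one metadata object."""
    meta = builder.build(seq_lens) if reuse else None
    for _ in range(num_layers):
        layer_meta = meta if reuse else builder.build(seq_lens)
        # A real attention layer would consume layer_meta here (assumption).
    return layer_meta


cached = MetaBuilder()
meta = run_layers(cached, [3, 5, 2], num_layers=16, reuse=True)

naive = MetaBuilder()
run_layers(naive, [3, 5, 2], num_layers=16, reuse=False)

print(meta.cu_seqlens, meta.max_seqlen)   # boundaries and longest sequence
print(cached.builds, naive.builds)        # builds with and without reuse
```

In this toy version a build is cheap. In the setting the article describes, each rebuild also carries device-host synchronization, which is why removing the per-layer repeats matters.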

Benchmarks on Qwen3-14B QLoRA SFT show:

  • Forward pass: 43.3% faster
  • Backward pass: 5.8% faster
  • Overall per batch: 14.3% faster

A microbenchmark on NVIDIA Blackwell GPUs measured the dominant mask-construction cost at ~13.7 ms per packed batch. For Llama-3.2-1B (16 layers), this translates to ~199 ms saved per step (11.5% lower); for Qwen3-0.6B (28 layers), ~319 ms saved (14.8% lower).

Double-Buffered Async Gradient Checkpointing

Double-buffered async gradient checkpointing overlaps recomputation with ongoing computation rather than running the two back to back. The reported gain is an 8% speedup with no impact on accuracy.

MoE Routing: argsort + bincount

For MoE models, using torch.argsort and torch.bincount instead of custom kernels speeds up gpt-oss training by 15%.
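
As a rough illustration of how these two calls can group tokens by expert, here is a minimal sketch assuming top-1 routing. The function name `group_tokens_by_expert` and the toy inputs are hypothetical, and this is not the code Unsloth ships. `torch.argsort` gives the order that places each expert's tokens next to each other, and `torch.bincount` gives how many tokens each expert receives.

```python
import torch


def group_tokens_by_expert(expert_ids: torch.Tensor, num_experts: int):
    """Group token indices by assigned expert (top-1 routing assumed).

    Returns:
        order:   token indices sorted so each expert's tokens are contiguous
        counts:  number of tokens routed to each expert
        offsets: start position of each expert's slice within `order`
    """
    # A stable sort keeps tokens of the same expert in their original order.
    order = torch.argsort(expert_ids, stable=True)
    # minlength ensures experts that received no tokens still get a 0 entry.
    counts = torch.bincount(expert_ids, minlength=num_experts)
    # Exclusive prefix sum: where each expert's block begins.
    offsets = torch.cumsum(counts, dim=0) - counts
    return order, counts, offsets


# Toy example: 6 tokens routed among 4 experts (expert 3 gets none).
expert_ids = torch.tensor([2, 0, 1, 0, 2, 2])
order, counts, offsets = group_tokens_by_expert(expert_ids, num_experts=4)

print(order.tolist())    # [1, 3, 2, 0, 4, 5]
print(counts.tolist())   # [2, 1, 3, 0]
print(offsets.tolist())  # [0, 2, 3, 6]
```

With `order`, `counts`, and `offsets` in hand, expert `e` can process the contiguous slice `order[offsets[e] : offsets[e] + counts[e]]`, and the results can be scattered back to the original token positions.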

All optimizations are auto-enabled on supported hardware. Update Unsloth to get them.

📖 Read the full source: HN LLM Tools
