Unsloth and NVIDIA Collaborate to Speed Up LLM Training by ~25%

Unsloth's collaboration with NVIDIA yields a ~25% training speedup with no accuracy loss through three optimizations: caching packed-sequence metadata, double-buffered async gradient checkpointing, and MoE routing improvements. They are enabled automatically on RTX laptops, data center GPUs, and DGX Spark once Unsloth is updated.
Caching Packed-Sequence Metadata
Packed training concatenates short examples into one sequence to avoid wasting compute on padding. Previously, each transformer layer rebuilt the same sequence metadata (lengths, cu_seqlens, max_seqlen, mask structure) from scratch, which caused device-host synchronization overhead. Unsloth now builds the metadata once per batch and reuses it across all layers, eliminating the repeated work.
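The build-once, reuse-everywhere idea can be sketched in plain Python. This is a minimal illustration, not Unsloth's code: `PackedMetadata`, `build_metadata`, and `run_layers` are hypothetical names, and Python lists stand in for GPU tensors.

```python
from itertools import accumulate


class PackedMetadata:
    """Hypothetical container for the per-batch metadata of a packed sequence."""

    def __init__(self, lengths):
        self.lengths = list(lengths)
        # Cumulative boundaries of each example: [0, len0, len0 + len1, ...]
        self.cu_seqlens = [0] + list(accumulate(self.lengths))
        self.max_seqlen = max(self.lengths)


def build_metadata(lengths, stats):
    # Count every rebuild so the two strategies can be compared.
    stats["builds"] += 1
    return PackedMetadata(lengths)


def run_layers(lengths, num_layers, cache=True):
    """Simulate a forward pass and return how often metadata was built."""
    stats = {"builds": 0}
    cached = build_metadata(lengths, stats) if cache else None
    for _ in range(num_layers):
        # Without caching, every layer rebuilds identical metadata.
        meta = cached if cache else build_metadata(lengths, stats)
        # A real attention layer would consume meta.cu_seqlens and meta.max_seqlen here.
        assert meta.cu_seqlens[-1] == sum(lengths)
    return stats["builds"]


meta = PackedMetadata([3, 5, 2])
print(meta.cu_seqlens, meta.max_seqlen)
print(run_layers([3, 5, 2], num_layers=16, cache=False))
print(run_layers([3, 5, 2], num_layers=16, cache=True))
```

With caching, the number of metadata builds no longer grows with model depth, which is why deeper models have more repeated work to eliminate.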
Benchmarks on Qwen3-14B QLoRA SFT show:
- Forward pass: 43.3% faster
- Backward pass: 5.8% faster
- Overall per batch: 14.3% faster
A microbenchmark on NVIDIA Blackwell GPUs measured the dominant mask-construction cost at ~13.7 ms per packed batch. For Llama-3.2-1B (16 layers), this translates to ~199 ms saved per step (11.5% lower); for Qwen3-0.6B (28 layers), ~319 ms saved (14.8% lower).
Double-Buffered Async Gradient Checkpointing
Async gradient checkpointing overlaps recomputation with computation. This gives an 8% speedup without impacting accuracy.
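The article does not describe the mechanism in detail, so the following is only a generic sketch of double buffering, assuming the pattern is "prepare the next layer's activation in the background while the current layer's backward step runs". The functions `restore_activation` and `backward_step` are hypothetical stand-ins, and a worker thread stands in for an asynchronous GPU stream.

```python
from concurrent.futures import ThreadPoolExecutor


def restore_activation(layer):
    # Hypothetical stand-in for fetching or recomputing one layer's activation.
    return [float(layer)] * 4


def backward_step(layer, activation):
    # Hypothetical stand-in for the layer's backward computation.
    return layer, sum(activation)


def backward_double_buffered(num_layers):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start filling the first buffer before the loop begins.
        pending = pool.submit(restore_activation, num_layers - 1)
        for layer in reversed(range(num_layers)):
            activation = pending.result()  # wait for the buffer in flight
            if layer > 0:
                # Fill the second buffer in the background while this layer computes.
                pending = pool.submit(restore_activation, layer - 1)
            results.append(backward_step(layer, activation))
    return results


print(backward_double_buffered(3))
```

At most two activations are alive at once (the one in use and the one being prepared), which is what the "double-buffered" label refers to in this sketch.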
MoE Routing: argsort + bincount
For MoE models, using torch.argsort and torch.bincount instead of custom kernels speeds up gpt-oss training by 15%.
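To show what argsort and bincount contribute to routing, here is a minimal sketch in plain Python. The helpers `argsort` and `bincount` are pure-Python stand-ins for `torch.argsort` (with `stable=True`) and `torch.bincount`, and `group_tokens_by_expert` is a hypothetical name; none of this is the actual Unsloth implementation.

```python
def argsort(values):
    # Stable argsort: indices that would sort `values` in ascending order.
    return sorted(range(len(values)), key=values.__getitem__)


def bincount(values, minlength):
    # Number of occurrences of each integer in [0, minlength).
    counts = [0] * minlength
    for v in values:
        counts[v] += 1
    return counts


def group_tokens_by_expert(expert_ids, num_experts):
    """Return the token order, per-expert counts, and segment offsets."""
    order = argsort(expert_ids)                 # tokens grouped by expert
    counts = bincount(expert_ids, num_experts)  # tokens assigned to each expert
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)         # expert e owns order[offsets[e]:offsets[e + 1]]
    return order, counts, offsets


expert_ids = [2, 0, 1, 0, 2, 2]  # expert chosen by the router for each token
order, counts, offsets = group_tokens_by_expert(expert_ids, num_experts=3)
print(order)    # [1, 3, 2, 0, 4, 5]
print(counts)   # [2, 1, 3]
print(offsets)  # [0, 2, 3, 6]
```

After the sort, each expert's tokens form one contiguous slice, so every expert can process its tokens as a single batched operation.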
All optimizations are auto-enabled on supported hardware. Update Unsloth to get them.