Nvidia's Nemotron 3 Super: 120B Parameter Model with 12B Active Inference

✍️ OpenClawRadar | 📅 Published: March 12, 2026

Nvidia has released Nemotron 3 Super, a 120-billion-parameter model that activates only 12 billion parameters per token during inference. This challenges the assumption that a bigger model must cost proportionally more to run: it offers the knowledge of a 120B model at roughly the compute cost of a 12B model. Nemotron 3 Super is not a compressed approximation of a larger model. It is a full 120B model that has learned to route efficiently, so the other 108 billion parameters are available when relevant and idle when not.
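To make "route efficiently" concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not Nvidia's implementation: the scalar "experts", the hard-coded router scores, and the names `route_top_k` and `moe_forward` are all assumptions chosen for illustration. In a real model the scores come from a learned router layer and the experts are feed-forward networks.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Normalize weights over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by router weight.
    Experts that are not selected are never called, so they cost nothing."""
    out = 0.0
    for idx, w in route_top_k(router_logits, k):
        out += w * experts[idx](x)
    return out

# Hypothetical toy setup: 8 scalar "experts", 2 active per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # assumed scores

selected = route_top_k(router_logits, k=2)
y = moe_forward(3.0, experts, router_logits, k=2)
print(selected, y)
```

All eight experts exist in memory, but each token pays only for the two it is routed to. That is the same relationship as 120B total versus 12B active, at toy scale.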

Architectural Decisions

Three key architectural decisions make this possible:

  • LatentMoE: Projects tokens into a compressed latent space before routing, which makes routing decisions cheaper. This allows the model to activate 4x more experts for the same inference cost as a standard MoE.
  • Hybrid Mamba-Attention: Replaces quadratic-cost transformer attention with Mamba-2 for most sequence processing, which makes the 1 million token context window practical rather than theoretical. The model achieves 91.75% accuracy on RULER at 1M tokens.
  • Multi-Token Prediction: Generates multiple future tokens per forward pass, which provides native speculative decoding, with up to 3x faster wall-clock inference, without needing a separate draft model. Overall, the model delivers 5x higher throughput than its predecessor and outperforms models that activate 3x more parameters per token.
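The speculative decoding idea behind Multi-Token Prediction can be sketched with a toy greedy verifier. The code below is an illustration under stated assumptions, not the model's actual decoding loop: `target_next` and `draft_next_k` are hypothetical stand-ins for the full model and its extra prediction heads, and the draft is made deliberately imperfect so that rejections occur.

```python
def target_next(prefix):
    """Hypothetical stand-in for the full model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % 50

def draft_next_k(prefix, k):
    """Hypothetical stand-in for multi-token prediction heads: guess k tokens.
    Toy shortcut: it copies the target but injects an error whenever the
    true token is divisible by 7, so some guesses are wrong."""
    guesses, ctx = [], list(prefix)
    for _ in range(k):
        tok = target_next(ctx)
        if tok % 7 == 0:
            tok = (tok + 1) % 50  # deliberate mistake
        guesses.append(tok)
        ctx.append(tok)
    return guesses

def speculative_step(prefix, k):
    """Draft k tokens, then keep the longest verified run.
    On the first mismatch, emit the target's token and stop."""
    accepted, ctx = [], list(prefix)
    for tok in draft_next_k(prefix, k):
        truth = target_next(ctx)  # real systems batch these checks in one pass
        if tok != truth:
            accepted.append(truth)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    return accepted

def generate_speculative(prefix, n_tokens, k):
    """Generate n_tokens; return the sequence and the number of verify steps."""
    out, steps = list(prefix), 0
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, k))
        steps += 1
    return out[:len(prefix) + n_tokens], steps

def generate_greedy(prefix, n_tokens):
    """Baseline: one target call per generated token."""
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out

spec_out, spec_steps = generate_speculative([1, 2, 3], 20, k=3)
greedy_out = generate_greedy([1, 2, 3], 20)
print(spec_out == greedy_out, spec_steps)
```

Every emitted token is checked against the target, so the output matches plain greedy decoding. The speedup comes from needing fewer verification steps than tokens whenever drafts are accepted.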

Broader Trend

This is the third independent confirmation of this architectural approach. DeepSeek V3 demonstrated it first, with 671B total parameters and 37B active, outperforming the dense Llama 3 405B. Qwen3-Coder-Next followed with 80B total parameters and only 3B active at inference, matching Claude Sonnet 4.5 on SWE-Bench Pro and outperforming DeepSeek V3, which activates 37B per token. The efficiency gains compound rather than trade off: each architectural decision benefits more from scale than dense attention does, so the gap between this architecture and dense transformers grows as models scale.

The key insight from these three independent releases is that the path to capability is not more activation but better routing. Parameter-count leaderboards will keep publishing totals, but active parameters per token is becoming the more honest metric for comparing model efficiency and performance.
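As a quick check on that metric, the short Python sketch below computes the active fraction for the three models, using only the total and active parameter counts quoted in this article. The helper name `active_fraction` is an assumption for illustration.

```python
# (total parameters, active parameters per token), in billions, as quoted above.
models = {
    "Nemotron 3 Super": (120, 12),
    "DeepSeek V3": (671, 37),
    "Qwen3-Coder-Next": (80, 3),
}

def active_fraction(total_b, active_b):
    """Share of the model's parameters used for each token."""
    return active_b / total_b

for name, (total_b, active_b) in models.items():
    frac = active_fraction(total_b, active_b)
    idle_b = total_b - active_b
    print(f"{name}: {active_b}B of {total_b}B active "
          f"({frac:.2%}), {idle_b}B idle per token")
```

By this measure, each of the three models uses a tenth or less of its parameters per token.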

📖 Read the full source: r/LocalLLaMA
