Nvidia's Nemotron 3 Super: 120B Parameter Model with 12B Active Inference

Nvidia released Nemotron 3 Super, a 120 billion parameter model that activates only 12 billion parameters during inference. This challenges the assumption that bigger models always mean better results by offering the knowledge of a 120B model at roughly the per-token compute cost of a 12B model. The model is not a compressed approximation of a larger one; it is a 120B model that learned to route efficiently, with the other 108 billion parameters available when relevant and idle when not.
Architectural Decisions
Three key architectural decisions make this possible:
- LatentMoE: Projects tokens into a compressed latent space before routing, making routing decisions cheaper. This allows activating 4x more experts for the same inference cost as standard MoE.
- Hybrid Mamba-Attention: Replaces quadratically expensive transformer attention with Mamba-2 for most sequence processing, making the 1 million token context window practical rather than theoretical. Achieves 91.75% accuracy on RULER at 1M tokens.
- Multi-Token Prediction: Generates multiple future tokens per forward pass, providing native speculative decoding with up to 3x faster wall-clock inference and no need for a separate draft model. The model delivers 5x higher throughput than its predecessor and outperforms models activating 3x more parameters per token.
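The idea behind the LatentMoE bullet, routing on a compressed representation rather than the full hidden state, can be illustrated with a toy top-k router. This is a minimal sketch under stated assumptions, not Nvidia's implementation: the sizes (`D_MODEL`, `D_LATENT`, `N_EXPERTS`, `TOP_K`), the random weights, and the helper names `matvec`, `softmax`, and `route_top_k` are all hypothetical and chosen only for illustration.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(token, down_proj, router, k):
    """Project a token into a latent space, then pick the top-k experts.

    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    latent = matvec(down_proj, token)   # d_model -> d_latent
    scores = matvec(router, latent)     # d_latent -> n_experts
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    return list(zip(top, weights))


# Hypothetical toy sizes; real models are orders of magnitude larger.
D_MODEL, D_LATENT, N_EXPERTS, TOP_K = 16, 4, 8, 2

rng = random.Random(0)  # fixed seed so the sketch is reproducible
down_proj = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_LATENT)]
router = [[rng.gauss(0, 1) for _ in range(D_LATENT)] for _ in range(N_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(D_MODEL)]

chosen = route_top_k(token, down_proj, router, TOP_K)
for expert, weight in chosen:
    print(f"expert {expert}: weight {weight:.3f}")
```

In this sketch only the `TOP_K` selected experts would run for the token, and the router's score computation scales with `D_LATENT` instead of `D_MODEL`, which is the intuition behind cheaper routing decisions.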
Broader Trend
This is the third independent confirmation of this architectural approach. DeepSeek V3 first demonstrated it with 671B total parameters and 37B active, outperforming the dense Llama 3 405B. Qwen3-Coder-Next followed with 80B total parameters and only 3B active at inference, matching Claude Sonnet 4.5 on SWE-Bench Pro and outperforming DeepSeek V3, which activates 37B per token. The efficiency gains compound rather than trade off: each architectural decision benefits more from scale than dense attention does, so the gap between this architecture and dense transformers grows as models scale.
The key insight from these three independent releases is that the path to capability is not more activation but better routing. While parameter count leaderboards will continue publishing numbers, active parameters per token is becoming the more honest metric for comparing model efficiency and performance.
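To make the active-parameter metric concrete, the total and active counts quoted in this article can be turned into an active fraction per model. This is a small illustrative calculation; the dictionary `models` and the helper `active_fraction` are names introduced here, and the figures are only those stated in the text above.

```python
# (total parameters, active parameters) in billions, as quoted in this article.
models = {
    "DeepSeek V3": (671, 37),
    "Qwen3-Coder-Next": (80, 3),
    "Nemotron 3 Super": (120, 12),
}


def active_fraction(total_b, active_b):
    """Share of the model's parameters used per token."""
    return active_b / total_b


fractions = {name: active_fraction(t, a) for name, (t, a) in models.items()}

for name, (total_b, active_b) in models.items():
    idle_b = total_b - active_b
    print(f"{name}: {active_b}B of {total_b}B active "
          f"({fractions[name]:.1%}), {idle_b}B idle per token")
```

All three models use a tenth or less of their parameters per token, which is why total parameter count alone says little about inference cost.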
📖 Read the full source: r/LocalLLaMA