NVIDIA Releases Nemotron-3-Ultra-550B: 55B Active Parameters, 1M Context, LatentMoE Hybrid

NVIDIA released Nemotron-3-Ultra-550B-A55B-BF16, a frontier-scale LLM with 550B total parameters and 55B active. The model uses a hybrid Latent Mixture-of-Experts (LatentMoE) architecture that interleaves Mamba-2, MoE, and attention layers, plus Multi-Token Prediction (MTP) for faster generation. Context length reaches up to 1M tokens.
Key Specs
- Architecture: LatentMoE hybrid – Mamba-2 + MoE + Attention + MTP
- Parameters: 550B total / 55B active
- Context: Up to 1M tokens
- Min GPU: 8x GB200/B200/GB300/B300, 16x H100, 8x H200
- Languages: English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Brazilian Portuguese, Chinese
- Reasoning: Configurable on/off via chat template (
enable_thinking=True/False) - License: OpenMDW License Agreement v1.1
The model is built for frontier reasoning, complex agentic workflows, long-context analysis, tool use, multilingual reasoning, and high-stakes RAG. It's trained with NVFP4 pre-training recipe for compute efficiency. Open weights, training data, and recipes are included under the OpenMDW license. For local inference, you'll need at least 8x H200 or equivalent.
📖 Read the full source: r/LocalLLaMA
👀 See Also

EU Forces Google to Open Android AI to Third Parties Under DMA
European Commission proposes measures to allow third-party AI assistants system-level access on Android, including hot word invocation, screen context, and local model hardware access. Google calls it 'unwarranted intervention'.

The First Step to AGI: Bridging the Gap with ClawDBot
Explore how ClawDBot advances us towards AGI by enhancing AI coding agents, showcasing a pivotal step in AI evolution.

Local vs Cloud Models: Qwen-3.6-27B, Gemma-4-31B, Claude Haiku, Codex-Spark on Hard Code Gen
A user tested Qwen-3.6-27B (q4_k_m) locally on an RTX 5080 against API-based Gemma-4-31B, Claude Haiku 4.5, and Codex-Spark on a complex code task. Only Codex-Spark produced complete code (but with import errors); all others failed partially. Cost: Gemma used $0.112 for 803k input tokens.

Reddit user shares bizarre AI persona portability story from Vanity Fair article
A Reddit post discusses a Vanity Fair article anecdote where a woman attempted to port her AI companion 'Max' from ChatGPT to Claude, resulting in unexpected behavior from Claude.