Benchmarking Nemotron 3 Super 120B with 1M token context on M1 Ultra

✍️ OpenClawRadar | 📅 Published: March 12, 2026

Local 1M Token Context Test with Nemotron 3 Super

A Reddit user benchmarked Nemotron 3 Super 120B on an M1 Ultra to test whether 1-million-token contexts can be processed locally. The test relied on the model's hybrid Mamba-2 architecture, which is more memory-efficient at long context lengths.

Hardware and Setup Details

The test was run on an M1 Ultra using llama.cpp with the following configuration:

  • Model: Nemotron-3-Super-120B-Q4_K.gguf (Q4_K_M quantization)
  • Context allocation: Full 1 million tokens
  • VRAM usage: Approximately 90GB
  • Backend: MTL,BLAS with 1 thread (-t 1)
  • Batch size and micro-batch size: 2048 each (-b 2048 -ub 2048)
  • Flash attention: Enabled (-fa 1)
  • GPU layers: 99 (-ngl 99)

Benchmark Command and Results

The user ran llama-bench with this command:

llama-bench -m ~/ml-models/huggingface/ggml-org/Nemotron-3-Super-120B-GGUF/Nemotron-3-Super-120B-Q4_K.gguf -fa 1 -t 1 -ngl 99 -b 2048 -ub 2048 -d 0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,150000,200000,250000,1000000
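
The depth list in that command follows a simple pattern: 10,000-token steps up to 100,000, then four larger points. As a minimal sketch, the same argument list could be generated in Python. The helper names (depth_sweep, build_llama_bench_argv) are hypothetical and only the flags already shown in the command are used.

```python
def depth_sweep():
    """Context depths from the article's command: 0 to 100k in 10k steps, then four larger points."""
    return list(range(0, 100_001, 10_000)) + [150_000, 200_000, 250_000, 1_000_000]


def build_llama_bench_argv(model_path, depths):
    """Assemble the llama-bench argument list using only the flags quoted in the article."""
    return [
        "llama-bench",
        "-m", model_path,
        "-fa", "1",      # flash attention on
        "-t", "1",       # one CPU thread
        "-ngl", "99",    # offload all layers to the GPU
        "-b", "2048",    # batch size
        "-ub", "2048",   # micro-batch size
        "-d", ",".join(str(d) for d in depths),
    ]


# Path copied from the article; "~" is expanded by the shell, not by Python.
MODEL_PATH = (
    "~/ml-models/huggingface/ggml-org/Nemotron-3-Super-120B-GGUF/"
    "Nemotron-3-Super-120B-Q4_K.gguf"
)

argv = build_llama_bench_argv(MODEL_PATH, depth_sweep())
print(" ".join(argv))
```

The script only prints the command. Generating the list this way makes it easier to add or drop depths without retyping the whole sweep.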

Key performance results from the benchmark:

  • Prompt processing (pp512) at 0 context: 255.03 ± 0.36 tokens/second
  • Token generation (tg128) at 0 context: 26.72 ± 0.02 tokens/second
  • Prompt processing at 100,000 token context: 184.99 ± 0.19 tokens/second
  • Token generation at 100,000 token context: 22.37 ± 0.01 tokens/second
  • Prompt processing at 150,000 token context: 161.60 ± 0.22 tokens/second
  • Token generation at 150,000 token context: 20.58 ± 0.01 tokens/second
  • Prompt processing at 200,000 token context: 141.87 ± 0.19 tokens/second

Throughput falls as context length increases. Prompt processing drops from 255 t/s at zero context to about 142 t/s at 200,000 tokens, roughly 44% slower. Token generation declines more gently, from 26.7 t/s at zero context to 20.6 t/s at 150,000 tokens, about 23% slower.
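
To put the slowdown in relative terms, the reported figures can be compared against the zero-context baseline. The short script below is an illustrative calculation using only the mean values listed above. The variable and function names are made up for this example.

```python
# Mean throughput in tokens/second, copied from the results listed in the article.
PROMPT_PROCESSING = {0: 255.03, 100_000: 184.99, 150_000: 161.60, 200_000: 141.87}
TOKEN_GENERATION = {0: 26.72, 100_000: 22.37, 150_000: 20.58}


def percent_drop(results):
    """Return the slowdown at each depth relative to the zero-context rate, in percent."""
    baseline = results[0]
    return {
        depth: round(100 * (1 - rate / baseline), 1)
        for depth, rate in results.items()
        if depth != 0
    }


pp_drop = percent_drop(PROMPT_PROCESSING)
tg_drop = percent_drop(TOKEN_GENERATION)

for depth, drop in pp_drop.items():
    print(f"pp512 at {depth:>7,} tokens: {drop}% slower than at zero context")
for depth, drop in tg_drop.items():
    print(f"tg128 at {depth:>7,} tokens: {drop}% slower than at zero context")
```

Token generation loses a smaller share of its speed than prompt processing at the same depths.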

System Information

The Metal backend initialization showed:

  • GPU name: MTL0
  • GPU family: MTLGPUFamilyApple7 (1007)
  • Has unified memory: true
  • Has bfloat support: true
  • Recommended max working set size: 134,217.73 MB
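
For a rough sense of headroom, the recommended working set can be compared with the roughly 90GB the test used. The sketch below assumes that the "MB" in the Metal log means 10^6 bytes and that "90GB" means 90 × 10^9 bytes. Neither is stated in the source, so the outputs are approximations.

```python
# Values from the article; the unit interpretations are assumptions (see comments).
RECOMMENDED_MAX_WORKING_SET_MB = 134_217.73  # as printed by the Metal backend
REPORTED_USAGE_GB = 90.0                     # "approximately 90GB" per the article

# Assumption: MB = 10**6 bytes and GB = 10**9 bytes (decimal units).
working_set_bytes = RECOMMENDED_MAX_WORKING_SET_MB * 1e6
usage_bytes = REPORTED_USAGE_GB * 1e9

working_set_gib = working_set_bytes / 2**30          # same size in binary GiB
fraction_used = usage_bytes / working_set_bytes      # share of the working set in use
headroom_gb = (working_set_bytes - usage_bytes) / 1e9

print(f"Recommended working set: {working_set_gib:.1f} GiB")
print(f"Share used by the test:  {fraction_used:.0%}")
print(f"Approximate headroom:    {headroom_gb:.1f} GB")
```

Under these assumptions the test used about two thirds of the memory Metal recommends for GPU work.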

This test demonstrates that local processing of extremely large contexts (up to 1 million tokens) is technically possible with high-end Apple Silicon hardware and quantized models, though with significant memory requirements and performance trade-offs as context expands.
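
One practical way to read the trade-off is to ask how long it would take to ingest a long prompt. The sketch below is a back-of-the-envelope estimate and not a measurement. It assumes the prompt-processing rate reported at each depth applies while the context is at that depth, and it interpolates between the four reported points with the trapezoid rule. It stops at 200,000 tokens because that is the largest depth with a reported prompt-processing figure.

```python
# (depth in tokens, prompt-processing rate in tokens/second), from the article.
PP_RATE_BY_DEPTH = [
    (0, 255.03),
    (100_000, 184.99),
    (150_000, 161.60),
    (200_000, 141.87),
]


def estimate_prefill_seconds(points):
    """Integrate seconds-per-token over depth with the trapezoid rule.

    Assumption: the rate changes smoothly between the reported depths.
    """
    total = 0.0
    for (d0, r0), (d1, r1) in zip(points, points[1:]):
        # Average the seconds-per-token at both ends of the segment.
        total += (d1 - d0) * (1 / r0 + 1 / r1) / 2
    return total


seconds = estimate_prefill_seconds(PP_RATE_BY_DEPTH)
print(f"Estimated time to prefill 200,000 tokens: about {seconds / 60:.0f} minutes")
```

With the reported figures this comes to roughly 18 minutes for a 200,000-token prompt, so very long prompts mean a long wait before the first generated token.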

📖 Read the full source: r/LocalLLaMA
