Nemotron 3 Super 120B Benchmark: 1M Context on M1 Ultra

Local 1M Token Context Test with Nemotron 3 Super

A Reddit user benchmarked Nemotron 3 Super 120B on an M1 Ultra to test whether 1-million-token contexts can be processed locally. The test relied on the model's hybrid Mamba-2 architecture, which is more memory-efficient at long context lengths.

Hardware and Setup Details

The test was run on an M1 Ultra using llama.cpp with the following configuration:

Model: Nemotron-3-Super-120B-Q4_K.gguf (Q4_K_M quantization)
Context allocation: Full 1 million tokens
Memory usage (unified memory): Approximately 90 GB
Backend: MTL,BLAS with 1 thread (-t 1)
Batch size and micro-batch size: 2048 each (-b 2048 -ub 2048)
Flash attention: Enabled (fa 1)
GPU layers: 99 (-ngl 99)

Benchmark Command and Results

The user ran llama-bench with this command:

llama-bench -m ~/ml-models/huggingface/ggml-org/Nemotron-3-Super-120B-GGUF/Nemotron-3-Super-120B-Q4_K.gguf -fa 1 -t 1 -ngl 99 -b 2048 -ub 2048 -d 0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,150000,200000,250000,1000000
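The long -d list is easier to audit when it is generated rather than typed. The following is a minimal Python sketch that reconstructs the same argument list as the command above. The helper names build_depths and build_bench_args are illustrative and not part of llama.cpp; only flags that appear in the original command are used.

```python
# Model path exactly as given in the original command.
MODEL = ("~/ml-models/huggingface/ggml-org/Nemotron-3-Super-120B-GGUF/"
         "Nemotron-3-Super-120B-Q4_K.gguf")


def build_depths():
    """Depths from the original run: 0 to 100k in 10k steps, then four larger points."""
    return list(range(0, 100_001, 10_000)) + [150_000, 200_000, 250_000, 1_000_000]


def build_bench_args(model=MODEL):
    """Hypothetical helper: assemble the llama-bench argument list from the post."""
    depths = ",".join(str(d) for d in build_depths())
    return ["llama-bench", "-m", model,
            "-fa", "1", "-t", "1", "-ngl", "99",
            "-b", "2048", "-ub", "2048",
            "-d", depths]


if __name__ == "__main__":
    # Plain space-joining keeps the leading ~ unquoted so a shell can expand it.
    print(" ".join(build_bench_args()))
```

Printing the joined list reproduces the command line shown above, which makes it easy to add or remove depth points without editing the string by hand.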

Key performance results from the benchmark:

Prompt processing (pp512) at 0 context: 255.03 ± 0.36 tokens/second
Token generation (tg128) at 0 context: 26.72 ± 0.02 tokens/second
Prompt processing at 100,000 token context: 184.99 ± 0.19 tokens/second
Token generation at 100,000 token context: 22.37 ± 0.01 tokens/second
Prompt processing at 150,000 token context: 161.60 ± 0.22 tokens/second
Token generation at 150,000 token context: 20.58 ± 0.01 tokens/second
Prompt processing at 200,000 token context: 141.87 ± 0.19 tokens/second

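The results show throughput falling as context length increases. Prompt processing drops from 255 t/s at zero context to approximately 142 t/s at 200,000 tokens (roughly 44% slower), and token generation drops from 26.7 t/s at zero context to 20.6 t/s at 150,000 tokens (roughly 23% slower).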
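To make the degradation concrete, the reported figures can be turned into relative slowdowns and a rough prefill-time estimate. This is a sketch under stated assumptions, not something the original post computed: the names slowdown_pct and rough_prefill_seconds are illustrative, and the time estimate assumes that the pp512 rate measured at a given depth approximates the marginal processing rate at that depth, with seconds-per-token varying linearly between measured points.

```python
# Throughput in tokens/second, copied from the results listed above.
PP = {0: 255.03, 100_000: 184.99, 150_000: 161.60, 200_000: 141.87}
TG = {0: 26.72, 100_000: 22.37, 150_000: 20.58}


def slowdown_pct(series, depth):
    """Percent drop in throughput at `depth` relative to the zero-context figure."""
    base = series[0]
    return (base - series[depth]) / base * 100.0


def rough_prefill_seconds(series):
    """Trapezoidal estimate of the time to ingest tokens between the smallest
    and largest measured depths.

    Assumption (not from the post): the rate measured at depth d is the
    marginal rate at d, and seconds-per-token is linear between measurements.
    """
    depths = sorted(series)
    total = 0.0
    for lo, hi in zip(depths, depths[1:]):
        sec_per_tok = (1.0 / series[lo] + 1.0 / series[hi]) / 2.0
        total += (hi - lo) * sec_per_tok
    return total


if __name__ == "__main__":
    for depth in sorted(PP)[1:]:
        print(f"pp512 at {depth:>7,} tokens: {slowdown_pct(PP, depth):.1f}% slower")
    for depth in sorted(TG)[1:]:
        print(f"tg128 at {depth:>7,} tokens: {slowdown_pct(TG, depth):.1f}% slower")
    minutes = rough_prefill_seconds(PP) / 60.0
    print(f"rough time to prefill 200,000 tokens: {minutes:.0f} minutes")
```

Under these assumptions, ingesting a 200,000-token prompt would take on the order of 18 minutes. Treat that as an order-of-magnitude figure only, since it interpolates across just four measured depths.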

System Information

The Metal backend initialization showed:

GPU name: MTL0
GPU family: MTLGPUFamilyApple7 (1007)
Has unified memory: true
Has bfloat support: true
Recommended max working set size: 134,217.73 MB
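The reported memory usage can be compared against the recommended working set size from the log. This is a small sketch, assuming the log's MB are decimal (10^6 bytes) and that "approximately 90GB" may mean either decimal GB or binary GiB; the function name usage_fraction is illustrative.

```python
RECOMMENDED_MAX_MB = 134_217.73  # from the Metal initialization output above
REPORTED_USAGE_GB = 90.0         # "approximately 90GB" from the post


def usage_fraction(usage_gb, limit_mb, binary_gb=False):
    """Fraction of the recommended working set occupied by the run.

    Assumption: limit_mb is in decimal megabytes (10**6 bytes).
    With binary_gb=True the usage figure is interpreted as GiB (2**30 bytes).
    """
    usage_bytes = usage_gb * (2**30 if binary_gb else 10**9)
    limit_bytes = limit_mb * 10**6
    return usage_bytes / limit_bytes


if __name__ == "__main__":
    dec = usage_fraction(REPORTED_USAGE_GB, RECOMMENDED_MAX_MB)
    bin_ = usage_fraction(REPORTED_USAGE_GB, RECOMMENDED_MAX_MB, binary_gb=True)
    print(f"usage if 90 GB is decimal: {dec:.0%} of the recommended working set")
    print(f"usage if 90 GB is binary:  {bin_:.0%} of the recommended working set")
```

Under either reading the run stays below the recommended limit, at roughly two thirds to three quarters of it.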

This test suggests that local processing of extremely large contexts (up to 1 million tokens) is technically possible on high-end Apple Silicon hardware with a quantized model, at the cost of roughly 90 GB of memory and throughput that declines steadily as context grows.

📖 Read the full source: r/LocalLLaMA