Benchmarking Nemotron 3 Super 120B with 1M token context on M1 Ultra

Local 1M Token Context Test with Nemotron 3 Super
A Reddit user ran a benchmark to evaluate whether 1 million token contexts can be processed locally using Nemotron 3 Super 120B on an M1 Ultra system. The test relied on the model's hybrid Mamba-2 architecture, which is more memory-efficient at long context lengths.
Hardware and Setup Details
The test was run on an M1 Ultra using llama.cpp with the following configuration:
- Model: Nemotron-3-Super-120B-Q4_K.gguf (Q4_K_M quantization)
- Context allocation: Full 1 million tokens
- VRAM usage: Approximately 90GB
- Backend: MTL,BLAS with 1 thread
- Unified batch size: 2048
- Flash attention: Enabled (fa 1)
- GPU layers: 99 (-ngl 99)
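As a minimal sketch, the settings above could be assembled into a llama-bench invocation from Python. The flags used here (-m, -fa, -t, -ngl, -b, -ub, -d) are the ones that appear in the user's command; the helper name build_llama_bench_args and the placeholder model path are hypothetical and only serve the illustration.

```python
import shlex


def build_llama_bench_args(model_path, depths, batch=2048, ubatch=2048):
    """Assemble an argv list for llama-bench (hypothetical helper)."""
    return [
        "llama-bench",
        "-m", model_path,
        "-fa", "1",        # flash attention enabled
        "-t", "1",         # one CPU thread
        "-ngl", "99",      # offload up to 99 layers to the GPU
        "-b", str(batch),
        "-ub", str(ubatch),
        "-d", ",".join(str(d) for d in depths),  # context depths to test
    ]


# Depths from the reported run: 0 to 100,000 in steps of 10,000, then four larger values.
depths = list(range(0, 100_001, 10_000)) + [150_000, 200_000, 250_000, 1_000_000]

# "model.gguf" is a placeholder path, not the user's actual file location.
args = build_llama_bench_args("model.gguf", depths)
print(shlex.join(args))
```

The resulting list can be passed to subprocess.run(args, check=True) on a machine where llama-bench is on the PATH.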
Benchmark Command and Results
The user ran llama-bench with this command:
llama-bench -m ~/ml-models/huggingface/ggml-org/Nemotron-3-Super-120B-GGUF/Nemotron-3-Super-120B-Q4_K.gguf -fa 1 -t 1 -ngl 99 -b 2048 -ub 2048 -d 0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,150000,200000,250000,1000000
Key performance results from the benchmark:
- Prompt processing (pp512) at 0 context: 255.03 ± 0.36 tokens/second
- Token generation (tg128) at 0 context: 26.72 ± 0.02 tokens/second
- Prompt processing at 100,000 token context: 184.99 ± 0.19 tokens/second
- Token generation at 100,000 token context: 22.37 ± 0.01 tokens/second
- Prompt processing at 150,000 token context: 161.60 ± 0.22 tokens/second
- Token generation at 150,000 token context: 20.58 ± 0.01 tokens/second
- Prompt processing at 200,000 token context: 141.87 ± 0.19 tokens/second
The results show performance degradation as context length increases, with prompt processing speed dropping from 255 t/s at zero context to approximately 142 t/s at 200,000 tokens.
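To put numbers on the degradation, the sketch below computes the relative slowdown at each reported depth against the zero-context baseline. The throughput values are copied from the results listed above; the function name slowdown_pct is illustrative.

```python
# Depth in tokens -> tokens/second, copied from the results listed above.
pp = {0: 255.03, 100_000: 184.99, 150_000: 161.60, 200_000: 141.87}
tg = {0: 26.72, 100_000: 22.37, 150_000: 20.58}


def slowdown_pct(rates):
    """Percentage drop in throughput at each depth relative to depth 0."""
    base = rates[0]
    return {
        depth: round(100 * (1 - rate / base), 1)
        for depth, rate in rates.items()
        if depth != 0
    }


pp_drop = slowdown_pct(pp)
tg_drop = slowdown_pct(tg)

for depth, pct in pp_drop.items():
    print(f"pp at {depth:>7,} tokens: {pct}% slower than at zero context")
for depth, pct in tg_drop.items():
    print(f"tg at {depth:>7,} tokens: {pct}% slower than at zero context")
```

On these figures prompt processing loses about 44% of its speed by 200,000 tokens, while token generation loses about 23% by 150,000 tokens, so generation degrades more gently than prompt processing.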
System Information
The Metal backend initialization showed:
- GPU name: MTL0
- GPU family: MTLGPUFamilyApple7 (1007)
- Has unified memory: true
- Has bfloat support: true
- Recommended max working set size: 134,217.73 MB
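A quick way to read the working-set figure is to compare it with the roughly 90GB the run used. The sketch below does that arithmetic; it assumes decimal units (1 GB = 1000 MB), which the source does not state, and treats the 90GB figure as approximate.

```python
recommended_mb = 134_217.73  # recommended max working set size from the Metal log
used_gb = 90.0               # approximate memory usage reported for the run

# Assumption: decimal units, 1 GB = 1000 MB.
recommended_gb = recommended_mb / 1000
fraction_used = used_gb / recommended_gb
headroom_gb = recommended_gb - used_gb

print(f"Recommended working set: {recommended_gb:.1f} GB")
print(f"Share used by the run:   {fraction_used:.0%}")
print(f"Remaining headroom:      {headroom_gb:.1f} GB")
```

Under that assumption the run occupies about two thirds of the recommended working set, leaving roughly 44 GB of headroom.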
This test demonstrates that a 1 million token context can be allocated locally on high-end Apple Silicon hardware with a quantized model, at a cost of roughly 90GB of memory. Throughput falls steadily as the context fills, and the results listed above only extend to 200,000 tokens.
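One way to get a feel for the trade-off is to estimate how long it would take to ingest a long prompt at the reported speeds. The sketch below is a rough estimate, not a measured result: it treats each pp512 figure as the instantaneous prompt-processing rate at that depth and integrates time per token between the reported depths with the trapezoidal rule. Interpolating between such sparse points is an assumption, and the function name estimate_prefill_seconds is illustrative.

```python
# (depth in tokens, prompt-processing tokens/second), from the reported results.
points = [(0, 255.03), (100_000, 184.99), (150_000, 161.60), (200_000, 141.87)]


def estimate_prefill_seconds(points):
    """Trapezoidal estimate of the time to process tokens from the first to the last depth."""
    total = 0.0
    for (d0, r0), (d1, r1) in zip(points, points[1:]):
        # Average the seconds-per-token at both ends of the interval.
        total += (d1 - d0) * (1 / r0 + 1 / r1) / 2
    return total


seconds = estimate_prefill_seconds(points)
print(f"Estimated time to ingest 200,000 tokens: {seconds / 60:.1f} minutes")
```

With these inputs the estimate comes out at roughly 18 minutes for a 200,000 token prompt. Extending it toward 1 million tokens would need the throughput figures for the deeper runs, which are not listed here.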
📖 Read the full source: r/LocalLLaMA