Blackwell LLM Toolkit: NVFP4 Configs, Wheels, and Benchmarks for TensorRT-LLM on RTX Pro 6000

A new GitHub repository, blackwell-llm-toolkit, collects TensorRT-LLM configs, prebuilt wheels, and benchmark results for running LLMs on NVIDIA Blackwell GPUs (RTX Pro 6000, 5090, 5080, 5070 Ti). The focus is on NVFP4 quantization and on working around Blackwell-specific problems, such as published wheels that lack sm_120 kernels.
Key Features
- TensorRT-LLM configs: Includes a YAML file (configs/trtllm/nemotron-omni-v3-sm120.yaml) with the obscure launch flags needed to run Mamba-hybrid models on Blackwell.
- LMCache wheels: The PyPI wheel crashed on Blackwell due to missing sm_120 cubins. The repo provides a rebuilt wheel and a build script, tested with Optane SSD for KV cache offloading.
- Research docs: AI-generated deep-dives on architecture differences in Nemotron Omni V3, Qwen 3.5/3.6, and Gemma 4. Notably, Qwen 3.5/3.6 are not just renamed Qwen3-VL: they have a completely different architecture.
- Benchmark harnesses:
  - rapid_bench.py runs a 41-prompt quality eval (intelligence, tool-use, calibration, orchestration, creative writing).
  - bench_harness.py measures sustained decode, TTFT, prefill, and concurrency, with a --prompt-tokens N mode for long context.
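To make the throughput metrics named above concrete, here is a minimal Python sketch of how TTFT, sustained decode rate, and prefill rate are commonly derived from per-token arrival timestamps. It is not taken from bench_harness.py: the names `summarize_stream`, `prefill_tok_per_s`, and `StreamStats` are hypothetical, and treating prefill time as equal to TTFT is a simplifying assumption that ignores queueing and network overhead.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamStats:
    ttft_s: float            # time to first token, in seconds
    decode_tok_per_s: float  # sustained decode rate after the first token


def summarize_stream(request_start: float, token_times: Sequence[float]) -> StreamStats:
    """Derive TTFT and decode throughput from token arrival timestamps.

    Hypothetical helper: timestamps are seconds on a monotonic clock,
    one entry per generated token, in arrival order.
    """
    if not token_times:
        raise ValueError("at least one generated token is required")
    ttft = token_times[0] - request_start
    # The first token is dominated by prefill, so it is excluded from
    # the decode rate: N tokens give N - 1 decode intervals.
    decode_tokens = len(token_times) - 1
    decode_span = token_times[-1] - token_times[0]
    rate = decode_tokens / decode_span if decode_span > 0 else 0.0
    return StreamStats(ttft_s=ttft, decode_tok_per_s=rate)


def prefill_tok_per_s(prompt_tokens: int, ttft_s: float) -> float:
    """Approximate prefill speed, assuming TTFT is almost entirely prefill."""
    if ttft_s <= 0:
        raise ValueError("ttft_s must be positive")
    return prompt_tokens / ttft_s


if __name__ == "__main__":
    stats = summarize_stream(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
    print(stats)
    print(prefill_tok_per_s(8192, stats.ttft_s))
```

Separating the first token from the rest is what lets a long-prompt run report a slow TTFT without dragging down the decode figure.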
Benchmark Highlights (Single RTX Pro 6000 96GB, no TP)
- Nemotron-3-Nano-Omni V3 (multimodal, NVFP4, 8k context): 270 tok/s. Fastest model tested, handles image/video/audio+text. Requires TRT-LLM v1.3.0rc13.
- Nemotron-3-Nano (text-only, NVFP4, 8k context): 249 tok/s. Best for tool-calling agents (10/10 on tools).
- DeepSeek-V4-Flash (IQ2_XXS-XL GGUF, 65k context): 31 tok/s. Best for complex reasoning (9/10 intel, 10/10 tools, 13/13 calibration).
- MiniMax-M2.7-REAP-172B (Q3_K_S GGUF, 196k context): 117 tok/s. Good for long conversations.
- MiniMax-M2.7 W4A16 (with LMCache on Optane SSD, 154k context): 20-22 tok/s. Long-context W4A16 quality.
- MiniMax-M2.7 W4A16 (short context, no LMCache, 64k context): 22-25 tok/s. Highest quality short answers (10/10 intel).
Full results with TTFT, prefill speeds, concurrency, and eval scores are in bench/results.md.
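As an illustration of using these numbers for model selection, the sketch below encodes the highlights listed above and ranks the models that meet a minimum context length by decode speed. The `Result` type and `fastest_with_context` function are hypothetical and not part of the repository; only the figures come from the list, with ranges such as 20-22 tok/s stored as a low/high pair and ranked by the low end.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Result:
    model: str
    quant: str
    context_k: int     # tested context length, in thousands of tokens
    tok_s_low: float   # lower bound of reported decode speed
    tok_s_high: float  # upper bound (equal to low when one value is given)


# Figures copied from the benchmark highlights (single RTX Pro 6000 96GB).
RESULTS = [
    Result("Nemotron-3-Nano-Omni V3", "NVFP4", 8, 270, 270),
    Result("Nemotron-3-Nano", "NVFP4", 8, 249, 249),
    Result("DeepSeek-V4-Flash", "IQ2_XXS-XL GGUF", 65, 31, 31),
    Result("MiniMax-M2.7-REAP-172B", "Q3_K_S GGUF", 196, 117, 117),
    Result("MiniMax-M2.7 W4A16 (LMCache on Optane SSD)", "W4A16", 154, 20, 22),
    Result("MiniMax-M2.7 W4A16 (no LMCache)", "W4A16", 64, 22, 25),
]


def fastest_with_context(results: List[Result], min_context_k: int) -> List[Result]:
    """Return results with at least min_context_k context, fastest first."""
    candidates = [r for r in results if r.context_k >= min_context_k]
    # Rank by the conservative (low) end of each reported range.
    return sorted(candidates, key=lambda r: r.tok_s_low, reverse=True)


if __name__ == "__main__":
    for r in fastest_with_context(RESULTS, min_context_k=64):
        print(f"{r.model}: {r.tok_s_low:g}-{r.tok_s_high:g} tok/s at {r.context_k}k")
```

Speed is only one axis; the quality scores in bench/results.md would need to be weighed alongside it.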
Who It's For
Developers and researchers running LLM inference on Blackwell GPUs who need optimized TensorRT-LLM configs, prebuilt LMCache for long-context offloading, or real-world benchmark data for model selection.
📖 Read the full source: r/LocalLLaMA