Atlas Inference Engine Open Source: Rust + CUDA, 130 tok/s

The Atlas inference engine, previously teased hitting 102 tok/s on Qwen3.5-35B on a DGX Spark, is now open source on GitHub. Written in pure Rust and CUDA with no PyTorch or Python runtime, Atlas delivers a ~2.5 GB Docker image and sub-2-minute cold start. The team rewrote the full stack from HTTP handler to kernel dispatch to eliminate the 20+ GB Python overhead that was bottlenecking the GPU.

Key Benchmarks on DGX Spark (GB10)

Qwen3.5-35B (NVFP4, MTP K=2): 130 tok/s peak, ~111 tok/s sustained — 3.0–3.3× vLLM at testing time
Qwen3.5-122B (NVFP4, EP=2): ~50 tok/s decode
Qwen3-Next-80B-A3B (NVFP4, MTP): ~87 tok/s
Nemotron-3 Nano 30B (FP8): ~88 tok/s
Full model matrix including MiniMax2.7, Qwen3.6, Gemma available on the site

What Makes Atlas Different

Hand-tuned CUDA kernels for Blackwell SM120/121: attention, MoE, GDN, Mamba-2 — no generic fallbacks
Native NVFP4 + FP8 on tensor cores
MTP (Multi-Token Prediction) speculative decoding for up to 3× throughput on decode
OpenAI + Anthropic API compatibility on the same port — works with Claude Code, Cline, OpenCode, Open WebUI out of the box

Quick Start

docker pull avarok/atlas-gb10:latest
sudo docker run -d --name atlas --network host --gpus all --ipc=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    avarok/atlas-gb10:latest serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --port 8888 --speculative --enable-prefix-caching

Roadmap & Community

The team is working on a Strix Halo port with Spectral Compute (AMD-provided hardware), and an RTX 6000 Pro Blackwell port is planned. The roadmap is community-driven — MiniMax M2.7 support landed from a Discord request. Atlas targets four chips well rather than twenty poorly.

For non-Spark users, the current binary is DGX Spark only, but the code is open for adaptation.

📖 Read the full source: r/LocalLLaMA