Atlas Inference Engine Goes Open Source: Pure Rust + CUDA, 100+ tok/s on DGX Spark

The Atlas inference engine, previously teased hitting 102 tok/s on Qwen3.5-35B on a DGX Spark, is now open source on GitHub. Written in pure Rust and CUDA with no PyTorch or Python runtime, Atlas delivers a ~2.5 GB Docker image and sub-2-minute cold start. The team rewrote the full stack from HTTP handler to kernel dispatch to eliminate the 20+ GB Python overhead that was bottlenecking the GPU.
Key Benchmarks on DGX Spark (GB10)
- Qwen3.5-35B (NVFP4, MTP K=2): 130 tok/s peak, ~111 tok/s sustained — 3.0–3.3× vLLM at testing time
- Qwen3.5-122B (NVFP4, EP=2): ~50 tok/s decode
- Qwen3-Next-80B-A3B (NVFP4, MTP): ~87 tok/s
- Nemotron-3 Nano 30B (FP8): ~88 tok/s
- Full model matrix including MiniMax2.7, Qwen3.6, Gemma available on the site
What Makes Atlas Different
- Hand-tuned CUDA kernels for Blackwell SM120/121: attention, MoE, GDN, Mamba-2 — no generic fallbacks
- Native NVFP4 + FP8 on tensor cores
- MTP (Multi-Token Prediction) speculative decoding for up to 3× throughput on decode
- OpenAI + Anthropic API compatibility on the same port — works with Claude Code, Cline, OpenCode, Open WebUI out of the box
Quick Start
docker pull avarok/atlas-gb10:latest
sudo docker run -d --name atlas --network host --gpus all --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
avarok/atlas-gb10:latest serve Qwen/Qwen3.6-35B-A3B-FP8 \
--port 8888 --speculative --enable-prefix-caching
Roadmap & Community
The team is working on a Strix Halo port with Spectral Compute (AMD-provided hardware), and an RTX 6000 Pro Blackwell port is planned. The roadmap is community-driven — MiniMax M2.7 support landed from a Discord request. Atlas targets four chips well rather than twenty poorly.
For non-Spark users, the current binary is DGX Spark only, but the code is open for adaptation.
📖 Read the full source: r/LocalLLaMA
👀 See Also

ClawControl v1.7.1 fixes daily usage issues in OpenClaw client
ClawControl v1.7.1 is an open source client for OpenClaw available on Windows, Mac, Linux, iOS, and Android. This release focuses on fixing 'why is it doing that?' issues encountered during daily OpenClaw usage.

engram: Claude memory plugin with salience-gated capture and dream cycles
engram is a Claude memory plugin that filters observations at capture time using 5 salience dimensions, persisting only high-scoring events to SQLite with no LLM calls in scoring. It features automatic injection through 5 hooks and dream cycles that extract recurring workflows at session end.

Karpathy's Autoresearch Ported to Apple Neural Engine for Better Throughput per Watt
A prototype combines Andrej Karpathy's autoresearch project with reverse-engineered Apple Neural Engine performance, aiming for better throughput per watt compared to official APIs. The project is built on existing GitHub repositories and acknowledges contributions from multiple developers.

Remark: A Markdown Annotation Tool for Claude Code Workflows
Remark is a native macOS app that lets developers annotate Markdown files inline for Claude Code review workflows. It exports annotations as JSON for the agent and integrates via a skill installed in the .claude/skills/ directory.