Krasis: Hybrid CPU/GPU Runtime for Large MoE Models Achieves 3,324 tok/s Prefill on RTX 5080

Krasis is a hybrid CPU/GPU runtime specifically designed for large Mixture-of-Experts (MoE) models. The core approach uses GPU for the computationally expensive prefill phase while CPU handles decode, with system RAM providing additional capacity to maximize performance.
Benchmark Results
RTX 5080 Configuration:
- Hardware: AMD 5900X, DDR4-3200, 1x RTX 5080 16GB, PCIe 4.0 x16
- Qwen3-Coder-Next (80B) Q4: 3,324 tok/s prefill, 9.7s TTFT (35K context), 14.9 tok/s decode
EPYC Configuration:
- Hardware: AMD EPYC 7742 (64c), DDR4-2666 8-channel, 1x RTX 2000 Ada 16GB, PCIe 4.0 x8
- Qwen3-Coder-Next (80B) Q4: 1,060 tok/s prefill, 18.9s TTFT, 15.8 tok/s decode
- Qwen3-Coder-Next (80B) Q8: 873 tok/s prefill, 40.1s TTFT, 12.4 tok/s decode
- Qwen3.5-35B-A3B Q4: 1,374 tok/s prefill, 14.6s TTFT, 15.0 tok/s decode
- Qwen3-235B-A22B Q4: 289 tok/s prefill, 69.1s TTFT, 3.4 tok/s decode
- DeepSeek V2-Lite (16B) Q4: 1,477 tok/s prefill, 13.6s TTFT, 20.2 tok/s decode
- DeepSeek V2-Lite (16B) Q8: 1,317 tok/s prefill, 15.2s TTFT, 17.8 tok/s decode
Benchmarks used 10K–50K token prompts for prefill (best of 20K/35K/50K reported) and 64-token generation for decode (average of 3 runs).
How It Works
Unlike standard runtimes that offload only a few layers to GPU and run most of the model on CPU, Krasis treats the GPU as a streaming compute engine. It pushes the model through VRAM as fast as possible, hiding transfers under concurrent compute. The GPU handles the full prefill pass, then the CPU handles decode.
Tradeoffs
- RAM hungry: Requires ~2.5x the quantized model weight in system RAM (e.g., ~100GB for Qwen3-Coder-Next at Q4)
- NVIDIA cards only
- Specifically targeted at MoE models (decode would be slow on dense models)
- First run is slow due to preprocessing and caching
- Disk hungry: Requires original BF16 safetensors file and stores cached transcoded models (~2x quantized model size)
Supported Models
Qwen3-Coder-Next (most thoroughly tested), Qwen3.5-35B-A3B, Qwen3-235B-A22B, and DeepSeek V2-Lite. Other models coming soon.
Technical Details
- Written in Rust + Python (for orchestration)
- OpenAI-compatible API (works with Cursor, OpenCode, etc.)
- Interactive launcher for configuration
- SSPL licensed (free to use, modify, distribute)
- GitHub: https://github.com/brontoguana/krasis
The developer is seeking feedback on which models to support next, thoughts on the tradeoffs, and benchmarks from users with 5-series cards and PCIe 5.0.
📖 Read the full source: r/LocalLLaMA
👀 See Also

NexQuant: Rust-native 3-bit KV-cache engine for edge deployment
NexQuant is a production-hardened Rust engine that enables running high-context models on consumer hardware with 3-5x memory reduction. It supports Metal, CUDA, Vulkan, and CPU backends.

AI Chat Exporter: A Chrome Extension for High-Fidelity Claude Conversation PDFs
A developer built AI Chat Exporter, a Chrome extension that preserves math, code, and images when exporting Claude conversations to PDF. The tool uses a local browser-based rendering engine developed with Claude 3.5 Sonnet to handle progressive markdown and LaTeX formatting.

Claude for Design Work: How to Stop Repeating the Same Taste Arguments Every Session
A developer running client work through Claude describes the core problem: Claude has no memory of rejected design decisions, leading to generic outputs and inconsistent brand identity.

Comparing OpenClaw and Claude Cowork: Local Automation vs Sandboxed Workflows
OpenClaw is an always-on local agent that runs on your machine with shell command execution and browser automation, while Claude Cowork operates within Claude Desktop in a sandboxed environment focused on document and browser tasks.