Krasis Hybrid CPU/GPU Runtime: 3,324 tok/s Prefill on RTX 5080

Krasis is a hybrid CPU/GPU runtime specifically designed for large Mixture-of-Experts (MoE) models. The core approach uses GPU for the computationally expensive prefill phase while CPU handles decode, with system RAM providing additional capacity to maximize performance.

Benchmark Results

RTX 5080 Configuration:

Hardware: AMD 5900X, DDR4-3200, 1x RTX 5080 16GB, PCIe 4.0 x16
Qwen3-Coder-Next (80B) Q4: 3,324 tok/s prefill, 9.7s TTFT (35K context), 14.9 tok/s decode

EPYC Configuration:

Hardware: AMD EPYC 7742 (64c), DDR4-2666 8-channel, 1x RTX 2000 Ada 16GB, PCIe 4.0 x8
Qwen3-Coder-Next (80B) Q4: 1,060 tok/s prefill, 18.9s TTFT, 15.8 tok/s decode
Qwen3-Coder-Next (80B) Q8: 873 tok/s prefill, 40.1s TTFT, 12.4 tok/s decode
Qwen3.5-35B-A3B Q4: 1,374 tok/s prefill, 14.6s TTFT, 15.0 tok/s decode
Qwen3-235B-A22B Q4: 289 tok/s prefill, 69.1s TTFT, 3.4 tok/s decode
DeepSeek V2-Lite (16B) Q4: 1,477 tok/s prefill, 13.6s TTFT, 20.2 tok/s decode
DeepSeek V2-Lite (16B) Q8: 1,317 tok/s prefill, 15.2s TTFT, 17.8 tok/s decode

Benchmarks used 10K–50K token prompts for prefill (best of 20K/35K/50K reported) and 64-token generation for decode (average of 3 runs).

How It Works

Unlike standard runtimes that offload only a few layers to GPU and run most of the model on CPU, Krasis treats the GPU as a streaming compute engine. It pushes the model through VRAM as fast as possible, hiding transfers under concurrent compute. The GPU handles the full prefill pass, then the CPU handles decode.

Tradeoffs

RAM hungry: Requires ~2.5x the quantized model weight in system RAM (e.g., ~100GB for Qwen3-Coder-Next at Q4)
NVIDIA cards only
Specifically targeted at MoE models (decode would be slow on dense models)
First run is slow due to preprocessing and caching
Disk hungry: Requires original BF16 safetensors file and stores cached transcoded models (~2x quantized model size)

Supported Models

Qwen3-Coder-Next (most thoroughly tested), Qwen3.5-35B-A3B, Qwen3-235B-A22B, and DeepSeek V2-Lite. Other models coming soon.

Technical Details

Written in Rust + Python (for orchestration)
OpenAI-compatible API (works with Cursor, OpenCode, etc.)
Interactive launcher for configuration
SSPL licensed (free to use, modify, distribute)
GitHub: https://github.com/brontoguana/krasis

The developer is seeking feedback on which models to support next, thoughts on the tradeoffs, and benchmarks from users with 5-series cards and PCIe 5.0.

📖 Read the full source: r/LocalLLaMA