Krasis LLM Runtime Shows 8.9x Prefill and 4.7x Decode Speed Improvements Over Llama.cpp

✍️ OpenClawRadar📅 Published: March 17, 2026🔗 Source
Krasis LLM Runtime Shows 8.9x Prefill and 4.7x Decode Speed Improvements Over Llama.cpp
Ad

Performance Benchmarks

Krasis demonstrates significant performance improvements over llama.cpp when running on equivalent hardware. On a single 5090 GPU limited by PCIE 4.0, Krasis shows:

  • 8.9x faster prefill speed
  • 4.7x faster decode speed

Specific benchmark results for Qwen3-Coder-Next show Krasis running on a single 16GB 5080 GPU achieving:

  • 1801 tokens/sec prefill
  • 26.8 tokens/sec decode

This outperforms llama.cpp running on a 32GB 5090 GPU with layer offloading.

Architecture Changes

The latest version of Krasis has dropped the dual-format system and now runs both prefill and decode entirely on GPU with different optimization strategies for each phase. This architectural change results in:

  • Reduced CPU requirements
  • Less dependency on system RAM memory speed
  • Lower overall system RAM usage (now needs only enough for the quantized model plus some overhead, compared to the prior 2.5x model requirement)
Ad

Supported Models and Performance

Current supported models with their performance on a single 5090 GPU (PCIE 4.0) are:

  • Qwen3.5-35B-A3B: 4475 prefill, 109.1 decode
  • Qwen3-Coder-Next: 3560 prefill, 70.3 decode
  • Qwen3.5-122B-A10B: 2897 prefill, 27.7 decode
  • Qwen3-235B-A22B: 2124 prefill, 9.3 decode

Future Development Plans

The developer plans to:

  • Add support for Nvidia Nemotron models, specifically targeting Nemotron Super for consumer GPUs like the 5080
  • Potentially support larger Nemotron models when released
  • Expand IDE and tooling support for Opencode and Aider

Current Features

Krasis currently offers:

  • OpenAI-compatible server
  • Single-line installation
  • Availability on GitHub

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also