Krasis LLM Runtime Shows 8.9x Prefill and 4.7x Decode Speed Improvements Over Llama.cpp

✍️ OpenClawRadar📅 Published: March 17, 2026🔗 Source

Performance Benchmarks

Krasis demonstrates significant performance improvements over llama.cpp when running on equivalent hardware. On a single 5090 GPU limited by PCIE 4.0, Krasis shows:

8.9x faster prefill speed
4.7x faster decode speed

Specific benchmark results for Qwen3-Coder-Next show Krasis running on a single 16GB 5080 GPU achieving:

1801 tokens/sec prefill
26.8 tokens/sec decode

This outperforms llama.cpp running on a 32GB 5090 GPU with layer offloading.

Architecture Changes

The latest version of Krasis has dropped the dual-format system and now runs both prefill and decode entirely on GPU with different optimization strategies for each phase. This architectural change results in:

Reduced CPU requirements
Less dependency on system RAM memory speed
Lower overall system RAM usage (now needs only enough for the quantized model plus some overhead, compared to the prior 2.5x model requirement)

Supported Models and Performance

Current supported models with their performance on a single 5090 GPU (PCIE 4.0) are:

Qwen3.5-35B-A3B: 4475 prefill, 109.1 decode
Qwen3-Coder-Next: 3560 prefill, 70.3 decode
Qwen3.5-122B-A10B: 2897 prefill, 27.7 decode
Qwen3-235B-A22B: 2124 prefill, 9.3 decode

Future Development Plans

The developer plans to:

Add support for Nvidia Nemotron models, specifically targeting Nemotron Super for consumer GPUs like the 5080
Potentially support larger Nemotron models when released
Expand IDE and tooling support for Opencode and Aider

Current Features

Krasis currently offers:

OpenAI-compatible server
Single-line installation
Availability on GitHub

📖 Read the full source: r/LocalLLaMA

👀 See Also

Tools

OpenJet v0.4: Zero-Config Local Coding Agent with llama.cpp Backend

OpenJet v0.4 is an open-source terminal coding agent for local LLMs that auto-detects hardware, configures llama.cpp, and provides a Claude Code-style workflow with no API keys.

May 2, 2026, 02:15 PM UTC

OpenClawRadar

Tools

OpenTabs: MCP Server with 100+ Plugins for Browser-Based AI Tool Access

OpenTabs is an MCP server and Chrome extension that exposes 100+ plugins with ~2,000 tools by hooking into web apps' internal APIs like Slack, Discord, and GitHub. It works with existing browser sessions, eliminating API keys and OAuth flows.

Mar 14, 2026, 12:45 AM UTC

OpenClawRadar

Tools

OpenClaw Skill Reduces Accessibility Tree Tokens from 600K to 1.3K

A developer built an OpenClaw skill that uses ML-based element ranking to prune accessibility trees, cutting slickdeals.com from ~598K tokens to ~1.3K tokens by keeping only the top ~50 actionable elements.

Feb 27, 2026, 07:45 AM UTC

OpenClawRadar

Tools

Claude Code Plugin 'nice-figures' Creates Research-Blog Style Matplotlib Plots

nice-figures is a Claude Code plugin that generates matplotlib figures matching Anthropic's soft-pastel research blog style. Includes 16 chart recipes, zero extra dependencies, and automatic styling.

Jun 6, 2026, 12:20 PM UTC

OpenClawRadar