Krasis LLM Runtime Shows 8.9x Prefill and 4.7x Decode Speed Improvements Over Llama.cpp

Performance Benchmarks
Krasis demonstrates significant performance improvements over llama.cpp when running on equivalent hardware. On a single 5090 GPU limited by PCIE 4.0, Krasis shows:
- 8.9x faster prefill speed
- 4.7x faster decode speed
Specific benchmark results for Qwen3-Coder-Next show Krasis running on a single 16GB 5080 GPU achieving:
- 1801 tokens/sec prefill
- 26.8 tokens/sec decode
This outperforms llama.cpp running on a 32GB 5090 GPU with layer offloading.
Architecture Changes
The latest version of Krasis has dropped the dual-format system and now runs both prefill and decode entirely on GPU with different optimization strategies for each phase. This architectural change results in:
- Reduced CPU requirements
- Less dependency on system RAM memory speed
- Lower overall system RAM usage (now needs only enough for the quantized model plus some overhead, compared to the prior 2.5x model requirement)
Supported Models and Performance
Current supported models with their performance on a single 5090 GPU (PCIE 4.0) are:
- Qwen3.5-35B-A3B: 4475 prefill, 109.1 decode
- Qwen3-Coder-Next: 3560 prefill, 70.3 decode
- Qwen3.5-122B-A10B: 2897 prefill, 27.7 decode
- Qwen3-235B-A22B: 2124 prefill, 9.3 decode
Future Development Plans
The developer plans to:
- Add support for Nvidia Nemotron models, specifically targeting Nemotron Super for consumer GPUs like the 5080
- Potentially support larger Nemotron models when released
- Expand IDE and tooling support for Opencode and Aider
Current Features
Krasis currently offers:
- OpenAI-compatible server
- Single-line installation
- Availability on GitHub
📖 Read the full source: r/LocalLLaMA
👀 See Also

OpenJet v0.4: Zero-Config Local Coding Agent with llama.cpp Backend
OpenJet v0.4 is an open-source terminal coding agent for local LLMs that auto-detects hardware, configures llama.cpp, and provides a Claude Code-style workflow with no API keys.

OpenTabs: MCP Server with 100+ Plugins for Browser-Based AI Tool Access
OpenTabs is an MCP server and Chrome extension that exposes 100+ plugins with ~2,000 tools by hooking into web apps' internal APIs like Slack, Discord, and GitHub. It works with existing browser sessions, eliminating API keys and OAuth flows.

OpenClaw Skill Reduces Accessibility Tree Tokens from 600K to 1.3K
A developer built an OpenClaw skill that uses ML-based element ranking to prune accessibility trees, cutting slickdeals.com from ~598K tokens to ~1.3K tokens by keeping only the top ~50 actionable elements.

Claude Code Plugin 'nice-figures' Creates Research-Blog Style Matplotlib Plots
nice-figures is a Claude Code plugin that generates matplotlib figures matching Anthropic's soft-pastel research blog style. Includes 16 chart recipes, zero extra dependencies, and automatic styling.