ZSE: Open-source LLM inference engine with 3.9-second cold starts

What ZSE does
ZSE (Z Server Engine) is an open-source LLM inference engine focused on memory efficiency and fast cold starts. It addresses the problem where running a 32B model normally requires ~64GB VRAM, and cold starts with bitsandbytes NF4 take 2+ minutes on first load.
Key performance improvements
ZSE fits 32B models in 19.3GB VRAM (70% reduction vs FP16) and runs on a single A100-40GB. For 7B models, it uses 5.2GB VRAM (63% reduction) and runs on consumer GPUs.
The cold start improvements are significant: 3.9s for 7B models and 21.4s for 32B models with the .zse format, compared to 45s and 120s with bitsandbytes. These benchmarks were verified on Modal A100-80GB in February 2026.
Technical approach
The cold start improvement comes from the .zse format storing pre-quantized weights as memory-mapped safetensors. This eliminates quantization at load time and weight conversion, using just mmap + GPU transfer. On NVMe SSDs, this gets under 4 seconds for 7B models.
Installation and usage
Install with: pip install zllm-zse
Basic server start: zse serve Qwen/Qwen2.5-7B-Instruct
For fast cold starts (one-time conversion):
zse convert Qwen/Qwen2.5-Coder-7B-Instruct -o qwen-7b.zse zse serve qwen-7b.zse # 3.9s every time
Features
- OpenAI-compatible API server (drop-in replacement)
- Interactive CLI (zse serve, zse chat, zse convert, zse hardware)
- Web dashboard with real-time GPU monitoring
- Continuous batching (3.45× throughput)
- GGUF support via llama.cpp CPU fallback — works without a GPU
- Rate limiting, audit logging, API key auth
Architecture components
- zAttention: Custom CUDA kernels for paged, flash, and sparse attention
- zQuantize: Per-tensor INT2-8 mixed precision quantization
- zKV: Quantized KV cache with sliding precision (4x memory savings)
- zStream: Layer streaming with async prefetch (run 70B on 24GB GPU)
- zOrchestrator: Smart recommendations based on FREE memory
Efficiency modes
- speed: Maximum throughput (production with ample GPU memory)
- balanced: Good throughput, moderate memory (standard deployment, default)
- memory: Low memory, reduced throughput (consumer GPUs)
- ultra: Extreme memory savings (4GB GPUs, laptops)
Supported models
Any HuggingFace transformers model, safetensors, GGUF, or .zse format. Popular choices include Qwen, Llama, Mistral, Phi, Gemma, DeepSeek, and Yi.
📖 Read the full source: HN LLM Tools
👀 See Also

Claude Code Plugin Launches DOOM in Terminal While AI Thinks
A developer created a Claude Code plugin that displays DOOM as a tmux popup overlay during AI processing. The plugin uses doom-ascii, a terminal-based DOOM source port, and automatically launches/dismisses with prompts.

MoltMarket: A Marketplace for Hiring AI Agents to Execute Digital Tasks
MoltMarket is a free platform where users can post jobs for AI agents to complete autonomously. The marketplace currently has 100+ users and verified agents that can handle tasks like web scraping, code generation, and content writing.

Local AI Image Critic Tool Uses Ollama Vision Models for Feedback
A developer has created a free desktop application that analyzes AI-generated images locally using Ollama vision models. The tool provides structured feedback reports including improvement suggestions and prompt upgrades.

I ripped out OpenClaw's default markdown memory and built a Node.js/Postgres API layer instead
A developer disabled OpenClaw's memory-core plugin and built a typed Node.js/Express + PostgreSQL backend. Context drift dropped to zero.