ZSE: Open-Source LLM Inference Engine With 3.9s Cold Starts

What ZSE does

ZSE (Z Server Engine) is an open-source LLM inference engine focused on memory efficiency and fast cold starts. It addresses the problem where running a 32B model normally requires ~64GB VRAM, and cold starts with bitsandbytes NF4 take 2+ minutes on first load.

Key performance improvements

ZSE fits 32B models in 19.3GB VRAM (70% reduction vs FP16) and runs on a single A100-40GB. For 7B models, it uses 5.2GB VRAM (63% reduction) and runs on consumer GPUs.

The cold start improvements are significant: 3.9s for 7B models and 21.4s for 32B models with the .zse format, compared to 45s and 120s with bitsandbytes. These benchmarks were verified on Modal A100-80GB in February 2026.

Technical approach

The cold start improvement comes from the .zse format storing pre-quantized weights as memory-mapped safetensors. This eliminates quantization at load time and weight conversion, using just mmap + GPU transfer. On NVMe SSDs, this gets under 4 seconds for 7B models.

Installation and usage

Install with: pip install zllm-zse

Basic server start: zse serve Qwen/Qwen2.5-7B-Instruct

For fast cold starts (one-time conversion):

zse convert Qwen/Qwen2.5-Coder-7B-Instruct -o qwen-7b.zse
zse serve qwen-7b.zse  # 3.9s every time

Features

OpenAI-compatible API server (drop-in replacement)
Interactive CLI (zse serve, zse chat, zse convert, zse hardware)
Web dashboard with real-time GPU monitoring
Continuous batching (3.45× throughput)
GGUF support via llama.cpp CPU fallback — works without a GPU
Rate limiting, audit logging, API key auth

Architecture components

zAttention: Custom CUDA kernels for paged, flash, and sparse attention
zQuantize: Per-tensor INT2-8 mixed precision quantization
zKV: Quantized KV cache with sliding precision (4x memory savings)
zStream: Layer streaming with async prefetch (run 70B on 24GB GPU)
zOrchestrator: Smart recommendations based on FREE memory

Efficiency modes

speed: Maximum throughput (production with ample GPU memory)
balanced: Good throughput, moderate memory (standard deployment, default)
memory: Low memory, reduced throughput (consumer GPUs)
ultra: Extreme memory savings (4GB GPUs, laptops)

Supported models

Any HuggingFace transformers model, safetensors, GGUF, or .zse format. Popular choices include Qwen, Llama, Mistral, Phi, Gemma, DeepSeek, and Yi.

📖 Read the full source: HN LLM Tools

ZSE: Open-source LLM inference engine with 3.9-second cold starts

What ZSE does

Key performance improvements

Technical approach

Installation and usage

Features

Architecture components

Efficiency modes

Supported models

👀 See Also

Claude Code Plugin Launches DOOM in Terminal While AI Thinks

MoltMarket: A Marketplace for Hiring AI Agents to Execute Digital Tasks

Local AI Image Critic Tool Uses Ollama Vision Models for Feedback

I ripped out OpenClaw's default markdown memory and built a Node.js/Postgres API layer instead