Hypura: Storage-tier-aware LLM inference scheduler for Apple Silicon

✍️ OpenClawRadar📅 Published: March 24, 2026🔗 Source
Hypura: Storage-tier-aware LLM inference scheduler for Apple Silicon
Ad

What Hypura does

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon that places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities. This enables models that exceed physical memory to run without crashing the system.

Key features and how it works

Hypura reads GGUF files, profiles your hardware (GPU working set, RAM, NVMe bandwidth), and solves a placement optimization that assigns every tensor to a tier:

  • GPU (Metal) — Attention layers, norms, embeddings
  • RAM — Overflow layers that don't fit in the GPU working set, accessed via mmap
  • NVMe — Remaining layers loaded on-demand via direct I/O (F_NOCACHE + pread), prefetched ahead of the forward pass

For MoE models like Mixtral, Hypura implements expert-streaming: only non-expert tensors (~1 GB) stay on GPU, while expert tensors stream from NVMe through a pool buffer on demand. It includes a neuron cache with 99.5% hit rate that eliminates most I/O after warmup, router interception to identify selected experts, and co-activation tracking to predict which experts will fire next for speculative prefetch.

For dense models like Llama 70B, it uses dense FFN-streaming: attention + norms stay on GPU (~8 GB) while FFN tensors (~32 GB) stream from NVMe through a dynamically-sized pool buffer with scaled prefetch lookahead.

Ad

Performance benchmarks

All benchmarks on M1 Max, 32 GB unified memory, ~5.1 GB/s NVMe sequential read:

  • Qwen 2.5 14B Q4_K_M (8.4 GB): Full-resident mode, 21 tok/s (same as llama.cpp)
  • Mixtral 8x7B Q5_K_M (30.9 GB): Expert-streaming mode, 2.2 tok/s (llama.cpp OOM)
  • Llama 3.3 70B Q4_K_M (39.6 GB): Dense-FFN-streaming mode, 0.3 tok/s (llama.cpp OOM)

Pool buffer size, prefetch depth, and memory budgets are computed automatically from your hardware profile — no manual tuning needed.

Installation

Hypura builds from source with Cargo. You'll need Rust 1.75+ and CMake.

📖 Read the full source: HN AI Agents

Ad

👀 See Also