Hypura: Storage-Tier-Aware LLM Inference on Apple Silicon

What Hypura does

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon. It places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities, so models larger than physical memory can run without crashing the system.

Key features and how it works

Hypura reads GGUF files, profiles your hardware (GPU working set, RAM, NVMe bandwidth), and solves a placement optimization problem that assigns every tensor to one of three tiers:

GPU (Metal): attention layers, norms, embeddings
RAM: overflow layers that don't fit in the GPU working set, accessed via mmap
NVMe: remaining layers, loaded on demand via direct I/O (F_NOCACHE + pread) and prefetched ahead of the forward pass
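
The tier assignment above can be illustrated with a small greedy sketch. This is not Hypura's actual solver, which weighs access patterns and bandwidth costs; it only shows the fill-GPU-then-RAM-then-NVMe idea. The Tensor class, the priority field, and place_tensors are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    size_bytes: int
    priority: int  # hypothetical ranking: lower means more latency-sensitive


def place_tensors(tensors, gpu_budget, ram_budget):
    """Greedy placement: try GPU first, then RAM, and spill the rest to NVMe."""
    free = {"gpu": gpu_budget, "ram": ram_budget}
    placement = {}
    # Most latency-sensitive tensors get first pick of the fast tiers.
    for t in sorted(tensors, key=lambda t: t.priority):
        for tier in ("gpu", "ram"):
            if t.size_bytes <= free[tier]:
                free[tier] -= t.size_bytes
                placement[t.name] = tier
                break
        else:
            # Neither fast tier has room, so the tensor is streamed from disk.
            placement[t.name] = "nvme"
    return placement


if __name__ == "__main__":
    model = [
        Tensor("attn.0", 4, 0),
        Tensor("norm.0", 1, 0),
        Tensor("ffn.0", 6, 1),
        Tensor("ffn.1", 6, 1),
    ]
    print(place_tensors(model, gpu_budget=6, ram_budget=8))
```

A real solver would likely treat this as a cost minimization over per-tier bandwidth rather than a single priority number; the greedy pass is just the simplest thing that respects the budgets.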

For MoE models like Mixtral, Hypura implements expert-streaming: only the non-expert tensors (~1 GB) stay on the GPU, while expert tensors stream from NVMe through a pool buffer on demand. Three mechanisms keep this fast: router interception identifies which experts were selected, a neuron cache with a 99.5% hit rate eliminates most I/O after warmup, and co-activation tracking predicts which experts will fire next so they can be prefetched speculatively.
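
To make the caching and co-activation ideas concrete, here is a minimal sketch of an LRU expert cache that also counts which experts are selected together and prefetches the most frequent partner. It is a simplification under stated assumptions: ExpertCache, load_fn, record, predict, and prefetch are hypothetical names, and co-activation is counted within a single token's routing decision rather than however Hypura actually tracks it.

```python
from collections import OrderedDict, defaultdict
from itertools import permutations


class ExpertCache:
    """LRU cache of expert weights plus co-activation counts (illustrative)."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn      # hypothetical loader, e.g. a read from NVMe
        self.cache = OrderedDict()  # expert_id -> weights, least recent first
        self.coact = defaultdict(lambda: defaultdict(int))
        self.hits = 0
        self.misses = 0

    def _insert(self, expert_id):
        self.cache[expert_id] = self.load_fn(expert_id)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used

    def get(self, expert_id):
        """Return expert weights, loading them on a miss."""
        if expert_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(expert_id)
        else:
            self.misses += 1
            self._insert(expert_id)
        return self.cache[expert_id]

    def record(self, selected):
        """Record which experts the router picked together for one token."""
        for a, b in permutations(selected, 2):
            self.coact[a][b] += 1

    def predict(self, expert_id, k=1):
        """Return the k experts most often co-activated with expert_id."""
        peers = self.coact[expert_id]
        return sorted(peers, key=peers.get, reverse=True)[:k]

    def prefetch(self, expert_id, k=1):
        """Speculatively load predicted partners without touching hit counts."""
        for peer in self.predict(expert_id, k):
            if peer not in self.cache:
                self._insert(peer)
```

In use, the router's selection would be passed to record() after each token, and prefetch() would be called as soon as the first selected expert is known, so that a likely partner is already resident when get() asks for it.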

For dense models like Llama 70B, Hypura uses dense FFN-streaming: attention and norm tensors stay on the GPU (~8 GB), while FFN tensors (~32 GB) stream from NVMe through a dynamically sized pool buffer with scaled prefetch lookahead.
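
The streaming pattern described here (uncached reads with pread, issued a few layers ahead of the consumer) can be sketched as follows. This is an illustration rather than Hypura's implementation: open_uncached, stream_layers, the extents layout, and the fixed lookahead value are all assumptions. F_NOCACHE is only defined on macOS, so the sketch falls back to ordinary cached reads elsewhere.

```python
import os
from collections import deque
from concurrent.futures import ThreadPoolExecutor

try:
    import fcntl
except ImportError:  # platforms without fcntl
    fcntl = None


def open_uncached(path):
    """Open a file read-only and, on macOS, ask the kernel to skip the page cache."""
    fd = os.open(path, os.O_RDONLY)
    nocache = getattr(fcntl, "F_NOCACHE", None)  # defined on macOS only
    if nocache is not None:
        fcntl.fcntl(fd, nocache, 1)
    return fd


def stream_layers(fd, extents, lookahead=2):
    """Yield each layer's bytes in order while reading `lookahead` layers ahead.

    extents: list of (offset, length) pairs, one per layer (hypothetical layout).
    """
    with ThreadPoolExecutor(max_workers=max(1, lookahead)) as pool:
        pending = deque()
        next_idx = 0
        for _ in extents:
            # Keep the current read plus `lookahead` future reads in flight.
            while next_idx < len(extents) and len(pending) <= lookahead:
                offset, length = extents[next_idx]
                pending.append(pool.submit(os.pread, fd, length, offset))
                next_idx += 1
            # Block only on the oldest read; later ones continue in the background.
            yield pending.popleft().result()
```

The number of in-flight reads bounds the memory the stream needs, which is roughly the role a pool buffer plays: a deeper lookahead hides more I/O latency at the cost of a larger buffer.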

Performance benchmarks

All benchmarks were run on an M1 Max with 32 GB of unified memory and ~5.1 GB/s NVMe sequential read:

Qwen 2.5 14B Q4_K_M (8.4 GB): full-resident mode, 21 tok/s (same as llama.cpp)
Mixtral 8x7B Q5_K_M (30.9 GB): expert-streaming mode, 2.2 tok/s (llama.cpp runs out of memory)
Llama 3.3 70B Q4_K_M (39.6 GB): dense FFN-streaming mode, 0.3 tok/s (llama.cpp runs out of memory)
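
As a rough sanity check on the Llama 70B figure, the numbers quoted in this document can be combined into a back-of-the-envelope bandwidth bound. The calculation below assumes that a dense model touches all ~32 GB of FFN tensors for every token and that the ~5.1 GB/s sequential read rate is an upper limit on NVMe throughput; both are simplifications, and the function names are hypothetical.

```python
def max_tokens_per_second(streamed_gb_per_token, nvme_gb_per_s):
    """Upper bound on tok/s if every token must read this much from NVMe."""
    return nvme_gb_per_s / streamed_gb_per_token


def max_streamed_gb_per_token(tokens_per_s, nvme_gb_per_s):
    """Most data NVMe can deliver per token at a given generation rate."""
    return nvme_gb_per_s / tokens_per_s


NVME_GB_PER_S = 5.1    # sequential read rate quoted for the M1 Max
FFN_GB = 32.0          # approximate FFN tensor size for Llama 70B
MEASURED_TOK_S = 0.3   # reported dense FFN-streaming throughput

ceiling = max_tokens_per_second(FFN_GB, NVME_GB_PER_S)
streamed = max_streamed_gb_per_token(MEASURED_TOK_S, NVME_GB_PER_S)
resident = FFN_GB - streamed

print(f"tok/s if all FFN data came from NVMe: {ceiling:.2f}")   # 0.16
print(f"max GB readable per token at 0.3 tok/s: {streamed:.1f}")  # 17.0
print(f"min GB served from memory per token: {resident:.1f}")     # 15.0
```

Under these assumptions, the measured 0.3 tok/s is above what pure NVMe streaming of all 32 GB would allow, which suggests that a substantial share of the FFN tensors is served from memory rather than re-read from disk on every token.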

Pool buffer size, prefetch depth, and memory budgets are computed automatically from your hardware profile, so no manual tuning is needed.