Hypura: Storage-tier-aware LLM inference scheduler for Apple Silicon

What Hypura does
Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon that places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities. This enables models that exceed physical memory to run without crashing the system.
Key features and how it works
Hypura reads GGUF files, profiles your hardware (GPU working set, RAM, NVMe bandwidth), and solves a placement optimization problem that assigns every tensor to a tier:
- GPU (Metal): Attention layers, norms, embeddings
- RAM: Overflow layers that don't fit in the GPU working set, accessed via mmap
- NVMe: Remaining layers loaded on demand via direct I/O (F_NOCACHE + pread), prefetched ahead of the forward pass
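To make the tiering idea concrete, here is a minimal sketch of a greedy placement pass. It is not Hypura's solver, which the source describes only as a placement optimization. The `Tensor` class, the `priority` field, and the byte budgets are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    size_bytes: int
    priority: int  # hypothetical: lower means it should sit in a faster tier


def place_tensors(tensors, gpu_budget, ram_budget):
    """Greedy stand-in for a placement solver: fill GPU, then RAM, then NVMe."""
    placement = {}
    gpu_free, ram_free = gpu_budget, ram_budget
    # Visit the most latency-sensitive tensors first, smallest first within a class.
    for t in sorted(tensors, key=lambda t: (t.priority, t.size_bytes)):
        if t.size_bytes <= gpu_free:
            placement[t.name] = "gpu"
            gpu_free -= t.size_bytes
        elif t.size_bytes <= ram_free:
            placement[t.name] = "ram"
            ram_free -= t.size_bytes
        else:
            placement[t.name] = "nvme"  # streamed from disk on demand
    return placement
```

A real solver would also weigh bandwidth costs and access patterns, as the description above says; the greedy pass only shows how a fixed budget per tier turns into a per-tensor assignment.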
For MoE models like Mixtral, Hypura implements expert-streaming: only non-expert tensors (~1 GB) stay on GPU, while expert tensors stream from NVMe through a pool buffer on demand. It includes a neuron cache with a 99.5% hit rate that eliminates most I/O after warmup, router interception to identify the selected experts, and co-activation tracking that predicts which experts will fire next so they can be prefetched speculatively.
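The caching and prediction pieces can be sketched in a few lines. The following is an illustrative model only, assuming an LRU eviction policy and assuming that co-activation is tracked between consecutive layers; the source does not specify either. `ExpertCache`, `CoActivationTracker`, and `load_fn` are hypothetical names.

```python
from collections import Counter, OrderedDict


class ExpertCache:
    """LRU cache of expert weights keyed by (layer, expert_id)."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn  # hypothetical loader, e.g. a read from NVMe
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = self.load_fn(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return value


class CoActivationTracker:
    """Counts which experts were selected right after a given expert."""

    def __init__(self):
        self.counts = {}

    def record(self, layer, selected, next_selected):
        for expert in selected:
            self.counts.setdefault((layer, expert), Counter()).update(next_selected)

    def predict(self, layer, selected, k=2):
        votes = Counter()
        for expert in selected:
            votes.update(self.counts.get((layer, expert), Counter()))
        return [expert for expert, _ in votes.most_common(k)]
```

In this sketch, the router's selection for one layer would be passed to `predict`, and the returned expert ids would be requested from the cache before the next layer runs, so that a correct guess turns a blocking read into a cache hit.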
For dense models like Llama 70B, it uses dense FFN-streaming: attention + norms stay on GPU (~8 GB) while FFN tensors (~32 GB) stream from NVMe through a dynamically-sized pool buffer with scaled prefetch lookahead.
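As a rough illustration of streaming through a bounded pool with read-ahead, the sketch below uses a background thread that reads layer extents with `os.pread` into a fixed-size queue. It is a simplification under stated assumptions: the lookahead is a fixed number rather than scaled, the pool is not dynamically sized, and the `(offset, length)` extent layout and the function names are hypothetical. `F_NOCACHE` is applied only where Python's `fcntl` module exposes it (macOS).

```python
import fcntl
import os
import queue
import threading


def open_uncached(path):
    """Open a file for reading and, where supported, bypass the page cache."""
    fd = os.open(path, os.O_RDONLY)
    nocache = getattr(fcntl, "F_NOCACHE", None)  # macOS-only constant
    if nocache is not None:
        fcntl.fcntl(fd, nocache, 1)
    return fd


def stream_layers(path, extents, lookahead=2):
    """Yield each layer's bytes in order while a worker thread reads ahead.

    extents: list of (offset, length) pairs, one per layer (hypothetical layout).
    lookahead: how many layers may wait in the pool ahead of the consumer.
    """
    pool = queue.Queue(maxsize=lookahead)
    fd = open_uncached(path)

    def worker():
        try:
            for offset, length in extents:
                # put() blocks when the pool is full, which bounds memory use.
                # A production reader would also loop on short reads.
                pool.put(os.pread(fd, length, offset))
        finally:
            pool.put(None)  # sentinel: no more layers

    threading.Thread(target=worker, daemon=True).start()
    try:
        while (chunk := pool.get()) is not None:
            yield chunk
    finally:
        os.close(fd)
```

The bounded queue is the point of the design: compute on layer N overlaps with the read of layer N+1, while memory held by in-flight data never exceeds `lookahead` layers.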
Performance benchmarks
All benchmarks were run on an M1 Max with 32 GB of unified memory and ~5.1 GB/s NVMe sequential read:
- Qwen 2.5 14B Q4_K_M (8.4 GB): Full-resident mode, 21 tok/s (same as llama.cpp)
- Mixtral 8x7B Q5_K_M (30.9 GB): Expert-streaming mode, 2.2 tok/s (llama.cpp OOM)
- Llama 3.3 70B Q4_K_M (39.6 GB): Dense-FFN-streaming mode, 0.3 tok/s (llama.cpp OOM)
Pool buffer size, prefetch depth, and memory budgets are computed automatically from your hardware profile — no manual tuning needed.
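The benchmark figures allow a back-of-the-envelope check. Assuming NVMe reads were the only cost and ran at the quoted sequential rate, the throughput numbers give an upper bound on how much data can be read from disk per generated token. The helper below is an illustrative calculation, not something taken from Hypura.

```python
def max_gb_per_token(tokens_per_s, read_bandwidth_gb_s):
    """Upper bound on GB read from storage per token if I/O were the only cost."""
    seconds_per_token = 1.0 / tokens_per_s
    return seconds_per_token * read_bandwidth_gb_s


NVME_GB_S = 5.1  # sequential read rate quoted for the benchmark machine

for name, tok_s in [("Mixtral 8x7B", 2.2), ("Llama 3.3 70B", 0.3)]:
    bound = max_gb_per_token(tok_s, NVME_GB_S)
    print(f"{name}: at most {bound:.1f} GB read per token")
```

This works out to about 2.3 GB per token for Mixtral and 17.0 GB per token for Llama 3.3 70B. Both bounds are well below the size of the streamed tensors, which is consistent with a large share of accesses being served from memory rather than re-read from NVMe on every token, although the source gives no breakdown.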
Installation
Hypura builds from source with Cargo. You'll need Rust 1.75+ and CMake.
📖 Read the full source: HN AI Agents