Bodega Inference Engine: Optimizing LLM Inference for Apple Silicon's Unified Memory

✍️ OpenClawRadar · 📅 Published: March 19, 2026 · 🔗 Source

Bodega is an inference engine designed specifically for Apple Silicon's unified memory architecture. It was built over 2.5 years on MLX, with optimizations close to the Metal layer, and it targets the fundamental throughput limits developers hit when running LLMs on Mac hardware.

Why Apple Silicon Requires Different Optimization

Apple Silicon uses unified memory: the CPU, GPU, and Neural Engine share one physical pool over a single on-chip bus. This differs fundamentally from discrete GPUs such as NVIDIA's, which have separate VRAM and system RAM pools connected by PCIe. Memory bandwidth ranges from ~400 GB/s on the M1 Max to ~800 GB/s on the M3 Ultra, although the Ultra's cross-die penalty reduces actual throughput to 1.6-1.8x single-die performance rather than a full 2x.

Key architectural implications:

  • Decode is memory-bandwidth-bound: each generated token requires reading the model weights over the shared bus
  • Prefill is compute-bound: it is dominated by GPU TFLOPS for matrix-matrix multiplication
  • The memory bus is shared with everything: KV cache, model weights, OS, and applications all compete for the same 400-800 GB/s of bandwidth
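
The first point can be turned into a rough ceiling: if every decoded token has to read all weights once, a single sequence cannot generate faster than bandwidth divided by weight size. The sketch below is only that back-of-the-envelope arithmetic. The function name and the 8-billion-parameter, 4-bit model are illustrative assumptions, not figures from the Bodega repository, and the calculation ignores KV cache reads and the cross-die penalty.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on single-sequence decode speed.

    Assumes each generated token reads every weight exactly once and that
    nothing else (KV cache, OS, other apps) uses the memory bus.
    """
    return bandwidth_gb_s / weight_gb


# Hypothetical model: 8e9 parameters quantized to 4 bits (0.5 bytes each) = 4 GB.
weight_gb = 8e9 * 0.5 / 1e9

# Nominal bandwidth figures quoted in the article.
for name, bandwidth in [("M1 Max", 400.0), ("M3 Ultra", 800.0)]:
    ceiling = decode_ceiling_tokens_per_s(bandwidth, weight_gb)
    print(f"{name}: at most {ceiling:.0f} tokens/s for one sequence")
```

Real single-sequence numbers land below this ceiling, since the bus is also serving the KV cache and the rest of the system.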

According to the developer, this is why direct ports of the batching implementations in vLLM or llama.cpp are ineffective on MLX: they were designed around different memory architectures.


What Bodega Builds

The developer studied vLLM's core internals, including continuous batching, speculative decoding, chunked prefill, and prefix caching, then rebuilt each component for MLX and Apple's unified memory model.

The core insight for continuous batching: generating one token for a single sequence reads the full model weights for just one matrix-vector multiply, so even at 400+ GB/s the entire read yields only one token. The solution is to run multiple sequences simultaneously, replacing weights × single vector with weights × matrix of vectors, so that one pass over the weights advances every sequence in the batch.
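
To make the saving concrete, here is a toy scheduler simulation that counts how many full passes over the weights are needed to finish a set of requests. It is a minimal sketch of the general continuous-batching idea, not Bodega's scheduler: the function names, the request lengths, and the `max_batch` limit are all hypothetical, and it assumes every request generates at least one token.

```python
from collections import deque


def weight_passes_sequential(lengths):
    """One request at a time: each generated token costs one full weight read."""
    return sum(lengths)


def weight_passes_continuous(lengths, max_batch):
    """Continuous batching: each step reads the weights once and advances every
    active sequence by one token. Finished sequences leave right away and
    waiting requests take the freed slots before the next step."""
    waiting = deque(lengths)  # tokens still to generate, per queued request
    active = []               # tokens still to generate, per running sequence
    passes = 0
    while waiting or active:
        # Fill free slots from the queue before running the step.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        passes += 1  # one pass over the weights serves the whole batch
        active = [left - 1 for left in active if left - 1 > 0]
    return passes


requests = [5, 3, 8, 2]  # hypothetical output lengths in tokens
print("sequential passes:", weight_passes_sequential(requests))
print("continuous passes (batch of 2):", weight_passes_continuous(requests, 2))
```

With these example lengths the sequential loop needs 18 weight passes and the batched loop needs 11. A batched pass is not free, since it does more arithmetic per weight read, but while decode remains bandwidth-bound that extra compute uses GPU capacity that would otherwise sit idle.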

KV cache management was redesigned for unified memory, where evicting cache blocks carries different costs than on systems with isolated VRAM.
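
Because the cache lives in the same pool as the weights, its size matters as much as its eviction policy. The snippet below shows only the standard sizing arithmetic for a transformer KV cache (one key and one value vector per layer, per KV head, per token). It says nothing about how Bodega lays out or evicts blocks, and the model dimensions are hypothetical values chosen for illustration.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache added by one token: a key and a value vector per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Total KV cache size for `tokens` cached tokens, summed across all sequences."""
    return tokens * kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem)


# Hypothetical model shape: 32 layers, 8 KV heads of dimension 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(32, 8, 128, 2)
total = kv_cache_bytes(8192, 32, 8, 128, 2)
print(f"{per_token // 1024} KiB per token, {total / 2**30:.1f} GiB for 8192 cached tokens")
```

Every sequence added to a batch grows this footprint, so batch size and cache capacity have to be budgeted against the same memory that holds the weights.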

Practical Implications

The developer reports testing on several Apple Silicon configurations: two M3 Ultras (256GB and 512GB), an M4 Max 128GB, and an M1 Max 64GB. The common ceiling identified across them is single-user throughput: with only one request served at a time, the GPU sits mostly idle.

The repository includes benchmarks that readers can verify themselves; setup is handled by a simple curl script.

📖 Read the full source: r/LocalLLaMA
