Bodega Inference Engine: Optimizing LLM Inference for Apple Silicon's Unified Memory

Bodega is an inference engine designed specifically for Apple Silicon's unified memory architecture. It was built over 2.5 years on top of MLX, with optimizations close to the Metal layer, and it targets the throughput ceiling developers hit when running LLMs on Mac hardware.
Why Apple Silicon Requires Different Optimization
Apple Silicon uses unified memory: the CPU, GPU, and Neural Engine share one physical memory pool over a single on-chip bus. This differs fundamentally from discrete GPUs such as NVIDIA's, which have separate VRAM and system RAM pools connected by PCIe. Memory bandwidth ranges from ~400 GB/s on the M1 Max to ~800 GB/s on the M3 Ultra. The Ultra fuses two dies, and the cross-die penalty limits its actual throughput to about 1.6-1.8x that of a single die.
Key architectural implications:
- Decode is memory-bandwidth-bound - each generated token requires reading the model weights from memory over the shared bus
- Prefill is compute-bound - dominated by GPU TFLOPS for matrix-matrix multiplication
- The memory bus is shared with everything - KV cache, model weights, OS, and applications all compete for the same 400-800 GB/s bandwidth
According to the developer, this architecture makes direct ports of vLLM's or llama.cpp's batching implementations ineffective on MLX, because they were designed for different memory architectures.
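To make the "decode is memory-bandwidth-bound" point concrete, here is a rough back-of-the-envelope sketch. It assumes every decoded token streams the full set of weights from memory once and ignores KV cache traffic and compute time, so the result is an upper bound rather than a measurement. The 8B-parameter, 4-bit model is a hypothetical example, not a configuration from the source.

```python
def decode_ceiling_tokens_per_second(params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on single-sequence decode speed.

    Assumption: each token requires reading all model weights once,
    so speed cannot exceed bandwidth / model size.
    """
    model_bytes = params_billion * 1e9 * bytes_per_param
    return (bandwidth_gb_s * 1e9) / model_bytes


# Hypothetical model: 8B parameters quantized to 4 bits (0.5 bytes per parameter).
for chip, bandwidth in [("M1 Max", 400), ("M3 Ultra", 800)]:
    ceiling = decode_ceiling_tokens_per_second(8, 0.5, bandwidth)
    print(f"{chip}: at most ~{ceiling:.0f} tokens/s for one sequence")
```

Real numbers land below this ceiling, since the bus is also serving the KV cache, the OS, and other applications.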
What Bodega Builds
The developer studied vLLM's core internals including continuous batching, speculative decoding, chunked prefill, and prefix caching, then rebuilt every component for MLX and Apple's unified memory model.
The core insight for continuous batching: generating a single token for a single sequence loads the full model weights to perform one matrix-vector multiply, which wastes most of the available compute even on hardware with 400+ GB/s of bandwidth. The solution runs multiple sequences simultaneously, multiplying the weights by a matrix of vectors (one per sequence) instead of by a single vector, so each weight load is shared across every sequence in the batch.
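The scheduling side of continuous batching can be sketched in a few lines. This is a toy simulation, not Bodega's implementation: `Request`, `fake_decode_step`, and `run_continuous_batching` are hypothetical names, and the decode step is a stand-in for one batched forward pass in which the weights are read once for all active sequences.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    name: str
    tokens_needed: int                      # tokens this request still has to generate
    output: list = field(default_factory=list)


def fake_decode_step(batch):
    """Stand-in for one batched forward pass: one new token per active sequence."""
    for req in batch:
        req.output.append(f"{req.name}-t{len(req.output)}")
        req.tokens_needed -= 1


def run_continuous_batching(requests, max_batch_size):
    """Return (number of decode steps, requests in completion order)."""
    waiting = deque(requests)
    running, finished = [], []
    steps = 0
    while waiting or running:
        # Fill free slots immediately instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        fake_decode_step(running)
        steps += 1
        finished.extend(r for r in running if r.tokens_needed <= 0)
        running = [r for r in running if r.tokens_needed > 0]
    return steps, finished


jobs = [Request("A", 3), Request("B", 1), Request("C", 2), Request("D", 2)]
steps, done = run_continuous_batching(jobs, max_batch_size=2)
print(steps, [r.name for r in done])
```

With a batch size of 1 the same workload takes one step per token, so the number of weight loads equals the total token count. With a batch size of 2, finished sequences are replaced on the next step and the same tokens are produced in fewer weight loads.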
KV cache management was redesigned for unified memory, where evicting cache blocks has different cost implications than on systems with isolated VRAM.
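Because the KV cache shares one pool with the weights and the OS, it helps to estimate its footprint before choosing a batch size. The sketch below uses the standard size formula for a transformer KV cache. It says nothing about how Bodega lays out or evicts blocks, and the model shape and memory figures are hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Size of the KV cache in bytes (default: 16-bit values).

    The leading factor of 2 accounts for storing both keys and values per layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def fits_in_unified_memory(total_gib, weights_gib, kv_gib, reserved_gib):
    """True if weights, KV cache, and an OS/app reserve fit in one shared pool."""
    return weights_gib + kv_gib + reserved_gib <= total_gib


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
kv_gib = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=8) / 1024**3
print(f"KV cache: {kv_gib:.1f} GiB")

# Hypothetical budget: 64 GiB machine, 40 GiB of weights, 12 GiB kept for the OS and apps.
print(fits_in_unified_memory(64, weights_gib=40, kv_gib=kv_gib, reserved_gib=12))
```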
Practical Implications
The developer reports testing on multiple Apple Silicon configurations, including two M3 Ultras (256GB and 512GB), an M4 Max 128GB, and an M1 Max 64GB. The common ceiling identified across them is single-user throughput: with one request processed at a time, the GPU sits mostly idle.
The repository includes benchmarks for verifying these results, with setup handled by a simple curl script.
📖 Read the full source: r/LocalLLaMA