Claude 4.6 Opus Reasoning Distilled to 14GB for Apple Silicon via MLX Quantization

A developer has quantized a Qwen 3.5 27B model distilled from Claude 4.6 Opus reasoning trajectories so that it runs locally on Apple Silicon, cutting its memory footprint from 55.6GB to 14GB while reportedly keeping its reasoning behavior intact.
The Model and Its Origin
The work centers on Qwen 3.5 27B, specifically a version distilled from Claude 4.6 Opus reasoning trajectories. The developer sought a model that could "think" rather than just autocomplete code, describing Opus's signature as "deliberate, analytical, and catches the subtle architectural flaws that other models miss." This distilled version brings that "thinking" scaffold to an open-weight architecture.
The Quantization Process
The original model was 55.6GB in BF16 format, which the developer called a "non-starter" for most local setups because the weights alone would consume the entire memory pool. To address this, they used MLX to quantize the model to 4-bit precision for Apple Silicon. The goal was to preserve high-fidelity Opus-style reasoning while making the model lean enough for daily use in technical planning and complex logic.
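The size reduction can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an estimate only, not the developer's method: the group size of 64 and the 32 bits of per-group scale and bias overhead are assumptions about a typical grouped 4-bit scheme, and the function name is hypothetical.

```python
def quantized_size_gb(bf16_size_gb, bits=4, group_size=64, overhead_bits_per_group=32):
    """Rough size of quantized weights, derived from the BF16 checkpoint size.

    Assumptions (not from the source): every weight is quantized, and each
    group of `group_size` weights carries `overhead_bits_per_group` extra
    bits for its scale and bias.
    """
    params = bf16_size_gb * 1e9 / 2  # BF16 stores 2 bytes per parameter
    effective_bits = bits + overhead_bits_per_group / group_size
    return params * effective_bits / 8 / 1e9


bf16_gb = 55.6  # figure reported in the post
lower = quantized_size_gb(bf16_gb, overhead_bits_per_group=0)  # pure 4-bit, no overhead
upper = quantized_size_gb(bf16_gb)                             # 4.5 effective bits/weight
print(f"estimated range: {lower:.1f}GB to {upper:.1f}GB")
```

Under these assumptions the estimate brackets the reported 14GB from just below to somewhat above; the exact figure depends on which layers are quantized and how the size is measured.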
Results and Performance
- Footprint: Reduced from 55.6GB to 14GB
- Speed: ~16 tokens/second on an M4 Pro
- Reasoning: Maintains the full <think> block, allowing the model to "talk to itself" to verify logic, simulate edge cases, and self-correct before presenting final answers
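Because the model emits its reasoning inside a <think> block, a client will usually want to separate that block from the final answer. A minimal sketch, assuming the output contains at most one well-formed <think>...</think> pair; the function name and the sample string are hypothetical, not taken from the post:

```python
import re

# Non-greedy match so only the first <think>...</think> pair is captured;
# DOTALL lets the reasoning span multiple lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from raw model output.

    If no <think> block is present, reasoning is an empty string and the
    whole text is treated as the answer.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


sample = "<think>Check the empty-list case first.</think>\nUse a guard clause."
reasoning, answer = split_reasoning(sample)
print(reasoning)
print(answer)
```

Keeping the reasoning around rather than discarding it is useful for debugging, since it shows which edge cases the model considered before answering.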
Availability and Requirements
The developer has uploaded the weights to Hugging Face. Running the model requires a Mac with at least 24GB of RAM; on such a machine, high-tier logic and technical planning work can run privately and completely offline.
📖 Read the full source: r/LocalLLaMA