Lightning MLX: Fast Local AI Engine for Agentic Use on Apple Silicon Delivers 220 tok/s on Qwen3.6-35B-A3B

Lightning MLX is a new open-source inference engine for Apple Silicon whose author claims it is the fastest local AI engine. It is optimized specifically for agentic workflows: coding agents, tool calling, and short-turn tasks. The project is available on GitHub at samuelfaj/lightning-mlx.
Benchmark Results
The author tested on a MacBook with an M5 Max chip and 128GB RAM and reported the following token generation speeds:
- Qwen3.6-27B: 40.67 tok/s
- Qwen3.6-35B-A3B: 220.86 tok/s
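The 35B-A3B model generates tokens roughly 5.4x faster than the 27B model despite having more total parameters. This is consistent with the mixture-of-experts architecture of Qwen3.6-35B-A3B, which activates only a subset of its parameters per token (about 3B, going by the A3B suffix), and it suggests the engine handles that architecture efficiently.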
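To put the reported figures in per-token terms, the short sketch below converts each throughput number into average milliseconds per token and computes the ratio between the two models. The two speeds come from the post; the 500-token reply length is an assumption chosen only for illustration.

```python
# Reported generation speeds from the post, in tokens per second.
reported_tok_s = {
    "Qwen3.6-27B": 40.67,
    "Qwen3.6-35B-A3B": 220.86,
}

# Assumption for illustration only: a typical agent reply of 500 tokens.
REPLY_TOKENS = 500


def ms_per_token(tok_s: float) -> float:
    """Convert throughput (tok/s) into average milliseconds per token."""
    return 1000.0 / tok_s


def seconds_for(tokens: int, tok_s: float) -> float:
    """Time to generate `tokens` tokens at a steady `tok_s` rate."""
    return tokens / tok_s


# Ratio of the MoE model's speed to the 27B model's speed.
speedup = reported_tok_s["Qwen3.6-35B-A3B"] / reported_tok_s["Qwen3.6-27B"]

for name, tok_s in reported_tok_s.items():
    print(
        f"{name}: {ms_per_token(tok_s):.2f} ms/token, "
        f"{seconds_for(REPLY_TOKENS, tok_s):.1f} s for {REPLY_TOKENS} tokens"
    )
print(f"speedup: {speedup:.2f}x")
```

This treats the reported speed as constant over the whole reply, which ignores prompt processing time and any slowdown at longer context lengths.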
Key Features
- Optimized for short-turn agentic use cases: code generation, tool calls, and rapid inference loops
- Includes a preset configuration called MTPLX (custom sampling defaults); the author is seeking feedback on whether these defaults make sense for production use
- Open source on GitHub; the license is likely MIT, but this is not confirmed
Feedback Requests
The creator is actively asking the community for:
- Better benchmark designs for local coding agents
- Opinions on the MTPLX preset defaults
- Test results on other Apple Silicon configurations (e.g., M1, M2, M3, M4, different RAM sizes)
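As one possible starting point for the benchmark question, the sketch below times a series of short prompts and reports per-turn throughput, since agentic workloads consist of many short turns and not one long generation. It does not use Lightning MLX's API: `generate` is a hypothetical callable standing in for whatever engine is under test, and `stand_in_generate` is a dummy that only simulates work.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark_turns(
    generate: Callable[[str], List[str]],
    prompts: List[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time each short turn and summarize tokens per second.

    `generate` is a hypothetical callable that takes a prompt and returns
    the generated tokens. The timing is end to end, so it includes prompt
    processing as well as token generation.
    """
    rates = []
    for prompt in prompts:
        start = clock()
        tokens = generate(prompt)
        elapsed = clock() - start
        if elapsed > 0 and tokens:
            rates.append(len(tokens) / elapsed)
    if not rates:
        raise ValueError("no turn produced a measurable result")
    return {
        "turns": len(rates),
        "median_tok_s": statistics.median(rates),
        "min_tok_s": min(rates),
    }


def stand_in_generate(prompt: str) -> List[str]:
    """Dummy engine for illustration: sleeps briefly and echoes the words."""
    time.sleep(0.01)
    return prompt.split()


if __name__ == "__main__":
    # Hypothetical short-turn prompts of the kind a coding agent might send.
    sample_prompts = [
        "rename this variable to snake case",
        "call the search tool for open issues",
        "write a unit test for the parser",
    ]
    print(benchmark_turns(stand_in_generate, sample_prompts))
```

Reporting the median and the slowest turn, and not a single average, makes it easier to spot turns where prompt processing dominates the total time.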
Who It's For
Developers running local LLMs on Apple Silicon for agentic coding workflows who need maximum inference speed.
📖 Read the full source: r/LocalLLaMA