Reverse Engineering Apple Neural Engine for Training MicroGPT Models

Direct Access to Apple's Neural Engine
A developer has bypassed Apple's CoreML framework to access the Apple Neural Engine (ANE) directly on an M4 Mac mini and built a custom training pipeline for small language models. The project reverse engineered the ANE's private APIs with help from Claude, then benchmarked the hardware and implemented training on top of those APIs instead of Apple's recommended CoreML interface.
Technical Specifications and Performance
The ANE on the M4 chip is claimed to provide 38 TFLOPS of INT8 compute. The developer notes that it is actually an FP16 processor, so the effective compute is half that amount, about 19 TFLOPS. At peak compute the ANE draws only 2.8 W, which the developer reports as 6.6 TFLOPS/watt. For comparison, a Metal GPU achieves approximately 1 TFLOPS/watt, while NVIDIA's H100 reaches 1.4 TFLOPS/watt.
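The arithmetic behind these figures can be checked with a short sketch. The constants are the numbers quoted above; the variable and function names are illustrative and not taken from the project. Note that halving the claimed figure and dividing by 2.8 W gives slightly more than the reported 6.6 TFLOPS/watt, so the reported value presumably reflects measured rather than theoretical throughput (an assumption, since the summary does not say).

```python
# Figures quoted in the article; names are illustrative, not from the project.
CLAIMED_INT8_TFLOPS = 38.0      # headline ANE figure for the M4
PEAK_POWER_WATTS = 2.8          # reported power draw at peak compute
REPORTED_ANE_EFFICIENCY = 6.6   # TFLOPS/watt, as reported by the developer
METAL_GPU_EFFICIENCY = 1.0      # TFLOPS/watt, approximate, as quoted
H100_EFFICIENCY = 1.4           # TFLOPS/watt, as quoted


def tflops_per_watt(tflops: float, watts: float) -> float:
    """Compute efficiency as throughput divided by power draw."""
    return tflops / watts


# The ANE is described as an FP16 processor, so effective compute is half the claim.
effective_fp16_tflops = CLAIMED_INT8_TFLOPS / 2.0

# Efficiency if the halved figure were fully achieved at 2.8 W.
theoretical_efficiency = tflops_per_watt(effective_fp16_tflops, PEAK_POWER_WATTS)

# Throughput implied by the reported 6.6 TFLOPS/watt at 2.8 W.
implied_tflops = REPORTED_ANE_EFFICIENCY * PEAK_POWER_WATTS

# How the reported ANE efficiency compares with the other quoted figures.
ratio_vs_metal = REPORTED_ANE_EFFICIENCY / METAL_GPU_EFFICIENCY
ratio_vs_h100 = REPORTED_ANE_EFFICIENCY / H100_EFFICIENCY

print(f"Effective FP16 compute: {effective_fp16_tflops:.1f} TFLOPS")
print(f"Theoretical efficiency: {theoretical_efficiency:.2f} TFLOPS/watt")
print(f"Throughput implied by 6.6 TFLOPS/watt: {implied_tflops:.2f} TFLOPS")
print(f"ANE vs Metal GPU: {ratio_vs_metal:.1f}x, ANE vs H100: {ratio_vs_h100:.1f}x")
```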
Training Implementation
The developer built a bespoke training pipeline and used it to train a 110M-parameter MicroGPT model on the ANE. A single chip cannot practically train much larger models, but the developer suggests that a cluster of ANE devices could in theory do so. According to the developer, even a single device should be able to handle LoRA training for 3B or 7B parameter models.
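To see why LoRA is a plausible fit for a small accelerator, it helps to count trainable parameters. LoRA freezes each weight matrix of shape d_out x d_in and trains two low-rank factors of shapes d_out x r and r x d_in, which adds r * (d_in + d_out) trainable parameters per adapted matrix. The sketch below applies that formula to a hypothetical configuration (hidden size 4096, 32 layers, four adapted projections per layer, rank 8). These dimensions are assumptions chosen for illustration and are not taken from the project.

```python
def full_matrix_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_out * d_in


def lora_adapter_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one matrix: B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical model shape, NOT from the project.
hidden_size = 4096
num_layers = 32
adapted_projections_per_layer = 4  # assumed: the attention projections only
rank = 8

num_adapted = num_layers * adapted_projections_per_layer

# Weights that stay frozen in the adapted matrices.
frozen_params = num_adapted * full_matrix_params(hidden_size, hidden_size)

# Weights that actually receive gradients and optimizer state.
trainable_params = num_adapted * lora_adapter_params(hidden_size, hidden_size, rank)

trainable_fraction = trainable_params / frozen_params

print(f"Frozen parameters in adapted matrices: {frozen_params:,}")
print(f"Trainable LoRA parameters: {trainable_params:,}")
print(f"Trainable fraction: {trainable_fraction:.4%}")
```

Under these assumptions only a fraction of a percent of the adapted weights need gradients and optimizer state, which is the general reason LoRA reduces the compute and memory needed for fine-tuning.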
Why Train on NPUs?
The primary motivation is power efficiency. At 6.6 TFLOPS/watt, the ANE delivers roughly 6.6 times the compute per watt of the Metal GPU figure and about 4.7 times that of the H100 figure quoted above, which is particularly valuable for edge computing and energy-conscious development.
Available Resources
- Reverse Engineering documentation
- Benchmark results
- Training implementation (Work in Progress)
- GitHub repository with code
The project demonstrates that Apple's Neural Engine, typically treated as a black box, can be accessed directly for custom AI training workflows, offering developers an alternative to GPU-based training with superior power efficiency.
📖 Read the full source: r/LocalLLaMA