Reverse Engineering Apple Neural Engine for MicroGPT Training

Direct Access to Apple's Neural Engine

A developer has bypassed Apple's CoreML framework to access the Apple Neural Engine (ANE) directly on an M4 Mac mini and built a custom training pipeline for small language models. The developer reverse engineered the ANE's private APIs with help from Claude, then ran benchmarks and implemented training without going through CoreML, the interface Apple recommends.

Technical Specifications and Performance

The ANE on the M4 chip has a claimed 38 TOPS of INT8 compute, but the developer notes that it is actually an FP16 processor, so effective compute is about half that figure (roughly 19 TFLOPS). At peak compute the ANE draws only 2.8W, which the developer reports as 6.6 TFLOPS/watt. For comparison, the Metal GPU achieves approximately 1 TFLOPS/watt, while NVIDIA's H100 reaches 1.4 TFLOPS/watt.
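The efficiency arithmetic can be checked with a few lines of Python. This is only a sketch using the figures quoted above; the function and constant names are made up for illustration, and the "halve the INT8 rating" rule is the developer's claim rather than an Apple specification.

```python
# Figures quoted in the article. Halving the INT8 rating to get an FP16
# figure is the developer's claim, not an Apple specification.
CLAIMED_INT8_RATING = 38.0           # trillions of operations per second
ANE_PEAK_WATTS = 2.8
QUOTED_ANE_TFLOPS_PER_WATT = 6.6
QUOTED_METAL_TFLOPS_PER_WATT = 1.0   # "approximately" in the article
QUOTED_H100_TFLOPS_PER_WATT = 1.4


def effective_fp16_tflops(claimed_int8_rating: float) -> float:
    """Halve the INT8 rating to estimate FP16 throughput (assumption)."""
    return claimed_int8_rating / 2.0


def tflops_per_watt(tflops: float, watts: float) -> float:
    """Throughput divided by power draw."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return tflops / watts


# Nominal FP16 throughput and the efficiency it would imply at 2.8W.
fp16_tflops = effective_fp16_tflops(CLAIMED_INT8_RATING)
nominal_efficiency = tflops_per_watt(fp16_tflops, ANE_PEAK_WATTS)

# Throughput implied by the quoted 6.6 TFLOPS/watt at the same power draw.
implied_tflops = QUOTED_ANE_TFLOPS_PER_WATT * ANE_PEAK_WATTS

# How many times more efficient the ANE is than the two quoted GPU figures.
ratio_vs_metal = QUOTED_ANE_TFLOPS_PER_WATT / QUOTED_METAL_TFLOPS_PER_WATT
ratio_vs_h100 = QUOTED_ANE_TFLOPS_PER_WATT / QUOTED_H100_TFLOPS_PER_WATT

if __name__ == "__main__":
    print(f"nominal FP16 throughput: {fp16_tflops:.1f} TFLOPS")
    print(f"nominal efficiency:      {nominal_efficiency:.2f} TFLOPS/watt")
    print(f"throughput implied by 6.6 TFLOPS/watt: {implied_tflops:.2f} TFLOPS")
    print(f"ANE vs Metal GPU: {ratio_vs_metal:.1f}x, ANE vs H100: {ratio_vs_h100:.1f}x")
```

Dividing the halved figure by 2.8W gives about 6.8 TFLOPS/watt, slightly above the quoted 6.6, which would correspond to about 18.5 TFLOPS at the same power draw. The summary does not say which throughput number the 6.6 figure was derived from, so the small gap is left unexplained here.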

Training Implementation

The developer built a bespoke training pipeline and used it to train a 110M-parameter MicroGPT model on the ANE. A single chip cannot practically train larger models, but the developer suggests that a cluster of ANE devices could, in theory, train bigger ones. Even on a single device, LoRA training for 3B- or 7B-parameter models should be feasible.
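To give a rough sense of why LoRA is a lighter workload than full training, the sketch below counts the trainable parameters LoRA adds. Everything numeric here is an assumption for illustration: the summary gives no model dimensions, so the hidden size, layer count, rank, and number of adapted matrices are hypothetical values in the range commonly seen for 7B-class transformers, and the function name is made up.

```python
def lora_trainable_params(hidden_size: int, num_layers: int,
                          rank: int, adapted_matrices_per_layer: int) -> int:
    """Count LoRA adapter parameters for square hidden_size x hidden_size weights.

    Each adapted weight W gets two low-rank factors, A (rank x hidden_size)
    and B (hidden_size x rank), so it adds 2 * rank * hidden_size parameters.
    """
    if min(hidden_size, num_layers, rank, adapted_matrices_per_layer) <= 0:
        raise ValueError("all arguments must be positive")
    per_matrix = 2 * rank * hidden_size
    return num_layers * adapted_matrices_per_layer * per_matrix


# Hypothetical 7B-class configuration (NOT taken from the source):
# hidden size 4096, 32 layers, rank 8, two attention projections per layer.
HYPOTHETICAL_BASE_PARAMS = 7_000_000_000
adapter_params = lora_trainable_params(
    hidden_size=4096, num_layers=32, rank=8, adapted_matrices_per_layer=2
)
trainable_fraction = adapter_params / HYPOTHETICAL_BASE_PARAMS

if __name__ == "__main__":
    print(f"LoRA adapter parameters: {adapter_params:,}")
    print(f"fraction of base model:  {trainable_fraction:.4%}")
```

Under these assumed dimensions the adapters hold a few million parameters, well under 0.1% of the base model. This only counts trainable weights: the frozen base model still has to be stored and run in the forward and backward passes, so the count alone does not establish that a 3B or 7B model fits on a single ANE device.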

Why Train on NPUs?

The primary motivation is power efficiency. At 6.6 TFLOPS/watt, the ANE is several times more power-efficient than the GPU figures cited above (about 1 TFLOPS/watt for the Metal GPU and 1.4 TFLOPS/watt for the H100), which is particularly valuable for edge computing and energy-conscious development.

Available Resources

Reverse engineering documentation
Benchmark results
Training implementation (work in progress)
GitHub repository with code

The project demonstrates that Apple's Neural Engine, typically treated as a black box, can be accessed directly for custom AI training workflows, offering developers an alternative to GPU-based training with superior power efficiency.

📖 Read the full source: r/LocalLLaMA