Orion: Bypassing CoreML to Run and Train LLMs Directly on Apple Neural Engine

✍️ OpenClawRadar · 📅 Published: March 7, 2026

Direct ANE Access for LLM Workloads

Orion is an end-to-end Objective-C system that bypasses CoreML entirely to run and train LLMs directly on the Apple Neural Engine (ANE). CoreML has so far treated the ANE as a black box behind its own scheduler, leaving developers with no direct control over the hardware and no way to train on it. Orion gives them that control.

Technical Implementation and Constraints

The project builds on reverse-engineering work that mapped the private ANEClient and ANECompiler APIs. The ANE presents what the developer calls a "hardware impedance mismatch" with 17 total programming constraints, 11 of which were completely undocumented. Key constraints include:

  • The concat operation causes an immediate, silent compiler failure
  • BLOBFILE weights require a 64-byte offset from the chunk header, or you get silent numerical corruption
  • The ANE maintains internal state that hard-caps at ~119 compilations per process before silently failing
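
The 64-byte offset rule can be illustrated with a small packing helper. This is a minimal sketch in Python, not Orion's Objective-C code. The header bytes and the names `pack_chunk` and `read_weights` are hypothetical, the real BLOBFILE layout is not described here, and the sketch assumes the offset is measured from the first byte of the chunk header.

```python
import struct
from typing import List

DATA_OFFSET = 64  # from the article: weights must sit 64 bytes past the chunk header


def pack_chunk(header: bytes, weights: List[float]) -> bytes:
    """Pack a hypothetical chunk: header, zero padding up to DATA_OFFSET, fp16 weights.

    Assumption: the offset is measured from the first byte of the header.
    The header contents are placeholders, not the real BLOBFILE format.
    """
    if len(header) > DATA_OFFSET:
        raise ValueError("header does not fit in the 64-byte prefix")
    padding = b"\x00" * (DATA_OFFSET - len(header))
    payload = struct.pack(f"<{len(weights)}e", *weights)  # little-endian fp16
    return header + padding + payload


def read_weights(chunk: bytes, count: int) -> List[float]:
    """Read `count` fp16 values starting at DATA_OFFSET."""
    return list(struct.unpack_from(f"<{count}e", chunk, DATA_OFFSET))
```

A reader that started decoding at any other offset would still get well-formed fp16 numbers, just the wrong ones, which is one plausible way a layout mistake turns into silent numerical corruption rather than an error.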

Solutions to Training Challenges

Previous attempts at ANE training hit NaN divergence after a single step. Orion solves this by:

  • Wiring up a deferred compilation pipeline
  • Implementing strict activation clamping (to the range -65504 to +65504, the largest finite fp16 magnitude) to stop the fp16 overflow cascade
  • Using an exec() process restart loop after every training step to bypass the 119-compilation limit
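
Two of the fixes above, the activation clamp and the per-step process restart, can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions, not Orion's implementation: the function names and the `--step` / `--total` flags are hypothetical, and it assumes all training state is written to disk before the restart.

```python
import os
import sys

FP16_MAX = 65504.0  # largest finite fp16 value, the clamp bound quoted above


def clamp_fp16(values):
    """Clamp each activation into [-FP16_MAX, FP16_MAX] before it is stored as fp16."""
    return [max(-FP16_MAX, min(FP16_MAX, v)) for v in values]


def next_step_argv(script, step, total_steps):
    """Build the argv for the next training step, or None when training is done.

    The --step and --total flags are hypothetical, chosen for this sketch.
    """
    nxt = step + 1
    if nxt >= total_steps:
        return None
    return [sys.executable, script, "--step", str(nxt), "--total", str(total_steps)]


def restart_for_next_step(script, step, total_steps):
    """Replace the current process with a fresh one for the next step.

    Assumption: weights and optimizer state have already been saved to disk
    by the caller, since nothing held in memory survives exec().
    """
    argv = next_step_argv(script, step, total_steps)
    if argv is not None:
        os.execv(sys.executable, argv)  # does not return on success
```

Because exec() replaces the process image, any per-process compilation counter starts again from zero in the new process, which is why a restart after each step keeps a run below the ~119-compilation cap.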

Performance Results

The compiler lowers a 27-operation graph IR through five optimization passes down to ANE-native MIL. Current performance includes:

  • 170+ tokens/s for GPT-2 124M decode
  • Mechanically stable multi-step training on a 110M-parameter transformer (the "coherence ceiling" of the hardware)
  • Over 1,000 steps, loss dropped from 12.3 to 6.2 with zero NaNs

Current Limitations

The ANE bakes weights in at compile time, so every training update incurs a ~4.2s recompilation penalty. The ANE delivers ~19 TFLOPS in fp16, but the fundamental obstacle to using it has not been compute; it has been the complete lack of a native orchestration layer.
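
To put the recompilation penalty in perspective, here is a back-of-the-envelope calculation using only the figures quoted in this article. It assumes exactly one recompilation per training step and takes the 1,000-step run from the results above as the example; the function name is made up for this sketch.

```python
RECOMPILE_SECONDS = 4.2  # approximate per-update penalty quoted above
STEPS = 1000             # length of the training run reported in the article


def recompile_overhead_seconds(steps, seconds_per_recompile=RECOMPILE_SECONDS):
    """Total time spent recompiling, assuming exactly one recompile per step."""
    return steps * seconds_per_recompile


total = recompile_overhead_seconds(STEPS)
print(f"{total:.0f} s (~{total / 60:.0f} min) of recompilation over {STEPS} steps")
```

Under those assumptions the run spends roughly 4,200 seconds, about 70 minutes, on recompilation alone, before counting any forward or backward compute.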

📖 Read the full source: r/LocalLLaMA
