Orion: Bypassing CoreML to Run and Train LLMs Directly on Apple Neural Engine

Direct ANE Access for LLM Workloads
Orion is an end-to-end Objective-C system that bypasses CoreML entirely to run and train LLMs directly on the Apple Neural Engine (ANE). Until now, CoreML has exposed the ANE only as a black-box scheduling target, leaving developers with no direct control over the hardware and no way to train on it.
Technical Implementation and Constraints
The project builds on reverse-engineering work that mapped the private ANEClient and ANECompiler APIs. The ANE presents what the developer calls a "hardware impedance mismatch" with 17 total programming constraints, 11 of which were completely undocumented. Key constraints include:
- The concat operation causes an immediate, silent compiler failure
- BLOBFILE weights require a 64-byte offset from the chunk header, or you get silent numerical corruption
- The ANE maintains internal state that hard-caps at ~119 compilations per process before silently failing
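Because the compilation cap fails silently, a caller may prefer to count compilations itself and fail loudly before the limit is reached. The following is a minimal sketch of that idea, not Orion's code: the `CompileBudget` class, its `safety_margin` parameter, and the exact limit of 119 are assumptions made for illustration (the source only gives the cap as approximately 119).

```python
class CompileBudgetExceeded(RuntimeError):
    """Raised instead of letting a compilation fail silently."""


class CompileBudget:
    """Counts compilations in this process against an assumed hard cap.

    Hypothetical helper: the limit and margin are illustrative values,
    not numbers taken from any Apple API.
    """

    def __init__(self, limit=119, safety_margin=4):
        if not 0 <= safety_margin < limit:
            raise ValueError("safety_margin must be in [0, limit)")
        self.limit = limit
        self.safety_margin = safety_margin
        self.used = 0

    @property
    def remaining(self):
        """Compilations left before the assumed hard cap."""
        return self.limit - self.used

    def needs_restart(self, upcoming=1):
        """True if `upcoming` more compilations would eat into the margin."""
        return self.used + upcoming > self.limit - self.safety_margin

    def record(self, count=1):
        """Register `count` compilations; raise if the cap would be exceeded."""
        if self.used + count > self.limit:
            raise CompileBudgetExceeded(
                f"{self.used} + {count} compilations exceeds limit {self.limit}"
            )
        self.used += count
```

A caller would check `needs_restart()` before each batch of compilations and hand off to a fresh process when it returns true.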
Solutions to Training Challenges
Previous attempts at ANE training hit NaN divergence after a single step. Orion solves this by:
- Wiring up a deferred compilation pipeline
- Implementing strict activation clamping to stop fp16 overflow cascades (clamping activations to the range -65504 to +65504, the largest finite fp16 magnitude)
- Using an exec() process restart loop after every training step to bypass the 119-compilation limit
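Two of the ideas in this list, activation clamping and the exec() restart loop, can be sketched in a few lines. This is an illustration under stated assumptions rather than Orion's implementation (which is Objective-C): the function names, the JSON checkpoint file, and the `train_step` callback are all hypothetical, and the sketch assumes the script was started as `python script.py ...` so that re-running `sys.argv` restarts it.

```python
import json
import os
import sys

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def clamp_fp16(values):
    """Clamp each value into [-FP16_MAX, FP16_MAX] so it stays finite in fp16."""
    return [max(-FP16_MAX, min(FP16_MAX, v)) for v in values]


def save_state(path, state):
    """Write the checkpoint atomically so a restart never reads a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_state(path):
    """Load the checkpoint, or start from step 0 if none exists yet."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def run_one_step_then_restart(path, total_steps, train_step):
    """Run a single training step, checkpoint, then replace this process.

    `train_step` is a hypothetical callback that takes and returns the state
    dict. The fresh process image starts with no accumulated per-process
    state, which is the point of the restart.
    """
    state = load_state(path)
    if state["step"] >= total_steps:
        return state  # training already finished
    state = train_step(state)
    state["step"] += 1
    save_state(path, state)
    if state["step"] < total_steps:
        sys.stdout.flush()
        # Replace the current process with a fresh copy of the same script.
        os.execv(sys.executable, [sys.executable] + sys.argv)
    return state
```

Everything the next process needs has to live in the checkpoint, since nothing in memory survives the exec call.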
Performance Results
The compiler lowers a 27-operation graph IR through five optimization passes down to ANE-native MIL. Current performance includes:
- 170+ tokens/s for GPT-2 124M decode
- Mechanically stable multi-step training on a 110M parameter transformer (the "coherence ceiling" of the hardware)
- Over 1,000 steps, loss dropped from 12.3 to 6.2 with zero NaNs
Current Limitations
The ANE bakes weights at compile time, so every training update incurs a ~4.2s recompilation penalty. The ANE delivers ~19 TFLOPS in fp16, but the fundamental obstacle to using it has not been compute; it has been the complete lack of a native orchestration layer.
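To put the recompilation penalty in perspective, the arithmetic below estimates the time spent on recompilation alone. It assumes exactly one recompilation per training step and that the ~4.2s figure applies uniformly; the function name is hypothetical, and the 1,000-step count is borrowed from the training run reported above.

```python
RECOMPILE_SECONDS = 4.2  # approximate per-update cost quoted in the text


def recompile_overhead_minutes(steps, per_step_seconds=RECOMPILE_SECONDS):
    """Minutes spent recompiling, assuming one recompilation per step."""
    return steps * per_step_seconds / 60.0


# For a 1,000-step run this comes to roughly 70 minutes of recompilation,
# before counting any time spent on the training computation itself.
overhead_1000 = recompile_overhead_minutes(1000)
```

Under these assumptions the overhead grows linearly with the step count, so longer runs pay the same ~4.2s for every update.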
📖 Read the full source: r/LocalLLaMA