Needle: 26M Tool-Calling Model Without FFNs

Needle is a 26M-parameter model designed specifically for single-shot function calling. It uses only cross-attention and gating layers, with no feed-forward networks (FFNs), based on the insight that tool calling is retrieval and assembly (match the query to a tool name, extract argument values, emit JSON) rather than reasoning. The authors report 6,000 tok/s prefill and 1,200 tok/s decode on consumer devices.
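To make the "retrieval and assembly" framing concrete, here is a minimal hand-written sketch of the three steps (match, extract, emit). It is a toy stand-in, not the model and not Needle's actual input or output format: the tool names, the keyword sets, the regex patterns, and the `{"name": ..., "arguments": ...}` JSON shape are all assumptions made for illustration.

```python
import json
import re

# Hypothetical tool schemas; names, keywords, and patterns are illustrative only.
TOOLS = [
    {"name": "set_timer",
     "keywords": {"timer", "countdown"},
     "args": {"minutes": r"(\d+)\s*min"}},
    {"name": "send_message",
     "keywords": {"message", "text"},
     "args": {"recipient": r"to (\w+)"}},
]


def call_tool(query: str) -> str:
    """Toy single-shot function call: one query in, one JSON call out."""
    words = set(re.findall(r"[a-z]+", query.lower()))

    # Step 1: match the query to a tool name (here, by keyword overlap).
    tool = max(TOOLS, key=lambda t: len(t["keywords"] & words))

    # Step 2: extract argument values directly from the query text.
    args = {}
    for arg_name, pattern in tool["args"].items():
        match = re.search(pattern, query, flags=re.IGNORECASE)
        if match:
            value = match.group(1)
            args[arg_name] = int(value) if value.isdigit() else value

    # Step 3: emit the call as JSON.
    return json.dumps({"name": tool["name"], "arguments": args})


print(call_tool("Set a timer for 10 minutes"))
# {"name": "set_timer", "arguments": {"minutes": 10}}
```

Nothing in this pipeline requires recalling facts that are absent from the query or the tool list, which is the property the no-FFN design relies on.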

Training Details

Pretrained on 200B tokens across 16 TPU v6e (27 hours)
Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
Data synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)
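The figures above imply some useful back-of-the-envelope numbers. The sketch below only does arithmetic on the stated values; it assumes the quoted hours and minutes are wall-clock training time, and it reports post-training throughput in aggregate because the hardware used for that stage is not stated.

```python
# Values taken from the training details above.
PARAMS = 26e6              # model parameters
PRETRAIN_TOKENS = 200e9    # pretraining tokens
PRETRAIN_HOURS = 27        # assumed to be wall-clock time
NUM_TPUS = 16              # TPU v6e count used for pretraining
POSTTRAIN_TOKENS = 2e9     # post-training tokens
POSTTRAIN_MINUTES = 45     # assumed to be wall-clock time

# Tokens seen per parameter during pretraining.
tokens_per_param = PRETRAIN_TOKENS / PARAMS

# Aggregate and per-TPU pretraining throughput in tokens per second.
pretrain_tok_per_s = PRETRAIN_TOKENS / (PRETRAIN_HOURS * 3600)
pretrain_tok_per_s_per_tpu = pretrain_tok_per_s / NUM_TPUS

# Aggregate post-training throughput in tokens per second.
posttrain_tok_per_s = POSTTRAIN_TOKENS / (POSTTRAIN_MINUTES * 60)

print(f"tokens per parameter:        {tokens_per_param:,.0f}")
print(f"pretrain tok/s (aggregate):  {pretrain_tok_per_s:,.0f}")
print(f"pretrain tok/s (per TPU):    {pretrain_tok_per_s_per_tpu:,.0f}")
print(f"post-train tok/s (aggregate): {posttrain_tok_per_s:,.0f}")
```

This works out to roughly 7,700 pretraining tokens per parameter and about 2 million tokens per second in aggregate during pretraining.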

Architecture: Simple Attention Networks

The entire model is just attention and gating, with no MLPs anywhere. The authors argue that FFN parameters are wasted at this scale for tool calling, and that the 'no FFN' finding generalizes to any task where the model has access to external structured knowledge, such as tool use or retrieval-augmented generation (RAG). The model does not need to memorize facts in FFN weights if the facts are provided in the input.
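As a rough illustration of what an "attention plus gating, no FFN" layer can look like, here is a dependency-free sketch. The summary does not give Needle's exact layer layout, so everything here is an assumption: a single-head cross-attention whose output is scaled by a sigmoid gate computed from the input and then added back as a residual. The function names and the parameter names (`wq`, `wk`, `wv`, `wg`) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, context, wq, wk, wv):
    """Single-head cross-attention: queries from x, keys and values from context."""
    q, k, v = matmul(x, wq), matmul(context, wk), matmul(context, wv)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    return matmul([softmax(r) for r in scores], v)


def gated_attention_block(x, context, p):
    """Residual block: x + sigmoid(x @ wg) * attention(x, context). No FFN sub-layer."""
    attn = attention(x, context, p["wq"], p["wk"], p["wv"])
    gate = [[1.0 / (1.0 + math.exp(-g)) for g in row] for row in matmul(x, p["wg"])]
    return [[xi + gi * ai for xi, gi, ai in zip(xr, gr, ar)]
            for xr, gr, ar in zip(x, gate, attn)]


def init_params(d, seed=0):
    """Random d x d matrices for the query, key, value, and gate projections."""
    rng = random.Random(seed)

    def mat():
        return [[rng.gauss(0.0, d ** -0.5) for _ in range(d)] for _ in range(d)]

    return {name: mat() for name in ("wq", "wk", "wv", "wg")}


d = 4
params = init_params(d)
x = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]              # 2 query tokens
context = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]  # 3 context tokens
out = gated_attention_block(x, context, params)
print(len(out), len(out[0]))  # 2 4
```

This block holds 4d² weights. For comparison, a vanilla Transformer block with four d × d attention projections and a 4×-expansion FFN holds about 12d², of which the FFN accounts for 8d², so dropping FFNs removes roughly two thirds of the per-block weights in that standard layout.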

Benchmarks

Needle beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling, though those models have more capacity for conversational settings.

How to Use

# Test the model via the playground or finetune on your Mac/PC
git clone https://github.com/cactus-compute/needle

GitHub: github.com/cactus-compute/needle
Weights: huggingface.co/Cactus-Compute/needle
Architecture writeup: Simple Attention Networks docs
Inference engine for mobile/wearables (Cactus): github.com/cactus-compute/cactus

Everything is MIT licensed.

📖 Read the full source: r/LocalLLaMA

