Needle: A 26M Parameter Tool-Calling Model Built Entirely Without FFNs

✍️ OpenClawRadar | 📅 Published: May 12, 2026 | 🔗 Source

Needle is a 26M-parameter model designed specifically for single-shot function calling. It uses cross-attention and gating layers with no feed-forward networks (FFNs), based on the insight that tool calling is retrieval and assembly (match the query to a tool name, extract argument values, emit JSON) rather than reasoning. The model runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.
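
To make the three steps concrete, here is a toy Python sketch of the retrieval-and-assembly view. It is not how Needle works internally (Needle learns this with attention, not keyword rules), and the tool names, keyword sets, regex patterns, and the `name`/`arguments` JSON shape are all assumptions invented for illustration.

```python
import json
import re

# Hypothetical tool table, invented for this example; not Needle's actual tools.
TOOLS = {
    "set_timer": {
        "keywords": {"timer", "countdown"},
        "arg_pattern": r"(?P<minutes>\d+)\s*minutes?",
    },
    "send_message": {
        "keywords": {"message", "text"},
        "arg_pattern": r"\bto (?P<recipient>\w+)",
    },
}

def call_tool(query: str) -> str:
    words = set(re.findall(r"\w+", query.lower()))
    # Step 1: match the query to a tool name (here: largest keyword overlap).
    name = max(TOOLS, key=lambda tool: len(TOOLS[tool]["keywords"] & words))
    # Step 2: extract argument values directly from the query text.
    match = re.search(TOOLS[name]["arg_pattern"], query, flags=re.IGNORECASE)
    arguments = match.groupdict() if match else {}
    # Step 3: emit JSON (the output shape is an assumption, not Needle's format).
    return json.dumps({"name": name, "arguments": arguments})

print(call_tool("Set a timer for 10 minutes"))
```

Nothing in this pipeline requires stored world knowledge: every value in the output is copied from either the tool table or the query, which is the property the article's argument rests on.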

Training Details

  • Pretrained on 200B tokens across 16 TPU v6e (27 hours)
  • Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
  • Data synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)
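
As a rough sanity check on these figures, the short Python sketch below derives the implied training throughput and the token-to-parameter ratio. It assumes "16 TPU v6e" means 16 accelerator chips, that the 27 hours is wall-clock time, and that the 26M parameter count from the introduction applies; the source does not say what hardware the post-training run used, so no per-chip figure is computed for it.

```python
# Figures taken from the article.
PRETRAIN_TOKENS = 200e9
PRETRAIN_HOURS = 27
NUM_CHIPS = 16          # assumption: "16 TPU v6e" means 16 chips
POSTTRAIN_TOKENS = 2e9
POSTTRAIN_MINUTES = 45
MODEL_PARAMS = 26e6

def tokens_per_second(tokens: float, seconds: float) -> float:
    return tokens / seconds

pretrain_rate = tokens_per_second(PRETRAIN_TOKENS, PRETRAIN_HOURS * 3600)
per_chip_rate = pretrain_rate / NUM_CHIPS
posttrain_rate = tokens_per_second(POSTTRAIN_TOKENS, POSTTRAIN_MINUTES * 60)
tokens_per_param = PRETRAIN_TOKENS / MODEL_PARAMS

print(f"pretraining:   {pretrain_rate:,.0f} tok/s")   # roughly 2.06 million
print(f"per chip:      {per_chip_rate:,.0f} tok/s")   # roughly 129 thousand
print(f"post-training: {posttrain_rate:,.0f} tok/s")  # roughly 741 thousand
print(f"tokens per parameter: {tokens_per_param:,.0f}")  # roughly 7,700
```

The last number is the notable one: at about 7,700 pretraining tokens per parameter, the model sees far more data relative to its size than typical larger language models do.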

Architecture: Simple Attention Networks

The entire model is just attention and gating, with no MLPs anywhere. The authors argue that FFN parameters are wasted at this scale for tool calling, and that the 'no FFN' finding generalizes to any task where the model has access to external structured knowledge (retrieval-augmented generation, tool use). The model doesn't need to memorize facts in FFN weights if the facts are provided in the input.
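
The article does not spell out the exact layer design, so the following is only a generic sketch of what an attention-plus-gating block with no FFN could look like, not Needle's actual implementation. It assumes a single head, a sigmoid gate computed from the input token, and a residual connection; all function names are hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
import math
import random

def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def random_matrix(dim, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

def gated_cross_attention(token, context, w_q, w_k, w_v, w_g):
    """One query token attends over context vectors (e.g. tool descriptions).

    The attended result is scaled by a per-dimension sigmoid gate and added
    back to the input. There is no feed-forward sublayer after it.
    """
    dim = len(token)
    query = matvec(w_q, token)
    keys = [matvec(w_k, c) for c in context]
    values = [matvec(w_v, c) for c in context]
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
              for key in keys]
    attn = softmax(scores)
    attended = [sum(a * v[i] for a, v in zip(attn, values)) for i in range(dim)]
    gate = [sigmoid(g) for g in matvec(w_g, token)]
    output = [t + g * a for t, g, a in zip(token, gate, attended)]
    return output, attn

rng = random.Random(0)
dim = 4
token = [rng.uniform(-1, 1) for _ in range(dim)]
context = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
w_q, w_k, w_v, w_g = (random_matrix(dim, rng) for _ in range(4))
output, attn = gated_cross_attention(token, context, w_q, w_k, w_v, w_g)
print("attention over context:", [round(a, 3) for a in attn])
```

In this sketch the only information added to the token comes from the context vectors, weighted by attention and scaled by the gate. That mirrors the claim above: when the needed facts are in the input, the block can route them without a separate FFN to store them.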

Benchmarks

Needle beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling, though those models have more capacity for conversational settings.

How to Use

# Test the model via the playground or finetune on your Mac/PC
git clone https://github.com/cactus-compute/needle

Everything is MIT licensed.

📖 Read the full source: r/LocalLLaMA
