Needle: A 26M-Parameter Tool-Calling Model Built Entirely Without FFNs
Needle is a 26M-parameter model designed specifically for single-shot function calling. It uses cross-attention and gating layers and no feed-forward networks (FFNs), based on the insight that tool calling is retrieval and assembly (match the query to a tool name, extract argument values, emit JSON) rather than reasoning. The model runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.
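To make the "retrieval and assembly" framing concrete, here is a hand-written toy that performs the same three steps with keyword overlap and regular expressions. It is not how Needle works internally; the tool names, keywords, patterns, and the `{"name", "arguments"}` output shape are all assumptions chosen for illustration.

```python
import json
import re

# Hypothetical tool registry: names, keywords, and argument patterns are
# illustrative only and do not come from the Needle repository.
TOOLS = {
    "set_timer": {
        "keywords": {"timer", "countdown"},
        "args": {"minutes": (r"(\d+)\s*min", int)},
    },
    "send_message": {
        "keywords": {"message", "text"},
        "args": {"recipient": (r"\bto\s+(\w+)", str)},
    },
}


def call_tool(query: str) -> str:
    """Turn a natural-language query into a JSON tool call (toy version)."""
    words = set(re.findall(r"[a-z]+", query.lower()))

    # Step 1: match the query to a tool name by keyword overlap.
    name = max(TOOLS, key=lambda tool: len(TOOLS[tool]["keywords"] & words))

    # Step 2: extract argument values directly from the query text.
    arguments = {}
    for arg, (pattern, cast) in TOOLS[name]["args"].items():
        match = re.search(pattern, query, flags=re.IGNORECASE)
        if match:
            arguments[arg] = cast(match.group(1))

    # Step 3: emit the assembled call as JSON.
    return json.dumps({"name": name, "arguments": arguments})


print(call_tool("Set a timer for 10 minutes"))
print(call_tool("Send a text to Alice"))
```

Nothing in these steps requires recalling facts that are absent from the input, which is the property the authors lean on when dropping FFNs.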
Training Details
- Pretrained on 200B tokens across 16 TPU v6e (27 hours)
- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
- Data synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)
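The figures above imply some back-of-the-envelope ratios that are easy to check. The sketch below assumes the quoted durations are wall-clock times and that all 16 chips were in use for the whole pretraining run; neither assumption is confirmed by the source.

```python
# Figures quoted in the post.
PARAMS = 26e6
PRETRAIN_TOKENS = 200e9
PRETRAIN_HOURS = 27
CHIPS = 16
POSTTRAIN_TOKENS = 2e9
POSTTRAIN_MINUTES = 45

# Tokens seen per parameter during pretraining.
tokens_per_param = PRETRAIN_TOKENS / PARAMS

# Aggregate and per-chip pretraining throughput
# (assumes wall-clock time and all 16 chips busy throughout).
pretrain_tok_per_s = PRETRAIN_TOKENS / (PRETRAIN_HOURS * 3600)
pretrain_tok_per_s_per_chip = pretrain_tok_per_s / CHIPS

# Aggregate post-training throughput (chip count for this phase is not stated).
posttrain_tok_per_s = POSTTRAIN_TOKENS / (POSTTRAIN_MINUTES * 60)

print(f"tokens per parameter:        {tokens_per_param:,.0f}")
print(f"pretraining tok/s (total):   {pretrain_tok_per_s:,.0f}")
print(f"pretraining tok/s (per chip):{pretrain_tok_per_s_per_chip:,.0f}")
print(f"post-training tok/s (total): {posttrain_tok_per_s:,.0f}")
```

Under those assumptions the model sees roughly 7,700 tokens per parameter, and pretraining runs at about 2 million tokens per second in aggregate.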
Architecture: Simple Attention Networks
The entire model is just attention and gating, with no MLPs anywhere. The authors argue that FFN parameters are wasted at this scale for tool calling, and that the 'no FFN' finding generalizes to any task where the model has access to external structured knowledge (tool use, retrieval-augmented generation). The model doesn't need to memorize facts in FFN weights if the facts are provided in the input.
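As a rough picture of what an "attention plus gating, no FFN" block could look like, the sketch below implements a single-head cross-attention layer whose output is scaled by a sigmoid gate and added back to the input. The post does not specify the gating form, head count, or normalization, so this particular layout (`gated_cross_attention`, the `w_g` gate projection, the residual connection) is an assumption and not the authors' implementation. It uses plain Python lists to stay dependency-free.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_cross_attention(x, context, w_q, w_k, w_v, w_g):
    """One hypothetical FFN-free block: x + sigmoid(x W_g) * CrossAttn(x, context)."""
    d = len(w_q[0])
    q = matmul(x, w_q)            # queries come from the decoder states
    k = matmul(context, w_k)      # keys and values come from the provided context
    v = matmul(context, w_v)
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, v)
    gate = [[sigmoid(g) for g in row] for row in matmul(x, w_g)]
    # Residual plus gated attention output; note there is no MLP afterwards.
    return [
        [xi + gi * ai for xi, gi, ai in zip(x_row, g_row, a_row)]
        for x_row, g_row, a_row in zip(x, gate, attended)
    ]


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
dim = 4
x = rand_matrix(2, dim)          # 2 decoder positions
context = rand_matrix(3, dim)    # 3 context tokens (e.g. tool descriptions)
w_q, w_k, w_v, w_g = (rand_matrix(dim, dim) for _ in range(4))
out = gated_cross_attention(x, context, w_q, w_k, w_v, w_g)
print(len(out), len(out[0]))
```

All information the block can add to `x` comes from `context` through the value projection, which mirrors the argument that facts supplied in the input need not be stored in FFN weights.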
Benchmarks
Needle beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling, though those models have more capacity for conversational settings.
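The post does not say how single-shot function calling was scored. One common approach, shown here purely as an assumed illustration, is exact match on the tool name and arguments after parsing the model's JSON output; the helper names and the call format below are hypothetical.

```python
import json


def call_matches(predicted: str, expected: dict) -> bool:
    """True if the predicted JSON string names the same tool with the same arguments."""
    try:
        call = json.loads(predicted)
    except json.JSONDecodeError:
        return False  # malformed JSON counts as a miss
    if not isinstance(call, dict):
        return False
    return (
        call.get("name") == expected["name"]
        and call.get("arguments") == expected["arguments"]
    )


def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference call."""
    hits = sum(call_matches(p, r) for p, r in zip(predictions, references))
    return hits / len(references)


reference = {"name": "set_timer", "arguments": {"minutes": 10}}
predictions = [
    '{"name": "set_timer", "arguments": {"minutes": 10}}',  # correct
    "not json",                                             # unparseable
    '{"name": "set_timer", "arguments": {"minutes": 5}}',   # wrong argument
]
score = accuracy(predictions, [reference] * 3)
print(f"{score:.3f}")
```

Exact match is strict: a call with the right tool but one wrong argument scores the same as unparseable output.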
How to Use
# Test the model via the playground or finetune on your Mac/PC
git clone https://github.com/cactus-compute/needle
- GitHub: github.com/cactus-compute/needle
- Weights: huggingface.co/Cactus-Compute/needle
- Architecture writeup: Simple Attention Networks docs
- Inference engine for mobile/wearables (Cactus): github.com/cactus-compute/cactus
Everything is MIT licensed.
📖 Read the full source: r/LocalLLaMA