Needle: A 26M Parameter Function-Calling Model That Runs at 6000 tok/s on Mobile
Cactus has open-sourced Needle, a 26M parameter function-calling model designed to run on budget phones, watches, and glasses. It achieves 6000 tok/s prefill and 1200 tok/s decode on consumer devices using their custom inference engine, Cactus.
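To put those throughput figures in perspective, a back-of-the-envelope latency estimate for a single function call might look like the sketch below. The request size (300 prompt tokens, 30 output tokens) is an illustrative assumption rather than a measurement from the source, and the estimate ignores fixed costs such as tokenization and model loading.

```python
# Rough latency estimate from the throughput figures quoted above.
PREFILL_TOK_PER_S = 6000  # quoted prefill speed
DECODE_TOK_PER_S = 1200   # quoted decode speed


def estimate_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Return prefill time plus decode time, in seconds."""
    prefill_s = prompt_tokens / PREFILL_TOK_PER_S
    decode_s = output_tokens / DECODE_TOK_PER_S
    return prefill_s + decode_s


# Hypothetical request: 300 tokens of query plus tool schema, 30 tokens of JSON output.
latency_ms = estimate_latency_s(300, 30) * 1000
print(f"{latency_ms:.0f} ms")  # prints "75 ms"
```

Under these assumptions, prefill and decode together take well under a tenth of a second, which is the regime where on-device tool calling feels instantaneous.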
Architecture: Simple Attention Networks
Needle uses a Simple Attention Network with no MLPs anywhere: the entire model consists of attention and gating layers. Key design choices: model dimension d=512, 8 query heads sharing 4 key/value heads (8H/4KV), a BPE vocabulary of 8192 tokens, and an encoder-decoder structure (12 encoder layers, 8 decoder layers) using cross-attention, masked self-attention with RoPE, and tied embeddings.
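The shorthand in that configuration can be unpacked with a little shape arithmetic. The sketch below is not taken from the Needle source: it assumes the heads split the model dimension evenly, and that consecutive query heads share a key/value head, which is a common convention but not something the article confirms.

```python
# Shape arithmetic for the quoted configuration (d=512, 8H/4KV, BPE=8192).
D_MODEL = 512
N_QUERY_HEADS = 8
N_KV_HEADS = 4
VOCAB_SIZE = 8192

# Assumption: heads split the model dimension evenly.
HEAD_DIM = D_MODEL // N_QUERY_HEADS

# Assumption: each group of consecutive query heads reads the same key/value head.
GROUP_SIZE = N_QUERY_HEADS // N_KV_HEADS
kv_head_for_query = [q // GROUP_SIZE for q in range(N_QUERY_HEADS)]

# Values cached per token per attention layer (keys plus values).
kv_cache_shared = 2 * N_KV_HEADS * HEAD_DIM    # 4 key/value heads
kv_cache_full = 2 * N_QUERY_HEADS * HEAD_DIM   # if every query head had its own

# Tied embeddings: one vocab-by-d matrix serves as both input and output table.
embedding_params = VOCAB_SIZE * D_MODEL

print(HEAD_DIM)                        # 64
print(kv_head_for_query)               # [0, 0, 1, 1, 2, 2, 3, 3]
print(kv_cache_shared, kv_cache_full)  # 512 1024
print(embedding_params)                # 4194304
```

Under these assumptions, sharing key/value heads halves the per-token cache, and the tied embedding table accounts for roughly 4.2M of the 26M parameters.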
Training Details
- Pretrained on 200B tokens across 16 TPU v6e (27 hours)
- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
- Data synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)
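The token counts and durations above imply a training throughput, which the following sketch works out. It assumes the quoted times are wall-clock durations for the whole job. The article gives no chip count for post-training, so only the aggregate rate is computed for that phase.

```python
# Implied training throughput from the quoted token counts and durations.
PRETRAIN_TOKENS = 200e9
PRETRAIN_HOURS = 27
N_CHIPS = 16  # TPU v6e count quoted for pretraining

POSTTRAIN_TOKENS = 2e9
POSTTRAIN_MINUTES = 45


def tokens_per_second(tokens: float, seconds: float) -> float:
    """Average throughput over the whole run."""
    return tokens / seconds


pretrain_tps = tokens_per_second(PRETRAIN_TOKENS, PRETRAIN_HOURS * 3600)
pretrain_tps_per_chip = pretrain_tps / N_CHIPS
posttrain_tps = tokens_per_second(POSTTRAIN_TOKENS, POSTTRAIN_MINUTES * 60)

print(f"pretraining:   {pretrain_tps / 1e6:.2f}M tok/s total")
print(f"               {pretrain_tps_per_chip / 1e3:.0f}K tok/s per chip")
print(f"post-training: {posttrain_tps / 1e6:.2f}M tok/s total")
```

This works out to roughly 2.06M tokens per second across the 16 chips during pretraining (about 129K per chip) and about 0.74M tokens per second during post-training.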
Benchmark Results
Needle beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling. Those models are much larger, however, so they cover a broader range of tasks and excel in conversational settings.
Quickstart
git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground
This opens a web UI at http://127.0.0.1:7860 for testing and fine-tuning on your own tools.
Usage (Python)
from needle import SimpleAttentionNetwork, load_checkpoint, generate, get_tokenizer
params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
tokenizer = get_tokenizer()
result = generate(
model, params, tokenizer,
query="What's the weather in San Francisco?",
tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
stream=False
)
print(result)
# Output: [{"name":"get_weather","arguments":{"location":"San Francisco"}}]
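Once the model has produced a call, the host application still has to execute it. The sketch below shows one way to do that, assuming the result is available as the JSON text shown above. The `get_weather` stub, the `TOOLS` registry, and the `dispatch` helper are hypothetical names for illustration and are not part of the needle package.

```python
import json


# Hypothetical local implementation of the tool named in the example above.
def get_weather(location: str) -> str:
    return f"(stub) weather for {location}"


# Registry mapping tool names to a callable and its expected argument names.
TOOLS = {"get_weather": (get_weather, {"location"})}


def dispatch(model_output: str) -> list:
    """Parse the model's JSON output and run each call it names."""
    results = []
    for call in json.loads(model_output):
        name = call.get("name")
        args = call.get("arguments", {})
        if name not in TOOLS:
            raise ValueError(f"unknown tool: {name!r}")
        func, expected_args = TOOLS[name]
        # Reject calls whose argument names do not match the registry.
        if set(args) != expected_args:
            raise ValueError(f"unexpected arguments for {name}: {sorted(args)}")
        results.append(func(**args))
    return results


output = '[{"name":"get_weather","arguments":{"location":"San Francisco"}}]'
print(dispatch(output))  # ['(stub) weather for San Francisco']
```

Checking the tool name and argument names before calling anything keeps a malformed or hallucinated call from reaching real code, which matters more for a small model than for a large one.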
Fine-tuning Locally
# via playground (auto-generates data via Gemini)
needle playground
# or provide your own data
needle finetune data.jsonl
Availability
Weights are on Hugging Face: Cactus-Compute/needle. Everything is MIT licensed.