Needle: A 26M Parameter Function-Calling Model That Runs at 6000 tok/s on Mobile

OpenClawRadar | Published: May 12, 2026
Cactus has open-sourced Needle, a 26M-parameter function-calling model designed to run on budget phones, watches, and glasses. It reaches 6000 tok/s prefill and 1200 tok/s decode on consumer devices when run on the company's custom inference engine, also called Cactus.
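To put the throughput figures in perspective, here is a back-of-the-envelope latency estimate. It is only a sketch: the request size (300 prompt tokens, 30 output tokens) is a made-up example, and the function name is illustrative rather than part of any Needle or Cactus API.

```python
def estimate_latency_s(prompt_tokens: int, output_tokens: int,
                       prefill_tps: float = 6000.0,
                       decode_tps: float = 1200.0) -> float:
    """Rough latency: time to prefill the prompt plus time to decode the output."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Hypothetical request: 300 tokens of query plus tool schemas, 30 tokens of JSON output.
latency = estimate_latency_s(300, 30)
print(f"{latency * 1000:.0f} ms")  # 50 ms prefill + 25 ms decode = 75 ms
```

Under those assumptions a single tool call would complete in well under a tenth of a second, ignoring model load time and any overhead outside the engine.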

Architecture: Simple Attention Networks

Needle uses a Simple Attention Network with no MLPs anywhere: the entire model consists of attention and gating layers. Key dimensions: model width d=512, 8 attention heads with 4 key-value heads (8H/4KV), and a BPE vocabulary of 8192 tokens. The structure is encoder-decoder (12 encoder layers, 8 decoder layers) with cross-attention, masked self-attention with RoPE, and tied embeddings.
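A few derived quantities follow from these numbers. The sketch below reads "8H/4KV" as 8 query heads sharing 4 key-value heads and "BPE=8192" as the vocabulary size, and it assumes the common convention that the per-head width is d divided by the number of heads. The source does not confirm that last point, and the class and field names are illustrative, not taken from the Needle codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeedleShape:
    # Values quoted in the article; field names are hypothetical.
    d_model: int = 512
    n_query_heads: int = 8
    n_kv_heads: int = 4
    vocab_size: int = 8192
    encoder_layers: int = 12
    decoder_layers: int = 8

    @property
    def head_dim(self) -> int:
        # Assumption: heads split the model width evenly.
        return self.d_model // self.n_query_heads

    @property
    def queries_per_kv_head(self) -> int:
        # How many query heads share each key-value head.
        return self.n_query_heads // self.n_kv_heads

    @property
    def tied_embedding_params(self) -> int:
        # With tied embeddings, one vocab x d_model matrix serves input and output.
        return self.vocab_size * self.d_model


shape = NeedleShape()
print(shape.head_dim)               # 64
print(shape.queries_per_kv_head)    # 2
print(shape.tied_embedding_params)  # 4194304
```

On this reading, the tied embedding matrix alone would account for roughly 4.2M of the 26M parameters, which is one reason a small vocabulary matters at this scale.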

Training Details

  • Pretrained on 200B tokens across 16 TPU v6e (27 hours)
  • Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
  • Data synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)
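The pretraining figures imply a training throughput that is easy to work out. The arithmetic below is a rough sketch that treats "16 TPU v6e" as 16 devices sharing the load evenly and takes 27 hours as pure training time; neither assumption is stated in the source.

```python
PRETRAIN_TOKENS = 200e9   # 200B tokens
PRETRAIN_HOURS = 27
NUM_DEVICES = 16          # assumption: 16 TPU v6e devices, evenly loaded
PARAMS = 26e6             # 26M parameters

total_tps = PRETRAIN_TOKENS / (PRETRAIN_HOURS * 3600)
per_device_tps = total_tps / NUM_DEVICES
tokens_per_param = PRETRAIN_TOKENS / PARAMS

print(f"aggregate:  {total_tps:,.0f} tok/s")       # about 2.06 million
print(f"per device: {per_device_tps:,.0f} tok/s")  # about 129 thousand
print(f"tokens per parameter: {tokens_per_param:,.0f}")  # about 7,700
```

The last figure shows how heavily the model is trained relative to its size: several thousand tokens per parameter.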

Benchmark Results

Needle beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling. Those larger models have broader scope and more capacity, however, and excel in conversational settings.

Quickstart

git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground

This opens a web UI at http://127.0.0.1:7860 for testing the model and fine-tuning it on your own tools.

Usage (Python)

from needle import SimpleAttentionNetwork, load_checkpoint, generate, get_tokenizer

params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
tokenizer = get_tokenizer()

result = generate(
    model,
    params,
    tokenizer,
    query="What's the weather in San Francisco?",
    tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
    stream=False,
)
print(result)

[{"name":"get_weather","arguments":{"location":"San Francisco"}}]
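Once the model has produced a tool call like the one above, the application still has to execute it. The following is a minimal dispatch sketch, assuming the result is available as a JSON string in the format shown. `TOOL_REGISTRY`, `dispatch`, and the placeholder `get_weather` handler are hypothetical names for illustration and are not part of the needle package.

```python
import json


def get_weather(location: str) -> str:
    # Placeholder handler; a real app would call a weather service here.
    return f"Weather lookup requested for {location}"


# Maps tool names the model may emit to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch(model_output: str) -> list:
    """Parse the model's JSON tool calls and run each registered handler."""
    results = []
    for call in json.loads(model_output):
        name = call.get("name")
        handler = TOOL_REGISTRY.get(name)
        if handler is None:
            # Refuse names that were never registered instead of guessing.
            raise ValueError(f"Unknown tool: {name!r}")
        results.append(handler(**call.get("arguments", {})))
    return results


output = '[{"name":"get_weather","arguments":{"location":"San Francisco"}}]'
print(dispatch(output))  # ['Weather lookup requested for San Francisco']
```

Routing through an explicit registry means the model can only trigger functions the application has deliberately exposed, which is a sensible default for on-device assistants.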

Fine-tuning Locally

# via playground (auto-generates data via Gemini)
needle playground

# or provide your own data
needle finetune data.jsonl
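The source does not document the record schema that `needle finetune` expects, so check the repository before preparing data. Purely as an illustration of the JSONL mechanics (one JSON object per line), the sketch below serializes examples using field names `query`, `tools`, and `answer`. These are placeholders borrowed from the usage example above, not a confirmed format.

```python
import json


def to_jsonl(examples: list) -> str:
    """Serialize each example as one JSON object per line."""
    return "".join(json.dumps(ex, ensure_ascii=False) + "\n" for ex in examples)


# Hypothetical record layout; the real schema may differ.
examples = [
    {
        "query": "Text Sam that I'm running late",
        "tools": [{"name": "send_message",
                   "parameters": {"recipient": "string", "body": "string"}}],
        "answer": [{"name": "send_message",
                    "arguments": {"recipient": "Sam", "body": "I'm running late"}}],
    }
]

text = to_jsonl(examples)

# Sanity check: every line must parse back to the original object.
parsed = [json.loads(line) for line in text.splitlines()]
print(parsed == examples)  # True
```

Writing `text` to a file named data.jsonl would give an input of the right shape for the command above, provided the field names are adjusted to whatever the tool actually requires.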

Availability

Weights are on Hugging Face: Cactus-Compute/needle. Everything is MIT licensed.

📖 Read the full source: HN AI Agents
