Needle: 26M Parameter Model Runs at 6000 tok/s on Mobile

Cactus has open-sourced Needle, a 26M parameter function-calling model designed to run on budget phones, watches, and glasses. Cactus reports 6000 tok/s prefill and 1200 tok/s decode on consumer devices when the model runs on the company's custom inference engine, also named Cactus.
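To get a feel for what those throughput figures mean for a single request, here is a rough back-of-the-envelope sketch. The token counts (300 prompt tokens, 30 output tokens) are hypothetical, and the estimate ignores model loading, tokenization, and any other overhead, so treat it as an illustration rather than a measurement.

```python
# Rough latency estimate from the quoted throughput figures.
PREFILL_TOK_PER_S = 6000  # prefill rate quoted in the article
DECODE_TOK_PER_S = 1200   # decode rate quoted in the article


def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Prefill time plus decode time in milliseconds, ignoring all other overhead."""
    prefill_s = prompt_tokens / PREFILL_TOK_PER_S
    decode_s = output_tokens / DECODE_TOK_PER_S
    return 1000.0 * (prefill_s + decode_s)


# Hypothetical request: 300 prompt tokens (tool schemas plus query), 30 output tokens.
print(f"{estimate_latency_ms(300, 30):.0f} ms")
```

Under these assumptions a single function call comes out at roughly 75 ms, with about two thirds of that spent reading the prompt.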

Architecture: Simple Attention Networks

Needle uses a Simple Attention Network, with no MLPs anywhere: the entire model consists of attention and gating layers. Key design choices: model width d=512, 8 attention heads with 4 key/value heads (8H/4KV), a BPE vocabulary of 8192 tokens, and an encoder-decoder structure (12 encoder layers, 8 decoder layers) using cross-attention, masked self-attention with RoPE, and tied embeddings.
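As a sanity check on the 26M figure, the sketch below counts only the attention projection weights and the tied embedding table implied by the numbers above. Several things here are assumptions, not details from the source: a head dimension of d/H = 64, standard Q/K/V/O projections with grouped key/value heads, cross-attention using the same shapes as self-attention, and no biases. Gating and normalization parameters are left out because the article does not describe them.

```python
# Back-of-the-envelope parameter count from the quoted hyperparameters.
D_MODEL = 512      # d=512
N_HEADS = 8        # 8H
N_KV_HEADS = 4     # 4KV
VOCAB = 8192       # BPE=8192
ENC_LAYERS = 12
DEC_LAYERS = 8

HEAD_DIM = D_MODEL // N_HEADS     # assumption: 64 per head
KV_DIM = N_KV_HEADS * HEAD_DIM    # 256: keys/values are shared across head groups


def attention_params() -> int:
    """Weights in one attention block: Q, K, V and output projections (no biases)."""
    q = D_MODEL * D_MODEL
    k = D_MODEL * KV_DIM
    v = D_MODEL * KV_DIM
    o = D_MODEL * D_MODEL
    return q + k + v + o


embedding = VOCAB * D_MODEL                      # tied, so counted once
encoder = ENC_LAYERS * attention_params()        # self-attention only
decoder = DEC_LAYERS * 2 * attention_params()    # masked self-attention + cross-attention
total = embedding + encoder + decoder

print(f"{total / 1e6:.1f}M")  # prints "26.2M"
```

The result lands close to the advertised size, which suggests the layout assumed here is at least plausible, but it is not a confirmation of the actual architecture.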

Training Details

Pretrained on 200B tokens across 16 TPU v6e (27 hours)
Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
Data synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)
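The training figures above imply some useful ratios. The sketch below works them out, assuming "16 TPU v6e" means 16 chips and that the quoted durations are wall-clock time. The article does not say which hardware was used for post-training, so only an aggregate rate is computed for that stage.

```python
# Ratios implied by the quoted training figures.
PRETRAIN_TOKENS = 200e9
PRETRAIN_SECONDS = 27 * 3600
N_CHIPS = 16                  # assumption: 16 TPU v6e chips
POSTTRAIN_TOKENS = 2e9
POSTTRAIN_SECONDS = 45 * 60
MODEL_PARAMS = 26e6

pretrain_tok_per_s = PRETRAIN_TOKENS / PRETRAIN_SECONDS
pretrain_tok_per_s_per_chip = pretrain_tok_per_s / N_CHIPS
posttrain_tok_per_s = POSTTRAIN_TOKENS / POSTTRAIN_SECONDS
tokens_per_param = PRETRAIN_TOKENS / MODEL_PARAMS

print(f"pretraining:   {pretrain_tok_per_s:,.0f} tok/s total")
print(f"               {pretrain_tok_per_s_per_chip:,.0f} tok/s per chip")
print(f"post-training: {posttrain_tok_per_s:,.0f} tok/s total")
print(f"tokens per parameter: {tokens_per_param:,.0f}")
```

That works out to roughly 2 million tokens per second in aggregate during pretraining, and several thousand pretraining tokens per parameter.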

Benchmark Results

Needle beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling. However, those models are larger and broader in scope, and they excel in conversational settings.

Quickstart

git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground

This opens a web UI at http://127.0.0.1:7860 for testing the model and fine-tuning it on your own tools.

Usage (Python)

from needle import SimpleAttentionNetwork, load_checkpoint, generate, get_tokenizer

params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
tokenizer = get_tokenizer()
result = generate(
    model, params, tokenizer,
    query="What's the weather in San Francisco?",
    tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
    stream=False
)
print(result)
# Output: [{"name":"get_weather","arguments":{"location":"San Francisco"}}]
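Once the model has produced a call, the application still has to execute it. Here is a minimal sketch of that step, assuming `generate` returns a JSON string shaped like the output above (it may return an already parsed object instead). The `get_weather` stub, `TOOL_REGISTRY`, and `dispatch` are hypothetical names for illustration and are not part of the needle package.

```python
import json


def get_weather(location: str) -> str:
    # Hypothetical stand-in for a real weather lookup.
    return f"(stub) weather for {location}"


# Hypothetical registry mapping tool names to Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch(model_output: str) -> list:
    """Parse the model's JSON list of calls and run each one against the registry."""
    calls = json.loads(model_output)
    results = []
    for call in calls:
        name = call.get("name")
        handler = TOOL_REGISTRY.get(name)
        if handler is None:
            # Small models can emit unknown tool names, so fail loudly.
            raise KeyError(f"model requested unknown tool: {name!r}")
        results.append(handler(**call.get("arguments", {})))
    return results


# The sample output shown in the article, used here in place of a live model call.
sample_output = '[{"name":"get_weather","arguments":{"location":"San Francisco"}}]'
print(dispatch(sample_output))  # ['(stub) weather for San Francisco']
```

Looking handlers up in an explicit registry, rather than resolving names dynamically, keeps the model from invoking anything the application did not deliberately expose.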

Fine-tuning Locally

# via playground (auto-generates data via Gemini)
needle playground

# or provide your own data
needle finetune data.jsonl

Availability

Weights are on Hugging Face: Cactus-Compute/needle. Everything is MIT licensed.

📖 Read the full source: HN AI Agents
