RCLI: On-Device Voice AI Pipeline for Apple Silicon

What RCLI Does

RCLI is a complete voice AI pipeline that runs speech-to-text, large language model inference, and text-to-speech entirely on-device on Apple Silicon Macs. It requires macOS 13+ on M1 or later chips and operates without cloud services or API keys.

Installation and Setup

Install via Homebrew:

brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
brew install rcli
rcli setup   # downloads ~1 GB of models

Or using curl:

curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash

Performance Claims

The developers benchmarked on an M4 Max with 64GB RAM and report:

LLM decode: 1.67x faster than llama.cpp, 1.19x faster than Apple MLX
Qwen3-0.6B: 658 tokens/sec (vs mlx-lm 552, llama.cpp 295)
Qwen3-4B: 186 tokens/sec (vs mlx-lm 170, llama.cpp 87)
Time-to-first-token: 6.6 ms
STT: 70 seconds of audio transcribed in 101 ms (714x real-time, 4.6x faster than mlx-whisper)
TTS: 178 ms synthesis (2.8x faster than mlx-audio and sherpa-onnx)

Key Features

Three concurrent threads with lock-free ring buffers
Double-buffered TTS (next sentence renders while current plays)
38 macOS actions controllable by voice
Local RAG with ~4 ms retrieval over 5K+ document chunks
20 hot-swappable models
Full-screen TUI with per-operation latency readouts
Falls back to llama.cpp when MetalRT isn't installed

Voice Pipeline Components

VAD: Silero voice activity detection
STT: Zipformer streaming + Whisper/Parakeet offline
LLM: Qwen3/LFM2/Qwen3.5 with KV cache continuation and Flash Attention
TTS: Double-buffered sentence-level synthesis
Tool Calling: LLM-native tool call formats
Multi-turn Memory: Sliding window conversation history with token-budget trimming

Usage Commands

rcli              # interactive TUI with push-to-talk
rcli listen       # continuous voice mode
rcli ask "open Safari"  # one-shot command
rcli rag ingest ~/Documents/notes  # index documents for RAG
rcli ask --rag ~/Library/RCLI/index "summarize the project plan"

TUI Controls

SPACE: Push-to-talk
M: Models browser for downloading and hot-swapping LLM/STT/TTS
A: Actions browser to enable/disable macOS actions
B: Run STT, LLM, TTS, and end-to-end benchmarks
R: RAG document ingestion
X: Clear conversation and reset context
T: Toggle tool call trace
ESC: Stop/close/quit

MetalRT Engine Details

MetalRT is RunAnywhere's proprietary GPU inference engine that uses Metal 3.1 features available on M3, M3 Pro, M3 Max, M4, and later chips. M1/M2 support is planned. The engine uses custom Metal compute shaders for quantized matmul, attention, and activation operations, compiled ahead of time and dispatched directly to the GPU with zero allocations during inference.

macOS Actions

RCLI includes 43 macOS actions across categories:

Productivity: create_note, create_reminder, run_shortcut
Communication: send_message, facetime_call
Media: play_on_spotify, play_apple_music, play_pause, next_track, set_music_volume
System: open_app, quit_app, set_volume, toggle_dark_mode, screenshot, lock_screen
Web: search_web, search_youtube, open_url, open_maps

📖 Read the full source: HN AI Agents