Running Google Gemma 4 26B-A4B Locally with LM Studio 0.4.0 Headless CLI

What LM Studio 0.4.0 Adds for Local AI
LM Studio 0.4.0 fundamentally changes the architecture by extracting the core inference engine into llmster, a standalone server. This enables running LM Studio entirely from the command line using the new lms CLI, eliminating the need for the GUI. The update makes it usable on headless servers, in CI/CD pipelines, SSH sessions, or for terminal-focused developers.
Key Features in 0.4.0
- llmster daemon: A background service that manages model loading and inference without the desktop app
- lms CLI: Full command-line interface for downloading, loading, chatting, and serving models
- Parallel request processing: Continuous batching instead of sequential queuing, allowing multiple requests to the same model to run concurrently
- Stateful REST API: A new /v1/chat endpoint that maintains conversation history across requests
- MCP integration: Local Model Context Protocol support with permission-key gating
Why Gemma 4 26B-A4B for Local Use
Google's Gemma 4 26B-A4B uses a mixture-of-experts architecture with 128 experts plus 1 shared expert, but only activates 8 experts (3.8B parameters) per token. This means it runs well on hardware that couldn't handle a dense 26B model. On a 14" MacBook Pro M4 Pro with 48GB unified memory, it fits comfortably and generates at 51 tokens/second.
The model scores 82.6% on MMLU Pro and 88.3% on AIME 2026, close to the dense 31B variant (85.2% and 89.2%) while running dramatically faster. It achieves an Elo score of ~1441, competing with models like Qwen 3.5 397B-A17B (~1450 Elo) that require 100-600B total parameters.
Key capabilities include 256K max context, vision support for analyzing screenshots and diagrams, native function/tool calling, and reasoning with configurable thinking modes.
Practical Setup
The article walks through installing the lms CLI and setting up Gemma 4 26B-A4B for local inference that can be used with Claude Code. The author notes significant slowdowns when used within Claude Code from their experience.
📖 Read the full source: HN AI Agents
👀 See Also

Introducing NetViews 2.3: A Robust Network Diagnostic Tool for macOS
NetViews 2.3 combines host discovery, Wi-Fi insights, and real-time monitoring with a streamlined GUI for better network diagnostics on macOS.

llm-idle-timeout Fires at 2 Minutes on N100/WSL2 Despite timeoutSeconds Setting
A user reports that the idle watchdog in OpenClaw fires after 2 minutes on N100/WSL2 hardware, ignoring the timeoutSeconds=300 setting, due to slow gateway startup (45+ seconds) and no configurable noOutputTimeoutMs.

Tatu: Open-source security layer for Claude Code blocks secrets and destructive commands
Tatu is an open-source hook system that intercepts Claude Code actions in real time to block leaked secrets, flag PII, and deny destructive commands before execution. Installation is via pip/pipx with 'tatu-hook init' to enable audit mode.

Testing MiniMax M2.7 via API on Three Real ML and Coding Workflows
A developer benchmarks MiniMax M2.7 against Claude Opus 4.7 on three real tasks: refactoring a PyTorch project, drafting Obsidian notes, and more. Key findings and setup included.