Cerebras releases Step-3.5-Flash-REAP models with 40% memory reduction

What this is
Cerebras has released Step-3.5-Flash-REAP models, memory-efficient compressed variants of the larger Step-3.5-Flash model. The source pitches them at what it calls "potato setups," though the 121B-parameter model still requires significant resources.
Key details from the source
The models are available on Hugging Face. The Step-3.5-Flash-REAP-121B-A11B model is pruned from 196B to 121B parameters, which the source describes as a 40% memory reduction with near-identical performance to the full model.
The compression uses REAP (Router-weighted Expert Activation Pruning), described as "a novel expert pruning method that selectively removes redundant experts while preserving the router's independent control over remaining experts."
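The source does not give the scoring formula, so the following is only a minimal sketch of one plausible reading of "router-weighted expert activation pruning": score each expert by the average of (router gate weight × norm of the expert's output) over the tokens routed to it, then drop the lowest-scoring experts and leave the router's weights for the survivors untouched. The function names, the top-k routing assumption, and the toy numbers are all hypothetical and are not taken from Cerebras's implementation.

```python
from typing import List, Sequence


def reap_saliency(
    gate_weights: Sequence[Sequence[float]],
    output_norms: Sequence[Sequence[float]],
    top_k: int,
) -> List[float]:
    """Average router-weighted activation norm per expert (illustrative).

    gate_weights[t][j]: router weight of expert j for token t.
    output_norms[t][j]: norm of expert j's output for token t.
    Only the top_k experts selected for each token contribute to the score.
    """
    num_experts = len(gate_weights[0])
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for gates, norms in zip(gate_weights, output_norms):
        # Experts the router would actually activate for this token.
        active = sorted(range(num_experts), key=lambda j: gates[j], reverse=True)[:top_k]
        for j in active:
            totals[j] += gates[j] * norms[j]
            counts[j] += 1
    # Experts that were never routed to get a score of zero.
    return [totals[j] / counts[j] if counts[j] else 0.0 for j in range(num_experts)]


def experts_to_keep(saliency: Sequence[float], keep: int) -> List[int]:
    """Indices of the `keep` highest-scoring experts, in ascending order."""
    ranked = sorted(range(len(saliency)), key=lambda j: saliency[j], reverse=True)
    return sorted(ranked[:keep])


# Toy calibration data: 3 tokens, 4 experts, top-2 routing (made-up values).
gates = [
    [0.6, 0.3, 0.1, 0.0],
    [0.5, 0.1, 0.4, 0.0],
    [0.1, 0.2, 0.3, 0.4],
]
norms = [
    [1.0, 2.0, 1.0, 1.0],
    [1.0, 1.0, 0.5, 1.0],
    [1.0, 1.0, 1.0, 3.0],
]

scores = reap_saliency(gates, norms, top_k=2)
kept = experts_to_keep(scores, keep=3)
print(kept)  # expert 2 has the lowest score and is pruned
```

In this toy run expert 2 is removed because it is routed to with modest gate weights and produces small outputs. A real pruning pass would compute such scores per layer over a calibration dataset.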
Features and capabilities
- Near-lossless performance: Maintains almost identical accuracy on code generation, agentic coding, and function calling tasks compared to the full 196B model
- 40% memory reduction: Compressed from 196B to 121B parameters, lowering deployment costs and memory requirements
- Preserved capabilities: Retains all core functionalities including code generation, math & reasoning, and tool calling
- Drop-in compatibility: Works with vanilla vLLM, with no source modifications or custom patches required
- Optimized for real-world use: Particularly effective for resource-constrained environments, local deployments, and academic research
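If the drop-in vLLM claim holds, serving the model should look like serving any other checkpoint. The sketch below only assembles a `vllm serve` command line and does not run it. The Hugging Face repository ID (in particular the `cerebras/` prefix), the tensor-parallel size, and the context length are assumptions chosen for illustration and should be checked against the model card.

```python
import shlex
from typing import List

# Assumed repository ID: the model name comes from the source, the
# "cerebras/" organization prefix is a guess.
MODEL_ID = "cerebras/Step-3.5-Flash-REAP-121B-A11B"


def build_serve_command(model_id: str, tensor_parallel_size: int, max_model_len: int) -> List[str]:
    """Assemble a stock `vllm serve` invocation with no custom patches."""
    return [
        "vllm", "serve", model_id,
        # Shard the weights across this many GPUs (illustrative value below).
        "--tensor-parallel-size", str(tensor_parallel_size),
        # Cap the context length to bound KV-cache memory (illustrative value).
        "--max-model-len", str(max_model_len),
    ]


command = build_serve_command(MODEL_ID, tensor_parallel_size=8, max_model_len=32768)
print(shlex.join(command))
```

The printed command can be pasted into a shell on a machine with vLLM installed. The right GPU count depends on the precision of the checkpoint and the memory available per device.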
The source notes that while these are "smaller versions," the 121B model still requires a fairly powerful setup despite the compression.
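A quick back-of-the-envelope calculation shows why the 121B model is still demanding. The parameter counts come from the source; the bytes-per-parameter figures are assumptions about common storage precisions, and the estimate covers weights only, ignoring KV cache and activations. Note that 196B to 121B works out to roughly a 38% drop in parameters, which the source rounds to 40%.

```python
FULL_PARAMS_B = 196    # billions of parameters, from the source
PRUNED_PARAMS_B = 121  # billions of parameters, from the source

# Assumed storage precisions (bytes per parameter); not stated in the source.
BYTES_PER_PARAM = {"bf16": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def param_reduction(full_b: float, pruned_b: float) -> float:
    """Fraction of parameters removed by pruning."""
    return 1.0 - pruned_b / full_b


def weight_memory_gb(params_b: float, bytes_per_param: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return params_b * bytes_per_param


reduction = param_reduction(FULL_PARAMS_B, PRUNED_PARAMS_B)
print(f"parameter reduction: {reduction:.1%}")

for name, nbytes in BYTES_PER_PARAM.items():
    full_gb = weight_memory_gb(FULL_PARAMS_B, nbytes)
    pruned_gb = weight_memory_gb(PRUNED_PARAMS_B, nbytes)
    print(f"{name}: full ~{full_gb:.1f} GB, pruned ~{pruned_gb:.1f} GB")
```

Under these assumptions the pruned weights alone need about 242 GB at 16-bit precision and about 60 GB at 4-bit, which is well beyond a single consumer GPU.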
📖 Read the full source: r/LocalLLaMA