Mercury 2: Diffusion-Based Model for Real-Time AI Coding

What Mercury 2 Is
Mercury 2 is a diffusion-based AI model that generates tokens in parallel rather than sequentially, using a process that refines output over multiple steps. This approach differs from traditional autoregressive models that decode tokens one by one.
Technical Specifications
- Generation method: Diffusion-based generation instead of sequential token-by-token decoding
- Processing approach: Generates tokens in parallel and refines them over a few steps
- Performance: Claims 1,009 tokens/sec on NVIDIA Blackwell GPUs
- Pricing: $0.25 per 1 million input tokens, $0.75 per 1 million output tokens
- Context window: 128K tokens
- Reasoning capability: Tunable reasoning
- Tool integration: Native tool use with schema-aligned JSON output
- API compatibility: OpenAI API compatible
Target Use Cases
The developers are positioning Mercury 2 for:
- Coding assistants
- Agentic loops (multi-step inference chains)
- Real-time voice systems
- RAG/search pipelines with multi-hop retrieval
📖 Read the full source: r/LocalLLaMA
👀 See Also

Claude MAX Plan Now Includes 1M Token Context Window at No Extra Cost
The Claude MAX plan has been automatically upgraded to include a 1 million token context window without additional API-based usage charges, with users reporting significantly reduced token usage and elimination of context window management overhead.

Claude Code v2.1.129: Autonomous Loop Persistence Guidance and Background Agent State Classifier
Claude Code v2.1.129 adds CLAUDE_CODE_LOOP_PERSISTENT system prompt for autonomous work loops, removes verification specialist subagent, and expands background agent state classifier with detailed boundaries.

Setting Up Subagents in OpenClaw: Key Considerations
Users experimenting with OpenClaw are facing issues with setting up subagents, particularly when modifying JSON files.

Qwen KV Cache Quantization Deep Dive: PPL, KL Divergence, and Asymmetric K/V Results
Second round of benchmarks on Qwen 3.6-35B-A3B with KV cache quantization: perplexity, KL divergence, asymmetric K/V combos, and 64K context depth on Apple M5 Max.