Merlin Research releases Qwen3.5-4B-Safety-Thinking model for structured reasoning

Merlin Research has released Qwen3.5-4B-Safety-Thinking, a 4 billion parameter safety-aligned reasoning model built on Qwen3.5. This model is specifically designed for structured 'thinking' and safety applications in real-world scenarios, with particular focus on agent systems.
Key improvements and features
- Improved ability to accurately follow strict instructions in prompts
- Based on the use of Bloom and Petri methods from Anthropic
- Resistant to hacking attempts
- Increased resistance to 'abnormal' and adversarial prompts
- Up to 1 million token context window
- Uses frameworks from Anthropic - Bloom and Petri
The model is available on Hugging Face at MerlinSafety/Qwen3.5-4B-Safety-Thinking.
For developers working with AI agents, this model represents a specialized tool for safety-critical applications where structured reasoning and resistance to prompt manipulation are priorities. The integration of Anthropic's Bloom and Petri methods suggests a focus on constitutional AI approaches to alignment.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Claude App Ranks Second in US App Store After Pentagon Dispute
Anthropic's Claude chatbot app rose to number two among free apps in Apple's US App Store, climbing from outside the top 100 in late January to second place by late February 2026. This surge followed the company's public negotiations with the Pentagon over AI usage restrictions.

Anthropic API Billing Bug: Sonnet Model Charged at Opus Rates
A user discovered that the Anthropic API is incorrectly billing the claude-sonnet-4-6 model at Opus pricing rates, despite returning the correct model string. The bug was identified through analysis of raw event data showing a cost discrepancy.

KV Cache Architecture Evolution: From GPT-2 to Mamba
Analysis of KV cache memory costs shows GPT-2 used 300 KiB/token, Llama 3 reduced it to 128 KiB/token with grouped-query attention, and DeepSeek V3 achieved 68.6 KiB/token with multi-head latent attention. Mamba/SSMs eliminate KV cache entirely with fixed-size hidden states.

Benchmark shows smaller 4B model outperforms larger LLMs for phone-to-home chat applications
A benchmark of 8 local LLMs for phone-to-home chat applications found Gemma3:4B won with a composite fitness score of 88.7 despite being the smallest model, outperforming larger models up to 24B parameters due to faster response times and lower thermal load.