Reasoning Guard: Proxy-Level Loop Detection for Local LLM Inference

A developer running Qwen3.6 MoE behind a vLLM proxy hit a common reliability issue: runaway reasoning loops where the model repeats itself inside a reasoning block, burning tokens and stalling agents. At 180+ tokens/sec, even a 20–30 second loop wastes GPU time and blocks client requests. They built a lightweight guard that lives in the proxy layer and enforces deterministic checks on the streaming output before it reaches the client.
Architecture
Client → Proxy → vLLM → Model
The proxy intercepts the streaming response as it leaves vLLM. It does not modify model weights, call a second LLM, or use embeddings or semantic analysis. All checks are cheap and deterministic.
What It Checks
- Reasoning token caps (configurable per effort level)
- Repeated paragraph detection
- Sliding-window n-gram repetition
- Repeated sentence fingerprinting
- Fuzzy opening-pattern detection (catches loops like “Actually, I think I’ve found it…”)
- Cut-and-continue recovery path
Recovery Flow
When the guard triggers, it:
- Stops the upstream stream
- Captures the reasoning produced so far
- Reissues the request with that reasoning baked in as prior assistant context
- Disables thinking for the continuation
- Merges phase 1 and phase 2 usage stats
Because vLLM prefix caching is already active, the continuation is effectively seamless. Phase 2 usually resumes with ~50–100ms TTFT, so the client sees reasoning flow directly into the final answer instead of hanging.
Observability
The proxy logs each trigger with:
- Whether the guard fired
- Trigger reason
- Token cap used
- Reasoning token count
- Merged total usage
- Stream-end metadata
Result
Before: occasional 2000+ token reasoning blocks that went nowhere. After: the model still reasons when useful, but runaway thinking gets cut and redirected into an answer. The author describes it as a “proxy-level seatbelt for local LLM inference.”
No model surgery, no extra LLM calls — just stream interception, token counting, loop detection, and a clean recovery path. The guard has been validated end-to-end through the live proxy against real trace logs.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Coasts: Containerized Hosts for Running Multiple Localhost Environments
Coasts is a Docker-in-Docker solution that solves the problem of running multiple localhost environments simultaneously, handling port conflicts, secrets, and volume topologies without requiring complex scripting.

OpenClaw-superpowers adds reliability features for operational failure modes
The openclaw-superpowers repository has expanded with eight new reliability-focused skills including deployment preflight checks, cron execution proofing, session reset recovery, and MCP auth lifecycle management. These additions bring the total to 60 skills, with 44 being OpenClaw-native and 23 designed for cron scheduling.

Why AI Bounty Hunters Are Losing Money: Data from 60 Issues
A developer tried to make Claude earn money on open-source bounties with a $20 token budget. After scanning 80+ Algora bounties, they found most are saturated with 10+ open PRs, $1 spam, or reserved for interviews. Expected value: $0.
Cocall.ai MCP: Outbound Phone Calls with Real-Time Human Escalation
Cocall.ai is an MCP for Claude that enables outbound phone calls with a full-duplex speech-to-speech model. It can pause mid-call to ask you a specific question instead of guessing, navigate IVR, and hand off calls to you when needed.