Reasoning Guard: Proxy-Level Loop Detection for Local LLM Inference

✍️ OpenClawRadar📅 Published: April 30, 2026🔗 Source
Reasoning Guard: Proxy-Level Loop Detection for Local LLM Inference
Ad

A developer running Qwen3.6 MoE behind a vLLM proxy hit a common reliability issue: runaway reasoning loops where the model repeats itself inside a reasoning block, burning tokens and stalling agents. At 180+ tokens/sec, even a 20–30 second loop wastes GPU time and blocks client requests. They built a lightweight guard that lives in the proxy layer and enforces deterministic checks on the streaming output before it reaches the client.

Architecture

Client → Proxy → vLLM → Model

The proxy intercepts the streaming response as it leaves vLLM. It does not modify model weights, call a second LLM, or use embeddings or semantic analysis. All checks are cheap and deterministic.

What It Checks

  • Reasoning token caps (configurable per effort level)
  • Repeated paragraph detection
  • Sliding-window n-gram repetition
  • Repeated sentence fingerprinting
  • Fuzzy opening-pattern detection (catches loops like “Actually, I think I’ve found it…”)
  • Cut-and-continue recovery path
Ad

Recovery Flow

When the guard triggers, it:

  • Stops the upstream stream
  • Captures the reasoning produced so far
  • Reissues the request with that reasoning baked in as prior assistant context
  • Disables thinking for the continuation
  • Merges phase 1 and phase 2 usage stats

Because vLLM prefix caching is already active, the continuation is effectively seamless. Phase 2 usually resumes with ~50–100ms TTFT, so the client sees reasoning flow directly into the final answer instead of hanging.

Observability

The proxy logs each trigger with:

  • Whether the guard fired
  • Trigger reason
  • Token cap used
  • Reasoning token count
  • Merged total usage
  • Stream-end metadata

Result

Before: occasional 2000+ token reasoning blocks that went nowhere. After: the model still reasons when useful, but runaway thinking gets cut and redirected into an answer. The author describes it as a “proxy-level seatbelt for local LLM inference.”

No model surgery, no extra LLM calls — just stream interception, token counting, loop detection, and a clean recovery path. The guard has been validated end-to-end through the live proxy against real trace logs.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also