Local Qwen Models Achieve Browser Automation with Stepwise Planning and Compact DOM

✍️ OpenClawRadar · 📅 Published: March 17, 2026

Stepwise Planning Overcomes Upfront Planning Failures

The developer found that asking a model to invent a full multi-step plan before it sees the real page state works on familiar sites but breaks quickly when unexpected elements appear. Stepwise planning worked better: the model replans from the current DOM snapshot at each step.

Example Flow on Ace Hardware

With Qwen 8B as planner and 4B as executor, the agent completed a full cart flow on Ace Hardware, a site the model had no prior task for, without using a vision model at all. The stepwise run looked like this:

Step 1: see search box → TYPE "grass mower"
Step 2: see results → CLICK Add to Cart
Step 3: drawer appears → dismiss it
Step 4: cart visible → CLICK View Cart
Step 5: DONE
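
The flow above can be sketched as a loop that takes a fresh snapshot before every decision. This is a minimal illustration, not the developer's code: `Action`, `run_stepwise`, `FakePage`, and `rule_planner` are hypothetical names, and the rule-based planner only stands in for the model call so the loop can run without a browser or an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    kind: str                     # "TYPE", "CLICK", "DISMISS", or "DONE"
    target: Optional[str] = None  # element the action applies to
    text: Optional[str] = None    # text to type, if any


def run_stepwise(
    goal: str,
    snapshot: Callable[[], str],
    plan_next: Callable[[str, str, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Plan exactly one action per step, always from the current page state."""
    history: List[Action] = []
    for _ in range(max_steps):
        state = snapshot()                       # re-read the page every step
        action = plan_next(goal, state, history)
        history.append(action)
        if action.kind == "DONE":
            break
        execute(action)
    return history


class FakePage:
    """Toy stand-in for a browser: each executed action advances the state."""

    STATES = ["search box", "results", "drawer", "cart visible", "cart page"]

    def __init__(self) -> None:
        self.index = 0

    def snapshot(self) -> str:
        return self.STATES[self.index]

    def execute(self, action: Action) -> None:
        self.index = min(self.index + 1, len(self.STATES) - 1)


def rule_planner(goal: str, state: str, history: List[Action]) -> Action:
    """Placeholder for the planner model: maps the visible state to one action."""
    rules = {
        "search box": Action("TYPE", "search box", goal),
        "results": Action("CLICK", "Add to Cart"),
        "drawer": Action("DISMISS", "drawer"),
        "cart visible": Action("CLICK", "View Cart"),
    }
    return rules.get(state, Action("DONE"))


if __name__ == "__main__":
    page = FakePage()
    for step in run_stepwise("grass mower", page.snapshot, rule_planner, page.execute):
        print(step)
```

Because the planner never commits to more than one action, an unexpected state such as the drawer in step 3 simply becomes the input to the next decision instead of invalidating a plan made in advance.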

Compact DOM Representation Enables Small Models

The model never sees raw HTML or screenshots, only a semantic table representation:

id|role|text|importance|bg|clickable|nearby_text
665|button|Proceed to checkout|675|orange|1|
761|button|Add to cart|720|yellow|1|$299.99
1488|link|ThinkPad E16|478|none|1|Laptop 16"

This allows the 4B executor to pick an element ID from a short list. Vision approaches burn 2-3K tokens per screenshot, which easily adds up to 50-100K+ tokens for a full flow, while compact snapshots use about 15K tokens in total for the same task.

Modal Handling Critical for Success

After each click, if the DOM suddenly grows, the agent scans for dismiss patterns (close, ×, no thanks, etc.) before planning again. This fixed many failures that looked like "bad reasoning" but were actually caused by overlays covering the page.
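
The check described above might look roughly like the following sketch. The source only lists "close", "×", and "no thanks" as dismiss patterns; the 20% growth threshold, the function names, and the `(id, text)` element shape are assumptions chosen for the example.

```python
import re
from typing import List, Optional, Tuple

# Patterns named in the article; a real agent would likely need a longer list.
DISMISS_PATTERN = re.compile(r"close|×|no,? thanks", re.IGNORECASE)


def dom_grew(before_count: int, after_count: int, ratio: float = 1.2) -> bool:
    """Treat a jump in element count after a click as a sign of a new overlay.

    The 1.2 ratio is an assumed threshold, not a value from the source.
    """
    return before_count > 0 and after_count >= before_count * ratio


def find_dismiss_target(elements: List[Tuple[int, str]]) -> Optional[int]:
    """Return the ID of the first element whose whole label is a dismiss pattern."""
    for element_id, text in elements:
        if DISMISS_PATTERN.fullmatch(text.strip()):
            return element_id
    return None


def overlay_dismiss_id(
    before_count: int,
    after_elements: List[Tuple[int, str]],
    ratio: float = 1.2,
) -> Optional[int]:
    """Only look for a dismiss control when the DOM grew after the last click."""
    if not dom_grew(before_count, len(after_elements), ratio):
        return None
    return find_dismiss_target(after_elements)


if __name__ == "__main__":
    after_click = [(i, f"item {i}") for i in range(10)] + [
        (900, "Sign up for deals"),
        (901, "No thanks"),
        (902, "Subscribe"),
    ]
    print(overlay_dismiss_id(before_count=10, after_elements=after_click))
```

Matching the whole label rather than a substring keeps buttons such as "Close account" from being treated as overlay controls, at the cost of missing longer dismiss labels.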

The developer asks whether others are also seeing stepwise planning beat upfront planning once sites become unfamiliar.

📖 Read the full source: r/LocalLLaMA
