Local Qwen Models Achieve Browser Automation with Stepwise Planning and Compact DOM

Stepwise Planning Overcomes Upfront Planning Failures
The developer found that asking a model to invent a full multi-step plan before it sees the real page state works on familiar sites but breaks quickly when unexpected elements appear. Stepwise planning worked better: at each step, the model replans from the current DOM snapshot.
Example Flow on Ace Hardware
With Qwen 8B as planner and a 4B model as executor, the agent completed a full cart flow on Ace Hardware, a site the model had no prior task for, without using a vision model at all. The stepwise run looked like this:
- Step 1: see search box → TYPE "grass mower"
- Step 2: see results → CLICK Add to Cart
- Step 3: drawer appears → dismiss it
- Step 4: cart visible → CLICK View Cart
- Step 5: DONE
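The flow above can be sketched as a loop that takes a fresh snapshot before every action and asks the planner for one step only. This is a minimal illustration, not the developer's code: `Action`, `run_stepwise`, `FakePage`, and `rule_planner` are hypothetical names, and the rule-based planner stands in for the Qwen planner so the loop runs without a browser or a model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str          # "TYPE", "CLICK", or "DONE"
    target: str = ""   # element description or ID
    text: str = ""     # text to type, if any


def run_stepwise(
    goal: str,
    snapshot: Callable[[], str],
    plan_next: Callable[[str, str, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Replan from a fresh snapshot before every single action."""
    history: List[Action] = []
    for _ in range(max_steps):
        state = snapshot()                         # observe the page as it is now
        action = plan_next(goal, state, history)   # plan one step ahead, no further
        history.append(action)
        if action.kind == "DONE":
            break
        execute(action)
    return history


class FakePage:
    """Toy page (assumption): each action advances to the next scripted state."""

    TRANSITIONS = {
        "search box": "results",
        "results": "drawer",
        "drawer": "cart visible",
        "cart visible": "cart page",
    }

    def __init__(self) -> None:
        self.state = "search box"

    def snapshot(self) -> str:
        return self.state

    def execute(self, action: Action) -> None:
        self.state = self.TRANSITIONS[self.state]


def rule_planner(goal: str, state: str, history: List[Action]) -> Action:
    """Stand-in for the planner model: maps the visible state to one action."""
    rules = {
        "search box": Action("TYPE", "search box", "grass mower"),
        "results": Action("CLICK", "Add to Cart"),
        "drawer": Action("CLICK", "dismiss"),
        "cart visible": Action("CLICK", "View Cart"),
    }
    return rules.get(state, Action("DONE"))


page = FakePage()
trace = run_stepwise("add a mower to the cart", page.snapshot, rule_planner, page.execute)
print([a.kind for a in trace])
```

Because the planner only ever answers "what next, given this page", an unexpected drawer in step 3 is just another state to react to rather than a deviation from a precomputed plan.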
Compact DOM Representation Enables Small Models
The model never sees raw HTML or screenshots, only a compact semantic table:
id|role|text|importance|bg|clickable|nearby_text
665|button|Proceed to checkout|675|orange|1|
761|button|Add to cart|720|yellow|1|$299.99
1488|link|ThinkPad E16|478|none|1|Laptop 16"
This lets the 4B executor pick an element ID from a short list. Vision approaches spend 2-3K tokens per screenshot, which easily adds up to 50-100K+ tokens over a full flow, while compact snapshots use about 15K tokens in total for the same task.
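To make the format concrete, here is a small sketch that parses the pipe-delimited rows shown above and lists the clickable candidates. The function names are hypothetical, and two things are assumptions rather than facts from the post: that a higher `importance` value means a more prominent element, and that field text never contains a `|` character.

```python
from typing import Dict, List

# The sample rows from the article, copied as-is.
SNAPSHOT = '''id|role|text|importance|bg|clickable|nearby_text
665|button|Proceed to checkout|675|orange|1|
761|button|Add to cart|720|yellow|1|$299.99
1488|link|ThinkPad E16|478|none|1|Laptop 16"'''


def parse_snapshot(raw: str) -> List[Dict[str, str]]:
    """Turn the header line plus data lines into one dict per element."""
    lines = raw.strip().splitlines()
    header = lines[0].split("|")
    return [dict(zip(header, line.split("|"))) for line in lines[1:]]


def clickable_by_importance(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep clickable rows, most important first (assumes higher = more important)."""
    clickable = [r for r in rows if r["clickable"] == "1"]
    return sorted(clickable, key=lambda r: int(r["importance"]), reverse=True)


rows = parse_snapshot(SNAPSHOT)
ranked = clickable_by_importance(rows)
print([r["id"] for r in ranked])  # ['761', '665', '1488']
```

An executor prompt built from such a list only needs the model to answer with one ID, which is a far narrower task than interpreting a screenshot.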
Modal Handling Critical for Success
After each click, if the DOM suddenly grows, the agent scans for dismiss patterns (close, ×, no thanks, etc.) before planning again. This fixed many failures that looked like "bad reasoning" but were actually caused by hidden overlays.
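A minimal sketch of that check, under stated assumptions: the post does not say how "suddenly grows" is measured, so the 1.2x element-count threshold here is a hypothetical choice, as are the function names and the sample rows (IDs 900 and 901 are invented for the demo). Only the three dismiss texts come from the post.

```python
from typing import Dict, List, Optional

# Dismiss texts named in the post; a real list would likely be longer.
DISMISS_TEXTS = {"close", "×", "no thanks"}


def dom_grew(before: int, after: int, ratio: float = 1.2) -> bool:
    """Flag a sudden jump in element count (the threshold is an assumption)."""
    return after > before * ratio


def find_dismiss(rows: List[Dict[str, str]]) -> Optional[str]:
    """Return the ID of the first clickable element whose text is a dismiss label."""
    for row in rows:
        if row["clickable"] == "1" and row["text"].strip().lower() in DISMISS_TEXTS:
            return row["id"]
    return None


# Invented rows: the snapshot before a click, and after an overlay appears.
before = [{"id": "761", "text": "Add to cart", "clickable": "1"}]
after = before + [
    {"id": "900", "text": "Protect your purchase", "clickable": "0"},
    {"id": "901", "text": "No thanks", "clickable": "1"},
]

modal_id = find_dismiss(after) if dom_grew(len(before), len(after)) else None
print(modal_id)  # 901
```

Matching on the whole label rather than a substring avoids false hits such as a "Closeout deals" link, at the cost of missing variants like "Close dialog".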
The developer asks whether others are also seeing stepwise planning beat upfront planning once sites become unfamiliar.
📖 Read the full source: r/LocalLLaMA