DoomVLM: Open Source Tool for Testing Vision Language Models in Doom Deathmatches

What DoomVLM Does
DoomVLM is a Jupyter notebook that tests vision language models (VLMs) by having them play Doom. It takes screenshots from ViZDoom, draws a numbered column grid on top, and sends the image to any VLM via an OpenAI-compatible API. The model has two tools: shoot(column) and move(direction), with tool_choice: "required". This is pure vision inference—no reinforcement learning or fine-tuning.
Key Features and Updates
- Deathmatch Modes: Two modes added. Benchmark—models take turns playing against bots under identical conditions for fair comparison. Arena—everyone plays simultaneously via multiprocessing; whoever inferences faster gets more turns.
- Multi-Agent Support: Up to 4 agents, each fully configurable in the UI: system prompt, tool descriptions, sampling parameters, message history length, grid columns, etc. You can pit different model sizes against each other (0.8B vs 4B vs 9B) or different models (Qwen vs GPT-4o).
- API Compatibility: Works with any OpenAI-compatible API—LM Studio, Ollama, vLLM, OpenRouter, OpenAI, Claude. Just swap the URL and model in settings.
- Recording and Logging: Episode recording in GIF/MP4 with overlays showing HP, ammo, model decisions, and latency. Live scoreboard in Jupyter. All results saved to
workspace/folder (logs, videos, screenshots). Can download everything as a single ZIP.
Performance and Setup
Performance: On a MacBook M1 Pro 16GB, the 0.8B model takes ~10 seconds per step. On a RunPod L40S, it takes 0.5 seconds. You need a GPU for proper arena gameplay.
Quick start:
LM Studio → lms get qwen-3.5-0.8b → lms server start → pip install -r requirements.txt → jupyter lab doom_vlm.ipynb → Run All
The whole project is a single Jupyter notebook under MIT license.
Current State and Observations
The developer hasn't found universal prompts that let Qwen 3.5 consistently beat every scenario. General observation: simpler, shorter prompts yield better results; models choke on overly detailed instructions.
Flagship models like GPT-4o or Claude haven't been tested yet, though the interface supports them—you can run them from your local machine with no GPU, just plug in the API key.
The tool is now polished, and exploration of which model/prompt/setting combinations work best is just beginning. The developer encourages sharing findings: interesting prompts, surprising results with different models, settings that helped. Post gameplay videos from the workspace/ folder.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Developer builds Rust compression library with Claude Opus 4.6, questions utility
A developer used Claude Opus 4.6 for two weeks to create a 15,800-line Rust compression library with 449 passing tests, Python bindings, and C FFI layer, but questions whether another compression library was needed.

Obsidian Integration for Persistent Memory in OpenClaw and Claude Code
A Reddit user demonstrates how connecting OpenClaw and Claude Code to an Obsidian vault creates persistent long-term memory across sessions. The setup automatically links memories, context, project files, and notes, with all instances able to access shared memory when needed.

Bullshit Benchmark Tests LLM Resistance to Nonsensical Prompts
The Bullshit Benchmark evaluates whether AI models identify and push back on obvious nonsense prompts instead of confidently generating incorrect answers. Results show Claude models perform significantly better than Gemini models at detecting nonsensical questions.

Agent Browser Shield: Free OpenClaw Extension Blocks Prompt Injection & Dark Patterns
PixieBrix releases Agent Browser Shield, a free source-available browser extension for OpenClaw that blocks prompt injection, dark patterns, and context pollution while cutting token usage.