DoomVLM: Test Vision Language Models in Doom Deathmatches

What DoomVLM Does

DoomVLM is a Jupyter notebook that tests vision language models (VLMs) by having them play Doom. It takes screenshots from ViZDoom, draws a numbered column grid on top, and sends the image to any VLM via an OpenAI-compatible API. The model has two tools: shoot(column) and move(direction), with tool_choice: "required". This is pure vision inference—no reinforcement learning or fine-tuning.

Key Features and Updates

Deathmatch Modes: Two modes added. Benchmark—models take turns playing against bots under identical conditions for fair comparison. Arena—everyone plays simultaneously via multiprocessing; whoever inferences faster gets more turns.
Multi-Agent Support: Up to 4 agents, each fully configurable in the UI: system prompt, tool descriptions, sampling parameters, message history length, grid columns, etc. You can pit different model sizes against each other (0.8B vs 4B vs 9B) or different models (Qwen vs GPT-4o).
API Compatibility: Works with any OpenAI-compatible API—LM Studio, Ollama, vLLM, OpenRouter, OpenAI, Claude. Just swap the URL and model in settings.
Recording and Logging: Episode recording in GIF/MP4 with overlays showing HP, ammo, model decisions, and latency. Live scoreboard in Jupyter. All results saved to workspace/ folder (logs, videos, screenshots). Can download everything as a single ZIP.

Performance and Setup

Performance: On a MacBook M1 Pro 16GB, the 0.8B model takes ~10 seconds per step. On a RunPod L40S, it takes 0.5 seconds. You need a GPU for proper arena gameplay.

Quick start:

LM Studio → lms get qwen-3.5-0.8b → lms server start → pip install -r requirements.txt → jupyter lab doom_vlm.ipynb → Run All

The whole project is a single Jupyter notebook under MIT license.

Current State and Observations

The developer hasn't found universal prompts that let Qwen 3.5 consistently beat every scenario. General observation: simpler, shorter prompts yield better results; models choke on overly detailed instructions.

Flagship models like GPT-4o or Claude haven't been tested yet, though the interface supports them—you can run them from your local machine with no GPU, just plug in the API key.

The tool is now polished, and exploration of which model/prompt/setting combinations work best is just beginning. The developer encourages sharing findings: interesting prompts, surprising results with different models, settings that helped. Post gameplay videos from the workspace/ folder.

📖 Read the full source: r/LocalLLaMA