Building a Slay the Spire 2 Agent with Local LLMs: Lessons and Open Problems

A developer has built an agent that plays Slay the Spire 2 using local LLMs via KoboldCPP/Ollama. The game is exposed as a REST API through a community mod, and the agent sits in the middle: reads game state → calls LLM with tools → executes the action → repeat.
Setup and Performance
Setup uses Qwen3.5-27B (Q4_K_M) on RTX 4090 via KoboldCPP. Performance metrics: ~10 seconds per action, ~88% action success rate. Best result: beating the Act 1 boss. The project is available on GitHub at https://github.com/Alex5418/STS2-Agent.
What Works
- State-based tool routing — Instead of exposing 20+ tools at once, only 1-3 tools relevant to the current game state are provided. Combat gets
play_card,end_turn,use_potion. Map screen getschoose_map_node. This dramatically reduced hallucinated tool calls. - Single-tool mode — Small models can't predict how game state changes after an action (e.g., card indices shift after playing a card). So only the first tool call per response is executed, then game state is re-fetched and the model is asked again. Slower but much more reliable.
- Text-based tool call parser (fallback) — KoboldCPP often outputs tool calls as text instead of structured JSON. A multi-pattern regex fallback catches formats like:
json [{"name": "play_card", "arguments": {...}}],Made a function call ... to play_card with arguments = {...},play_card({"card_index": 1, "target": "NIBBIT_0"}), and bare mentions of no-arg tools likeend_turn. This recovers maybe 15-20% of actions that would otherwise be lost. - Energy guard — Client-side tracking of remaining energy. If the model tries to play a card it can't afford, the API call is blocked and the turn is auto-ended. This prevents the most common error loop (model retries the same unaffordable card 3+ times).
- Smart-wait for enemy turns — During the enemy's turn, the game state says "Play Phase: False." Instead of wasting an LLM call on this, the agent polls every 1s until it's the player's turn again.
Open Problems
- Model doesn't follow system prompt rules consistently — System prompt says things like "if enemy intent is Attack, play Defend cards FIRST." The model follows this maybe 30% of the time. The other 70% it just plays attacks regardless. Attempted solutions: stronger wording ("You MUST block first"), few-shot examples in the prompt, injecting computed hints ("WARNING: 15 incoming damage"). None are reliable. Question: Is there a better prompting strategy for getting small models to follow conditional rules? Or is this a fundamental limitation at 27B?
- Tool calling reliability with KoboldCPP — Even with the text fallback parser, about 12% of responses produce no usable tool call. The model sometimes outputs empty
<think></think>blocks followed by malformed JSON. The Ollama OpenAI compatibility layer also occasionally returnsargumentsas a string instead of a dict. Question: Has anyone found a model that's particularly reliable at tool calling at the 14-30B range? The developer has tried Phi-4 (14B) briefly but hasn't done a proper comparison. Considering Mistral-Small or Command-R. - Context window management — Each game state is ~800-1500 tokens as markdown. With system prompt (~500 tokens) and conversation history, context fills up fast. Currently keeps only the last 5 exchanges and resets history on state transitions (combat → map, etc.). But the model has no memory across fights — it can't learn from mistakes. Question: Would a rolling summary approach work? Like condensing the last combat into "You fought Jaw Worm. Took 15 damage because you didn't block turn 2. Won in 4 turns."
- Better structured output from local models — The core problem is needing the model to output a JSON tool call, but what it really wants to do is think in natural language first. Qwen3.5 uses
<think>blocks which are stripped out, but sometimes the thinking and the tool call get tangled together. Question: Would a two-stage approach work better? Stage 1: "Analyze the game state and decide what to do" (free text). Stage 2: "Now output exactly one tool call" (constrained). This doubles latency but might improve reliability. Has anyone tried this pattern? - A/B testing across models — The developer has a JSONL logging system that records actions for comparison.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Building Custom Image Analysis Skills in OpenClaw with Local Models
A developer created a custom OpenClaw skill to analyze images using Qwen2.5 VL running locally via Ollama on Windows 11 with WSL, bypassing the WebUI's image limitations through API calls and custom scripts.

Practical Lessons from Building a 350K-Line Codebase Solo with AI Agents
A developer shares concrete engineering insights from building a 356K-line production codebase in 52 days using AI agents, including how codebase structure affects agent output and why strong typing is essential.

Claude Code AI Agent Controls Physical iPhone via Accessibility APIs
A developer demonstrated Claude Code autonomously operating a physical iPhone through the Blitz Mac app, using WebDriverAgent and accessibility APIs with a zero-distance swipe workaround for taps.

Practical experience replacing automation stack with MCP servers and local LLMs
A developer shares results from 4 months of running personal automation infrastructure using MCP servers with Qwen 2.5 32B and Llama 3.3 70B models on dual 3090 hardware, detailing what works well and what doesn't.