civStation: A VLM Plays Civilization VI via Natural Language

What civStation Does

civStation is a vision-language model (VLM) system that enables playing Civilization VI through natural language commands. Instead of direct mouse/keyboard control, users issue high-level strategic intents that the system translates into actual game actions.

Architecture and Functionality

The system employs a 3-layer architecture:

Strategy Layer: Converts natural language commands into structured goals, maintains long-term direction, and performs task decomposition. Commands like "expand to the east," "focus on economy," or "aim for a science victory" are processed here.
Action Layer: Uses a screen-based VLM to interpret the game state from what is on screen, then executes mouse/keyboard actions without accessing game APIs.
HITL (human-in-the-loop) Layer: Enables real-time human intervention, override capabilities, and controllable autonomy.
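
As a rough illustration of how the three layers could fit together, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the class names, the PLAYBOOK lookup table (a stand-in for model-driven decomposition), and the describe_screen and review callables are hypothetical and are not taken from civStation itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Goal:
    """Structured goal produced by the strategy layer (hypothetical shape)."""
    intent: str
    subtasks: list[str] = field(default_factory=list)


class StrategyLayer:
    """Turns a natural language command into a structured goal."""

    # Toy lookup table standing in for a model-based decomposition step.
    PLAYBOOK = {
        "focus on economy": ["build trade route", "work gold tiles"],
        "expand to the east": ["train settler", "move settler east", "found city"],
    }

    def plan(self, command: str) -> Goal:
        key = command.strip().lower()
        return Goal(intent=key, subtasks=list(self.PLAYBOOK.get(key, [])))


class ActionLayer:
    """Interprets the screen and emits a mouse/keyboard action description."""

    def __init__(self, describe_screen: Callable[[], str]):
        # describe_screen is a stand-in for a VLM call on a screenshot.
        self.describe_screen = describe_screen

    def execute(self, subtask: str) -> str:
        state = self.describe_screen()
        return f"[{state}] click/keys for: {subtask}"


class HITLLayer:
    """Lets a human approve, rewrite, or skip (None) each subtask."""

    def __init__(self, review: Callable[[str], Optional[str]]):
        self.review = review

    def filter(self, subtask: str) -> Optional[str]:
        return self.review(subtask)


def run(command: str, strategy: StrategyLayer,
        action: ActionLayer, hitl: HITLLayer) -> list[str]:
    """Strategy plans, the human filters, the action layer executes."""
    log = []
    for subtask in strategy.plan(command).subtasks:
        approved = hitl.filter(subtask)
        if approved is None:  # the human vetoed this step
            continue
        log.append(action.execute(approved))
    return log
```

Calling run("expand to the east", ...) with a review function that returns None for "found city" would execute the first two subtasks and skip the third, which is the kind of override the HITL layer is described as providing.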

Technical Implementation Details

A single strategic command expands into multiple action sequences, and each task requires approximately 2–16 model calls. Bounded tasks such as city management and unit control are handled by dedicated sub-agents.
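
One way to picture sub-agent execution with a bounded number of model calls is the sketch below. It is illustrative only: the agent functions, the CountingModel wrapper, and the idea of enforcing the upper end of the 2–16 range as a hard cap are assumptions, not details confirmed by the source.

```python
from typing import Callable

# Upper end of the range quoted above; treating it as a hard cap is an assumption.
MAX_CALLS_PER_TASK = 16


class CallBudgetExceeded(RuntimeError):
    """Raised when a task tries to exceed its model-call budget."""


class CountingModel:
    """Wraps a model callable, counts calls, and enforces a per-task limit."""

    def __init__(self, model: Callable[[str], str], limit: int = MAX_CALLS_PER_TASK):
        self.model = model
        self.limit = limit
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        if self.calls >= self.limit:
            raise CallBudgetExceeded(f"limit of {self.limit} calls reached")
        self.calls += 1
        return self.model(prompt)


def city_agent(task: str, model: Callable[[str], str]) -> list[str]:
    """Hypothetical city-management agent: one planning call, one check."""
    return [model(f"plan city task: {task}"), model(f"verify city task: {task}")]


def unit_agent(task: str, model: Callable[[str], str]) -> list[str]:
    """Hypothetical unit-control agent: one call per UI step."""
    steps = ("select unit", "choose target", "confirm move")
    return [model(f"{step}: {task}") for step in steps]


SUB_AGENTS = {"city": city_agent, "unit": unit_agent}


def run_task(kind: str, task: str, model: Callable[[str], str]) -> tuple[list[str], int]:
    """Dispatch a bounded task to its sub-agent and report calls used."""
    counted = CountingModel(model)
    outputs = SUB_AGENTS[kind](task, counted)
    return outputs, counted.calls
```

With a stub model, run_task("city", "grow capital", stub) uses two calls and run_task("unit", "scout north", stub) uses three, so the per-task count becomes something that can be logged and capped.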

civStation explores shifting the interface from action to intent ("action → intent"), as an alternative to traditional reinforcement learning, imitation learning, or scripted approaches. This is a move from direct manipulation toward delegation and agent orchestration.

Key Challenges and Limitations

The system faces several technical challenges:

VLM perception errors
Execution drift
Lack of reliable verification mechanisms

Multi-step execution introduces latency and API cost trade-offs, and the fallback strategies it relies on degrade performance. The system is not fully autonomous: it supports human-in-the-loop operation for real-time strategy correction and control.
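
To make the verification and fallback trade-off concrete, here is a small sketch of an act, observe, check loop that retries a bounded number of times and then hands control to a human. This is a generic pattern rather than civStation's actual mechanism; the function name, its parameters, and the result dictionary are all hypothetical. Each retry implies at least one more screen read, which is where the extra latency and API cost would come from.

```python
from typing import Any, Callable, Optional


def execute_with_verification(
    act: Callable[[], None],
    observe: Callable[[], Any],
    expected: Callable[[Any], bool],
    max_attempts: int = 3,
    ask_human: Optional[Callable[[], str]] = None,
) -> dict:
    """Run an action, re-read the screen, and retry until the check passes.

    act       performs the mouse/keyboard action (hypothetical callable)
    observe   re-reads the game state, e.g. via a VLM on a fresh screenshot
    expected  returns True when the observed state matches the goal
    ask_human optional fallback that asks the human operator what to do
    """
    for attempt in range(1, max_attempts + 1):
        act()
        if expected(observe()):
            return {"status": "ok", "attempts": attempt}

    # All attempts failed: escalate to the human if one is available.
    if ask_human is not None:
        return {"status": "escalated", "attempts": max_attempts,
                "human": ask_human()}
    return {"status": "failed", "attempts": max_attempts}
```

The check itself depends on the same VLM perception that may have caused the error, so a loop like this bounds drift rather than eliminating it, which is consistent with the verification challenge listed above.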

Broader Implications

This experimental system tackles agent control and verification in UI-only environments. Its focus extends beyond gameplay: by raising the human-system interface to the strategy level, it lets users work at a higher level of abstraction instead of managing individual actions.

📖 Read the full source: r/ClaudeAI