Gemini 3.1 Pro in Multi-Agent Systems: 20% Tool-Call Failure Rate

Architecture and Testing Context

The team behind Bobr, an AI presentation generator, tested Gemini 3.1 Pro within a two-level agent system. The architecture consists of:

Orchestrator Agent: Handles conversation, understands user intent, plans structure, and dispatches work via tool calls.
Creative Agent (Gemini 3.1 Pro in this test): Receives slide descriptions, generates images, builds templates (1920x1080), and returns results via a submit_slide tool call.

The creative agent's tools include generate_image, search_images, and submit_slide. The submit_slide call is critical: it returns a 'submit' signal that terminates the agent loop, and the slide data is extracted from that call. Both agents run through the same loop, which supports streaming, parallel tool execution, and iteration limits.
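To make the role of submit_slide concrete, here is a minimal Python sketch of a loop where that call is the only exit path. It is not Bobr's implementation: the names ToolCall, ModelTurn, run_creative_agent, and next_turn are hypothetical, and streaming and parallel tool execution are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]


@dataclass
class ModelTurn:
    # One model response: free text plus zero or more tool calls.
    text: str = ""
    tool_calls: list[ToolCall] = field(default_factory=list)


def run_creative_agent(
    next_turn: Callable[[list[Any]], ModelTurn],  # hypothetical model wrapper
    tools: dict[str, Callable[..., Any]],
    max_iterations: int = 8,  # assumed limit, not from the source
) -> Optional[dict[str, Any]]:
    """Loop until submit_slide is called; return None if the limit is hit."""
    history: list[Any] = []
    for _ in range(max_iterations):
        turn = next_turn(history)
        history.append(turn)
        for call in turn.tool_calls:
            if call.name == "submit_slide":
                # The only exit path: slide data comes from the call arguments.
                return call.args
            # Any other tool runs and its result is fed back to the model.
            history.append((call.name, tools[call.name](**call.args)))
    # The model never called submit_slide, so the orchestrator gets nothing.
    return None
```

In this sketch, a turn that contains only text (for example a summary or pasted HTML) does not end the loop, which is why a model that skips the final tool call runs until the iteration limit and returns no slide.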

Strengths: Design and Aesthetic Output

When Gemini 3.1 Pro works correctly, it produces better design output than the other models tested (Claude Sonnet 4.6 and GPT-5.2). Specific strengths include:

Aesthetic intuition: Better color theory and visual hierarchy.
Layout creativity: Experiments with asymmetric compositions, overlapping elements, and modern UI styles like dark-mode/glassmorphism.
Vibe interpretation: Effectively handles vague prompts like "make it feel premium" or "tech startup vibes."
Code quality: Generates modern, structural HTML/CSS.

Critical Problems in Production

The team encountered two major reliability issues with Gemini 3.1 Pro in their agentic pipeline:

1. ~20% Tool-Call Failure Rate

In approximately 20% of requests, Gemini 3.1 Pro fails to call the required submit_slide tool. Instead, it exhibits several failure patterns:

Outputs the raw HTML template as plain text, describing what it "would" create rather than triggering the tool.
Generates images correctly but stops without submitting, hitting iteration limits.
Calls image generation tools but writes natural language summaries ("Here is your beautiful slide...") instead of the final tool call.
Enters loops refining design descriptions in text without committing to action.

Since submit_slide is the hard exit path, failures result in no data returned to the orchestrator and failed user generations.
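One way to track which of the failure patterns above occurs is to label each finished run. The sketch below is illustrative only and is not described in the source: the function classify_run, its labels, and the HTML regex are assumptions, and the checks are rough heuristics.

```python
import re

# Rough hint that the final text contains pasted template markup.
HTML_HINT = re.compile(r"<(html|div|section|style|body)\b", re.IGNORECASE)


def classify_run(tool_names: list[str], final_text: str, hit_limit: bool) -> str:
    """Label a finished creative-agent run with a hypothetical outcome tag."""
    if "submit_slide" in tool_names:
        return "submitted"
    if HTML_HINT.search(final_text):
        # Template written as plain text instead of passed to the tool.
        return "raw_html_as_text"
    if hit_limit:
        # Ran out of iterations without ever submitting.
        return "iteration_limit"
    if tool_names:
        # Used other tools, then ended with prose instead of submit_slide.
        return "summary_instead_of_submit"
    return "text_only"
```

For example, classify_run(["generate_image"], "Here is your beautiful slide...", False) yields "summary_instead_of_submit". Counting these labels over many runs would show how the roughly 20% of failures splits across patterns.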

2. Garbled/Corrupted Output

The model frequently returns corrupted text in responses: random character sequences, broken Unicode, and half-encoded strings. This corruption sometimes bleeds into slide content (variable values, template markup), so even a successful submission can display gibberish in the presentation.
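A simple guard can catch some of this corruption before slide content reaches the user. The following is a heuristic sketch, not something the team reports using: the function looks_garbled and its 2% threshold are assumptions. It flags Unicode replacement characters and unusual control, private-use, or unassigned code points, and it will not catch random sequences made of ordinary letters.

```python
import unicodedata


def looks_garbled(text: str, max_bad_ratio: float = 0.02) -> bool:
    """Heuristic check for broken Unicode in model output."""
    if not text:
        return False
    if "\ufffd" in text:
        # U+FFFD is inserted where bytes failed to decode.
        return True
    bad = sum(
        1
        for ch in text
        # Cc = control, Co = private use, Cn = unassigned.
        if unicodedata.category(ch) in {"Cc", "Co", "Cn"} and ch not in "\n\r\t"
    )
    return bad / len(text) > max_bad_ratio
```

A check like this could run on each variable value and on the template markup in a submission, so that a flagged slide is retried instead of shown.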

Comparison with Other Models

Claude Sonnet 4.6: Near-zero failure rate on submit_slide calls in the same creative agent role, described as "boringly reliable" with no garbled output.
GPT-5.2: Tool reliability falls between Gemini and Claude, with no encoding or gibberish issues.

Attempted Mitigations

The team tried several approaches without significant improvement:

Adding aggressive explicit instructions in system prompts: "You MUST call submit_slide. Do not output the template as text."
Injecting few-shot examples showing exact expected tool-call patterns.
Reducing iteration limits to force faster convergence.
Stripping down and simplifying tool schemas.

Despite these issues, Gemini 3.1 Pro remains live in their system due to its superior design capabilities when it functions correctly.

📖 Read the full source: r/LocalLLaMA