Running a 6-agent behavioral coaching pipeline on self-hosted Qwen3 235B with vLLM

✍️ OpenClawRadar📅 Published: April 1, 2026🔗 Source
Running a 6-agent behavioral coaching pipeline on self-hosted Qwen3 235B with vLLM
Ad

Multi-agent behavioral coaching system

A developer has implemented a 6-agent cognitive pipeline for behavioral coaching that runs entirely on self-hosted Qwen3 models via vLLM. The system uses Claude Code instances as agents calling a vLLM endpoint, with four specialist agents firing simultaneously on each user message.

Hardware and setup

  • Development: Qwen3 30B on 2x RTX 4090s
  • Production: Qwen3 235B on RunPod A40 pods
  • All 6 agents are Claude Code instances calling the vLLM endpoint

Pipeline architecture

Each user message triggers 6 agents in sequence:

  • Shadow - Runs first, writes cross-session behavioral patterns to a shared blackboard (stated goals vs revealed priorities, follow-through prediction, pattern classification)
  • Persona - OCEAN scoring, recurring goal detection, follow-through prediction percentages, growth edge identification
  • Plasticity - Personality-informed coaching strategy, maps OCEAN scores to communication preferences
  • Stability - Risk framework with severity/detectability/reversibility ratings, identifies blocked moves the coach should not suggest
  • Coach - Fires early for an immediate response while the other agents process (~seconds)
  • Synth (Pineal) - Merges all worker outputs, applies voice calibration, delivers the full response
Ad

Performance characteristics

The user sees an immediate Coach response, then the full synthesis appends approximately 40 seconds later on 2x RTX 4090s. On the A40 configuration, this takes about 108 seconds - counterintuitively slower due to different memory architecture.

Key implementation insights

What worked:

  • Parallel dispatch is the key unlock for performance
  • Shadow must write first because synthesis needs the blackboard content to aggregate correctly
  • The sequencing logic to guarantee Shadow completes before Synth picks up adds meaningful complexity but is non-negotiable
  • Context management at 235B scale is expensive - each agent gets a full context brief plus session history
  • Aggressive compaction between sessions and tight per-agent context budgets have been the main reliability levers

What is hard:

  • Getting agents to write structured output reliably enough for synthesis to aggregate without hallucinating merge artifacts
  • Main failure mode: Synth seeing conflicting signals from Persona and Stability on the same session

The developer is seeking input from others running multi-agent systems on self-hosted inference, particularly regarding parallelism strategies at 235B scale.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also

Validating Product Ideas with Claude Code and Remotion Demos
Use Cases

Validating Product Ideas with Claude Code and Remotion Demos

A developer used Claude Code and Remotion to build a 60-second concept demo for a TypeScript YouTube MCP tool before writing any production code, spending about 2 hours total. The demo validated the idea by showing semantic search across 50 lectures with sqlite-vec and no API key requirement.

OpenClawRadar
OpenClaw and Chorus: A Product Pipeline Built by Two Humans and AI Agents in One Week
Use Cases

OpenClaw and Chorus: A Product Pipeline Built by Two Humans and AI Agents in One Week

OpenClaw and Chorus combine to create a product development pipeline where AI agents handle research, product management, and coding while humans propose ideas and approve work. The system was built in under a week by two people with day jobs using OpenClaw as a persistent product manager agent.

OpenClawRadar
OpenClaw workflow automates meeting follow-ups, replaces Granola for user
Use Cases

OpenClaw workflow automates meeting follow-ups, replaces Granola for user

A user replaced their $14/month Granola subscription with an OpenClaw workflow that transcribes meetings via STT, generates summaries on WhatsApp, breaks out action items, and creates draft follow-up emails automatically.

OpenClawRadar
Local LLM Pipeline Context Drift Issue in Multi-Step Agentic Work
Use Cases

Local LLM Pipeline Context Drift Issue in Multi-Step Agentic Work

A developer running a multi-step job search automation pipeline on Llama-3.3-70b-versatile found local Ollama models struggled with context coherence across 5-6 node pipelines, while Groq's free tier with Claude performed better. The developer also noted free tier models get retired without warning, breaking configurations.

OpenClawRadar