Building a Discord Cat Monitoring Bot with ESP32-S3, MiniClaw, and Multimodal AI

✍️ OpenClawRadar📅 Published: March 8, 2026🔗 Source
Building a Discord Cat Monitoring Bot with ESP32-S3, MiniClaw, and Multimodal AI
Ad

Edge Agent Setup for Cat Monitoring

A developer created a Discord bot that monitors their cat using an ESP32-S3 Sense as an edge agent. The system captures photos or records audio when triggered via Discord mentions, then sends the media to a multimodal LLM for analysis.

Hardware and Software Stack

The implementation uses specific components:

  • Hardware: XIAO ESP32-S3 Sense (Vision version) - small enough to hide in a cat tree
  • Communication: Web UI + WebSocket setup for low-latency debugging
  • AI Model: Zhipu AI's VLM-4V multimodal model
  • Platform: Discord for bot interaction

How It Works

The workflow is straightforward: when someone @mentions the bot on Discord, the ESP32-S3 either snaps a photo or records audio. This media gets sent to the VLM (Vision-Language Model), which analyzes it and returns natural language descriptions of what's happening. Instead of getting "Motion Detected" spam, users receive specific descriptions like "Your cat is sleeping on the couch" or "Cat is playing with a toy."

Ad

Current Limitations and Future Plans

The developer identified several areas for improvement:

  • Image Quality: Current captures are "pretty blurry" and "mediocre" but functional
  • Fixed Position: The device has a fixed POV - considering adding mobility via servo brackets or rover mechanics
  • Audio Intelligence: Planning to add vocalization classification to distinguish between hungry meows, zoomies, or general yelling

The developer notes the implementation was "surprisingly straightforward" and works better than expected, with the VLM analysis being "surprisingly spot-on" despite the blurry image quality.

📖 Read the full source: r/openclaw

Ad

👀 See Also