Build a Discord Cat Bot with ESP32-S3 & VLM-4V AI

Edge Agent Setup for Cat Monitoring

A developer created a Discord bot that monitors their cat using an ESP32-S3 Sense as an edge agent. The system captures photos or records audio when triggered via Discord mentions, then sends the media to a multimodal LLM for analysis.

Hardware and Software Stack

The implementation uses specific components:

Hardware: XIAO ESP32-S3 Sense (Vision version) - small enough to hide in a cat tree
Communication: Web UI + WebSocket setup for low-latency debugging
AI Model: Zhipu AI's VLM-4V multimodal model
Platform: Discord for bot interaction

How It Works

The workflow is straightforward: when someone @mentions the bot on Discord, the ESP32-S3 either snaps a photo or records audio. This media gets sent to the VLM (Vision-Language Model), which analyzes it and returns natural language descriptions of what's happening. Instead of getting "Motion Detected" spam, users receive specific descriptions like "Your cat is sleeping on the couch" or "Cat is playing with a toy."

Current Limitations and Future Plans

The developer identified several areas for improvement:

Image Quality: Current captures are "pretty blurry" and "mediocre" but functional
Fixed Position: The device has a fixed POV - considering adding mobility via servo brackets or rover mechanics
Audio Intelligence: Planning to add vocalization classification to distinguish between hungry meows, zoomies, or general yelling

The developer notes the implementation was "surprisingly straightforward" and works better than expected, with the VLM analysis being "surprisingly spot-on" despite the blurry image quality.

📖 Read the full source: r/openclaw