Cull: Open-Source Dataset Curation Engine for AI Image Pipelines

Cull is a machine curation engine for AI image datasets, built and maintained by u/Compunerd3. It automates the entire pipeline: scraping, classifying, captioning, and sorting — outputting a folder of triaged images with SD prompts ready for LoRA or finetune training.
End-to-End Pipeline
- Scraping: Supports Civitai (.com and .red), X/Twitter, Reddit, Discord, and any URL gallery-dl supports — Pixiv, DeviantArt, booru family, ArtStation, Tumblr, FurAffinity/e621, Imgur, Flickr, and ~340 others.
- Queue: Each image and its source-side prompt are dropped into a local queue. Deduplication is per source, with no database.
- Classification: Uses a vision-language model via multiple LM Studio instances (local) or Groq (cloud) — any OpenAI-compatible endpoint. Strict 17-field JSON schema ensures structured output.
- Sorting: Keepers go into category folders with a .txt prompt and a .vision.json audit record. Two score gates (quality + topic relevance) tunable in the UI.
- Dashboard: Flask + Alpine.js UI with start/stop, source toggles, gallery, prompt editor, ZIP export, and per-source stats.
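The sorting step above can be pictured roughly as follows. This is a minimal sketch, not Cull's actual code: the function name `sort_keeper`, the record keys `quality_score`, `topic_score`, and `category`, and the default thresholds are all assumptions made for illustration. Only the two gates, the category folders, and the `.txt` / `.vision.json` sidecars come from the description.

```python
import json
import shutil
from pathlib import Path


def sort_keeper(image_path, prompt, record, out_dir, min_quality=6.0, min_topic=6.0):
    """Copy an image into its category folder if it passes both score gates.

    Returns the destination path for a keeper, or None for a reject.
    The record keys used here are hypothetical, not Cull's real schema.
    """
    # Gate 1 (quality) and gate 2 (topic relevance): both must pass.
    if record["quality_score"] < min_quality or record["topic_score"] < min_topic:
        return None

    image_path = Path(image_path)
    dest_dir = Path(out_dir) / record["category"]
    dest_dir.mkdir(parents=True, exist_ok=True)

    dest = dest_dir / image_path.name
    shutil.copy2(image_path, dest)

    # Sidecars: the training prompt and the full classifier output for auditing.
    (dest_dir / f"{image_path.stem}.txt").write_text(prompt, encoding="utf-8")
    (dest_dir / f"{image_path.stem}.vision.json").write_text(
        json.dumps(record, indent=2), encoding="utf-8"
    )
    return dest
```

Raising either threshold shrinks the keeper set without re-running the classifier, which is presumably why the gates are exposed as tunables in the UI.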
Use Cases
The author used Cull for a 300-image LoRA and a 100,000-image finetune dataset. To run it, set a topic (e.g., "Female Influencer" or {artist} style art), toggle AUTO_CAPTION_ENABLED, and walk away. For prompt-less archives, point LOCAL_IMPORT_DIR at a folder of JPEGs, turn off the prompt requirement, and turn on auto-captioning; each image then gets an SD prompt, booru tags, or a natural-language caption.
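For the prompt-less import path, a rough sketch of scanning a folder and skipping duplicates without a database might look like this. Everything here is assumed for illustration: the function name, the choice of SHA-256 content hashes as the dedup key, the entry fields, and the plain-text `seen.txt` file. The source only says that dedup is per source and that no database is involved.

```python
import hashlib
from pathlib import Path


def import_local_folder(import_dir, seen_file):
    """Return queue entries for new JPEGs in import_dir, skipping known content.

    Hypothetical helper: dedup is keyed on a SHA-256 of the file bytes, and the
    set of seen hashes is persisted as one hex digest per line in seen_file.
    """
    seen_file = Path(seen_file)
    seen = set(seen_file.read_text().split()) if seen_file.exists() else set()

    entries = []
    for path in sorted(Path(import_dir).iterdir()):
        if not path.is_file() or path.suffix.lower() not in {".jpg", ".jpeg"}:
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            continue  # same bytes already queued, now or in an earlier run
        seen.add(digest)
        # No source-side prompt exists, so the entry is marked for captioning.
        entries.append({"path": str(path), "prompt": None, "needs_caption": True})

    seen_file.write_text("\n".join(sorted(seen)) + "\n")
    return entries
```

Because the seen-set lives in a flat file, a second run over the same folder returns no entries, which gives restart-safe dedup without any database.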
Technical Details
- Vision worker pluggable: Subclass BaseVisionWorker, register. Two LM Studio endpoints run in parallel; keepalive worker pings every 15s to avoid idle-unload; optional idle-unloader to free VRAM.
- AI assistant integration: Ships with Claude Code skill bundle in .claude/skills/ (cull-helper, lmstudio-vision, metadata-schema) and three sub-agents; works with Claude Code, Cursor, Aider, Codex.
- Self-updater: Toast in dashboard, click Update, pulls from origin/main and relaunches.
- Stack: Python 3.10+, Flask, Alpine.js, Pillow, Playwright (X scraper), gallery-dl. Single machine, no Redis, no DB, no Docker.
- License: MIT.
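The plug-in pattern for vision workers can be sketched as below. The source names `BaseVisionWorker` but does not describe its interface, so the class here is a stand-in: the `classify` method, the `register_worker` decorator, the `WORKERS` registry, and the `EchoWorker` example are all hypothetical, and Cull's real registration mechanism may look different.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping a backend name to its worker class.
WORKERS = {}


def register_worker(name):
    """Class decorator that records a worker class under the given name."""
    def decorator(cls):
        WORKERS[name] = cls
        return cls
    return decorator


class BaseVisionWorker(ABC):
    """Stand-in base class; the real interface is not given in the source."""

    @abstractmethod
    def classify(self, image_bytes):
        """Return a dict of classification fields for one image."""


@register_worker("echo")
class EchoWorker(BaseVisionWorker):
    """Toy backend that reports only the image size, useful for dry runs."""

    def classify(self, image_bytes):
        return {"backend": "echo", "size_bytes": len(image_bytes)}


def make_worker(name):
    """Instantiate a registered worker, failing loudly on unknown names."""
    try:
        return WORKERS[name]()
    except KeyError:
        raise ValueError(f"unknown vision worker: {name!r}") from None
```

With a registry like this, adding a backend means writing one subclass and decorating it; the rest of the pipeline only ever calls `make_worker(name).classify(...)`.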
Roadmap
Planned: more vision-worker backends, improved requeue UI, small headless CLI, video scraping and classification.
Repo: https://github.com/tlennon-ie/cull | Screenshots: https://imgur.com/a/kSvsAW9
📖 Read the full source: r/LocalLLaMA