Cull: Open-Source Dataset Curation Engine for AI Image Pipelines

✍️ OpenClawRadar · 📅 Published: May 10, 2026

Cull is a machine curation engine for AI image datasets, built and maintained by u/Compunerd3. It automates the entire pipeline (scraping, classifying, captioning, and sorting) and outputs a folder of triaged images with Stable Diffusion (SD) prompts, ready for LoRA or finetune training.

End-to-End Pipeline

  • Scraping: Supports Civitai (.com and .red), X/Twitter, Reddit, Discord, and any URL gallery-dl supports — Pixiv, DeviantArt, booru family, ArtStation, Tumblr, FurAffinity/e621, Imgur, Flickr, and ~340 others.
  • Queue: Each image is dropped into a local queue together with its source-side prompt. Deduplication is per source, and there is no database.
  • Classification: Uses a vision-language model via multiple LM Studio instances (local) or Groq (cloud); any OpenAI-compatible endpoint works. A strict 17-field JSON schema ensures structured output.
  • Sorting: Keepers go into category folders with a .txt prompt and a .vision.json audit record. Two score gates (quality and topic relevance) are tunable in the UI.
  • Dashboard: Flask + Alpine.js UI with start/stop, source toggles, gallery, prompt editor, ZIP export, and per-source stats.
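
The sorting step can be pictured with a short Python sketch. This is not Cull's code: the threshold values, the record field names (quality, relevance, category), and the helper names passes_gates and file_keeper are all assumptions made for illustration. Only the two gates and the .txt / .vision.json sidecars come from the description above.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical thresholds and field names: the article only says two gates exist.
QUALITY_MIN = 6
RELEVANCE_MIN = 7


def passes_gates(record: dict, quality_min: int = QUALITY_MIN,
                 relevance_min: int = RELEVANCE_MIN) -> bool:
    """Return True only when the record clears both the quality and relevance gates."""
    return (record.get("quality", 0) >= quality_min
            and record.get("relevance", 0) >= relevance_min)


def file_keeper(image: Path, prompt: str, record: dict, out_root: Path) -> Optional[Path]:
    """Copy a keeper into its category folder with .txt and .vision.json sidecars."""
    if not passes_gates(record):
        return None  # rejected images are not filed
    dest_dir = out_root / record.get("category", "uncategorized")
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / image.name
    shutil.copy2(image, dest)
    # Sidecars share the image's stem: cat_001.txt and cat_001.vision.json
    dest.with_suffix(".txt").write_text(prompt, encoding="utf-8")
    dest.with_name(dest.stem + ".vision.json").write_text(
        json.dumps(record, indent=2), encoding="utf-8")
    return dest


# Small demo in a throwaway directory with a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "queue" / "cat_001.jpg"
    src.parent.mkdir()
    src.write_bytes(b"placeholder bytes, not a real JPEG")
    good = {"quality": 8, "relevance": 9, "category": "cats"}
    weak = {"quality": 3, "relevance": 9, "category": "cats"}
    print(file_keeper(src, "a cat on a sofa", good, root / "sorted"))
    print(file_keeper(src, "a cat on a sofa", weak, root / "sorted"))
```

Requiring both gates means an image that is sharp but off-topic, or on-topic but low quality, never reaches the training folder.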

Use Cases

The author used Cull for a 300-image LoRA and a 100,000-image finetune dataset. Set a topic (e.g., "Female Influencer" or {artist} style art), toggle AUTO_CAPTION_ENABLED, and walk away. For prompt-less archives, point LOCAL_IMPORT_DIR at a folder of JPEGs, turn off the prompt requirement, and turn on auto-captioning; each image then gets an SD prompt, booru tags, or a natural-language caption.
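
Before pointing LOCAL_IMPORT_DIR at an archive, it can help to check how many images actually lack a prompt. The Python sketch below is a standalone helper, not part of Cull: the function name find_uncaptioned, the extension set, and the assumption that a prompt lives in a same-named .txt file next to the image are all illustrative.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}  # assumed set of extensions


def find_uncaptioned(import_dir: Path) -> list:
    """List images under import_dir that have no same-named .txt prompt beside them."""
    return sorted(
        p for p in import_dir.rglob("*")
        if p.is_file()
        and p.suffix.lower() in IMAGE_SUFFIXES
        and not p.with_suffix(".txt").exists()
    )


# Demo on a throwaway folder: one captioned image, one without a prompt.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "with_prompt.jpg").write_bytes(b"placeholder")
    (folder / "with_prompt.txt").write_text("a lighthouse at dusk", encoding="utf-8")
    (folder / "no_prompt.jpg").write_bytes(b"placeholder")
    for path in find_uncaptioned(folder):
        print(path.name)
```

If the list is long, running the import with auto-captioning on is likely the better fit; if it is short, the missing prompts may be quicker to write by hand.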

Technical Details

  • Pluggable vision worker: Subclass BaseVisionWorker and register it. Two LM Studio endpoints run in parallel; a keepalive worker pings every 15s to avoid idle-unload, and an optional idle-unloader frees VRAM.
  • AI assistant integration: Ships with a Claude Code skill bundle in .claude/skills/ (cull-helper, lmstudio-vision, metadata-schema) and three sub-agents; works with Claude Code, Cursor, Aider, and Codex.
  • Self-updater: A toast appears in the dashboard; clicking Update pulls from origin/main and relaunches.
  • Stack: Python 3.10+, Flask, Alpine.js, Pillow, Playwright (X scraper), gallery-dl. Single machine, no Redis, no DB, no Docker.
  • License: MIT.
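
The "subclass and register" pattern for vision workers might look roughly like the Python sketch below. The article does not show Cull's real BaseVisionWorker interface, so the class here is a stand-in with the same name, and the classify method, the register decorator, the WORKERS registry, and make_worker are hypothetical; consult the repo for the actual API.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Type

# Hypothetical registry mapping a backend name to its worker class.
WORKERS: Dict[str, Type["BaseVisionWorker"]] = {}


class BaseVisionWorker(ABC):
    """Stand-in base class; Cull's real interface is not shown in the article."""

    @abstractmethod
    def classify(self, image_path: str) -> dict:
        """Return a structured record for one image."""


def register(name: str) -> Callable[[Type[BaseVisionWorker]], Type[BaseVisionWorker]]:
    """Class decorator that adds a worker to the registry under `name`."""
    def decorator(cls: Type[BaseVisionWorker]) -> Type[BaseVisionWorker]:
        WORKERS[name] = cls
        return cls
    return decorator


@register("stub")
class StubWorker(BaseVisionWorker):
    """Offline worker that returns a fixed record, useful for dry runs."""

    def classify(self, image_path: str) -> dict:
        return {"image": image_path, "quality": 0, "relevance": 0,
                "category": "uncategorized"}


def make_worker(name: str) -> BaseVisionWorker:
    """Instantiate a registered worker by name."""
    return WORKERS[name]()


worker = make_worker("stub")
print(worker.classify("example.jpg"))
```

A registry like this keeps the pipeline unaware of which backend is in use: a new backend only needs its own subclass and a registered name.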

Roadmap

Planned: more vision-worker backends, an improved requeue UI, a small headless CLI, and video scraping and classification.

Repo: https://github.com/tlennon-ie/cull | Screenshots: https://imgur.com/a/kSvsAW9

📖 Read the full source: r/LocalLLaMA
