VOID: Netflix's Video Object Removal Model Released

What VOID Does

VOID removes objects from videos together with the interactions they induce in the scene: not only secondary effects such as shadows and reflections, but also physical interactions, such as objects falling when a person is removed.

Technical Requirements

Requires a GPU with 40GB+ VRAM (e.g., A100)
Built on CogVideoX-Fun-V1.5-5b-InP
Fine-tuned for video inpainting with interaction-aware quadmask conditioning
Quadmask is a 4-value mask that encodes: primary object (remove), overlap regions, affected regions (falling objects, displaced items), and background (keep)
Resolution: 384x672 (default)
Max frames: 197
Scheduler: DDIM
Precision: BF16 with FP8 quantization for memory efficiency
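
As a rough illustration of why FP8 quantization helps, the sketch below estimates the memory taken by the transformer weights alone at each precision. It assumes a round figure of 5 billion parameters (the article says only "5B") and ignores the VAE, text encoder, activations, and any framework overhead, so it is a lower bound and not a measurement of VOID itself.

```python
# Weight-only memory estimate; PARAMS is an assumed round number, not an exact count.
PARAMS = 5_000_000_000
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}

def weight_gib(num_params, dtype):
    """Return the size of the weights in GiB for the given storage dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_gib(PARAMS, dtype):.1f} GiB")
```

Under these assumptions the weights come to roughly 9.3 GiB in BF16 and 4.7 GiB in FP8. The rest of the 40GB+ requirement presumably goes to the other model components and to activations for long, high-resolution clips.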

Model Files

void_pass1.safetensors - Base inpainting model (required)
void_pass2.safetensors - Warped-noise refinement for temporal consistency (optional)

Pass 1 is sufficient for most videos. Pass 2 adds optical flow-warped latent initialization for improved temporal consistency on longer clips.

Quick Start

The included notebook handles setup, downloads models, runs inference on a sample video, and displays the result.

git clone https://github.com/netflix/void-model.git
cd void-model

CLI Usage

# Install dependencies
pip install -r requirements.txt

# Download the base model
huggingface-cli download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP --local-dir ./CogVideoX-Fun-V1.5-5b-InP

# Download VOID checkpoints
huggingface-cli download netflix/void-model --local-dir .

# Run Pass 1 inference on a sample
python inference/cogvideox_fun/predict_v2v.py \
  --config config/quadmask_cogvideox.py \
  --config.data.data_rootdir="./sample" \
  --config.experiment.run_seqs="lime" \
  --config.experiment.save_path="./outputs" \
  --config.video_model.transformer_path="./void_pass1.safetensors"
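
When running several sequences, the inference command above can be assembled in Python instead of typed by hand. The sketch below uses a hypothetical helper, build_pass1_command, which is not part of the repo. It reuses only the script path and config flags shown in this section and assumes each override is accepted in --key=value form.

```python
import shlex
import sys

def build_pass1_command(sequence, data_root="./sample", save_path="./outputs",
                        checkpoint="./void_pass1.safetensors"):
    """Return the argv list for one Pass 1 run (hypothetical helper)."""
    # Flag names are copied from the CLI example; the defaults mirror its values.
    overrides = {
        "config.data.data_rootdir": data_root,
        "config.experiment.run_seqs": sequence,
        "config.experiment.save_path": save_path,
        "config.video_model.transformer_path": checkpoint,
    }
    command = [
        sys.executable,
        "inference/cogvideox_fun/predict_v2v.py",
        "--config", "config/quadmask_cogvideox.py",
    ]
    command += [f"--{key}={value}" for key, value in overrides.items()]
    return command

if __name__ == "__main__":
    # Print the command rather than running it, so it can be inspected first.
    print(shlex.join(build_pass1_command("lime")))
```

Returning a list rather than a single string means the result can be passed directly to subprocess.run without shell quoting issues.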

Input Format

Each video needs three files in a folder:

input_video.mp4 - source video
quadmask_0.mp4 - 4-value mask (0=remove, 63=overlap, 127=affected, 255=keep)
prompt.json - {"bg": "description of scene after removal"}
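
Before launching a long inference run, it can help to confirm that a sequence folder matches the layout above. The following is a minimal sketch of such a check. check_sequence_folder is a hypothetical helper, not something the repo provides, and it only verifies that the three files exist and that prompt.json contains a string "bg" field.

```python
import json
from pathlib import Path

REQUIRED_FILES = ("input_video.mp4", "quadmask_0.mp4", "prompt.json")

def check_sequence_folder(folder):
    """Return a list of problems found in one sequence folder (empty if none)."""
    folder = Path(folder)
    problems = [f"missing {name}" for name in REQUIRED_FILES
                if not (folder / name).is_file()]
    prompt_path = folder / "prompt.json"
    if prompt_path.is_file():
        try:
            prompt = json.loads(prompt_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"prompt.json is not valid JSON: {exc}")
        else:
            # The article documents a single key, "bg", holding the scene description.
            if not isinstance(prompt, dict) or not isinstance(prompt.get("bg"), str):
                problems.append('prompt.json needs a string "bg" field')
    return problems
```

For example, check_sequence_folder("./sample/lime") would return an empty list for a well-formed folder, assuming sequences live in subfolders of the data root as the CLI example suggests.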

The repo includes a mask generation pipeline (VLM-MASK-REASONER/) that creates quadmasks from raw videos using SAM2 + Gemini.
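
To make the four mask values concrete, here is a small NumPy sketch that composes one quadmask frame from two boolean masks. It is not the repo's pipeline: compose_quadmask is a hypothetical function, and it assumes "overlap" means pixels covered by both the primary object and an affected region, which the article does not spell out.

```python
import numpy as np

# Pixel values as listed in the input format.
REMOVE, OVERLAP, AFFECTED, KEEP = 0, 63, 127, 255

def compose_quadmask(primary, affected):
    """Combine two boolean HxW masks into one uint8 quadmask frame."""
    primary = np.asarray(primary, dtype=bool)
    affected = np.asarray(affected, dtype=bool)
    if primary.shape != affected.shape:
        raise ValueError("masks must have the same shape")
    frame = np.full(primary.shape, KEEP, dtype=np.uint8)  # background by default
    frame[affected] = AFFECTED
    frame[primary] = REMOVE
    # Assumption: pixels in both masks are the "overlap" class.
    frame[primary & affected] = OVERLAP
    return frame

if __name__ == "__main__":
    primary = np.zeros((4, 6), dtype=bool)
    affected = np.zeros((4, 6), dtype=bool)
    primary[1:3, 1:3] = True   # object to remove
    affected[1:3, 2:5] = True  # region it disturbs, partly overlapping
    print(compose_quadmask(primary, affected))
```

In the toy example, the column where the two rectangles intersect receives 63, the rest of the object 0, the rest of the affected region 127, and everything else 255.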

Training Details

Trained on paired counterfactual videos generated from two sources: HUMOTO (human-object interactions rendered in Blender with physics simulation) and Kubric (object-only interactions using Google Scanned Objects)
Training was run on 8x A100 80GB GPUs using DeepSpeed ZeRO Stage 2

Architecture

Base: CogVideoX 3D Transformer (5B parameters)
Input: Video + quadmask + text prompt describing the scene after removal
