SPLICE Benchmark Reveals VLMs Struggle with Temporal Reasoning, Rely on Language Priors

SPLICE Benchmark Results
The SPLICE benchmark tests temporal, causal, spatial, contextual, and commonsense reasoning by having models reconstruct the correct order of shuffled video clips. The research, co-authored by the source poster, was published at EMNLP 2025.
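To make the task concrete, here is a minimal sketch of how a clip-reordering prediction could be scored. It assumes an exact-match metric (a prediction counts only if every clip is in the right position), which the summary above does not confirm. The function names are hypothetical and are not taken from the SPLICE codebase.

```python
from math import factorial


def exact_match_accuracy(predictions):
    """Fraction of predictions that restore the original clip order.

    Each prediction is a list of original clip indices (0..n-1) in the
    order the model believes the clips occurred. Assumption: a prediction
    is correct only if it is fully sorted, i.e. every clip is in place.
    """
    correct = sum(1 for p in predictions if list(p) == sorted(p))
    return correct / len(predictions)


def random_exact_match_chance(n_clips):
    """Chance of guessing the full order of n_clips uniformly at random."""
    return 1 / factorial(n_clips)


# Two hypothetical predictions for a 3-clip video: one right, one wrong.
preds = [[0, 1, 2], [1, 0, 2]]
print(exact_match_accuracy(preds))
print(random_exact_match_chance(3))
```

Under this assumed metric the random baseline shrinks factorially with the number of clips, so longer videos are far harder to order by guessing.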
Model Performance Details
Tested models included Gemini Flash (1.5 and 2.0), Qwen2-VL (7B and 72B), InternVL2.5, and LLaVA-OneVision. Gemini 2.0 Flash scored 51% on the vision-only task, while human performance was 85%. Open-source models struggled significantly:
- LLaVA-OneVision-72B scored barely above random guessing in the vision-only setting
- InternVL2.5-78B performed similarly poorly
- Qwen2-VL-72B reached only around 30% in the vision-only setting
- Qwen2-VL-7B performed on par with the 72B variant, suggesting scaling the language model doesn't help when the bottleneck is in the vision encoder
Language Prior Dependency
When human-written text annotations describing clip content were added, model performance jumped significantly while human performance remained unchanged. This indicates models rely on language priors to compensate for weak visual understanding. Notably, Qwen2-VL-72B outperformed Gemini on text-only reasoning.
Visual Shortcut Behavior
Models demonstrated problematic reasoning patterns. When the first and last video clips looked visually similar (such as opening and closing a printer door), models predicted that those clips were adjacent 57% of the time, compared with 2.5% for humans and 27% expected by random chance. This suggests models are pattern matching on visual similarity rather than reasoning about events.
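The adjacency statistic can be illustrated with a short sketch. It measures how often a set of predicted orderings places the true first and last clips next to each other, and computes the matching random-chance rate by enumerating every permutation. The helper names are hypothetical, and the per-video clip counts behind the reported 27% baseline are not given in the source, so the numbers here are illustrative only.

```python
from itertools import permutations


def are_adjacent(order, a, b):
    """True if clips a and b sit next to each other in a predicted order."""
    return abs(order.index(a) - order.index(b)) == 1


def first_last_adjacency_rate(orders):
    """Share of predicted orders placing the true first and last clips together.

    Each order is a list of original clip indices 0..n-1, so the true
    first clip is 0 and the true last clip is n-1.
    """
    hits = sum(are_adjacent(o, 0, len(o) - 1) for o in orders)
    return hits / len(orders)


def random_adjacency_chance(n_clips):
    """Same rate when every ordering of n_clips is equally likely."""
    all_orders = [list(p) for p in permutations(range(n_clips))]
    return first_last_adjacency_rate(all_orders)


# Hypothetical predictions for 4-clip videos.
preds = [[1, 0, 3, 2], [0, 1, 2, 3], [3, 0, 1, 2]]
print(first_last_adjacency_rate(preds))
print(random_adjacency_chance(4))
```

For a uniformly random ordering of n clips, any two specific clips are adjacent with probability 2/n, which the enumeration reproduces. A model rate well above that baseline, as reported above, points to a systematic bias rather than noise.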
Testing Limitations and Future Work
The research didn't test Claude (which doesn't support video input) or OpenAI models (which couldn't handle multi-video input reliably at testing time). The dataset is public, and the poster notes newer models like Gemini 3 Flash and Qwen3-VL (with native 256K interleaved context, enhanced spatial-temporal modeling, and MoE variants up to 235B) should be tested on SPLICE to see if language prior issues persist. Preliminary testing suggests the language prior problem remains, though statistical significance hasn't been established across all experimental samples.
📖 Read the full source: r/LocalLLaMA