Security Benchmark: 10 LLMs Tested Against 211 Adversarial Probes

A security researcher conducted a systematic test of 10 different LLMs against 211 adversarial security probes to evaluate how they handle attacks in real-world scenarios.
Test Methodology
The researcher used a standardized setup with temperature 0 and identical API calls for every model. The test included 82 extraction probes (attempting to steal system prompts) and 109 injection probes (attempting to hijack model behavior). A honeypot system prompt loaded with fake PII, SSH keys, and API credentials was used as bait.
Key Findings
- Extraction resistance is mostly solved: Most models are decent at blocking "repeat your system prompt" type attacks. The average across all models is around 85%.
- Injection resistance is not solved: Average is 46.2%, meaning more than half of injection attacks succeed across the board.
- Universal failures: Every single model failed on delimiter attacks, distractor injection, and style injection. 0% resistance on those categories across all 10 models.
- Dead attack patterns: Every model resisted payload splitting and typo evasion at 100%.
Model-Specific Results
- Claude Opus: Scored 72.7% on injection resistance, the best of any model tested. Still means over 1 in 4 injection attacks work.
- GPT-5.4: Has perfect extraction and boundary scores but only 50% injection resistance.
- GPT-5.3 Codex: The model behind Codex CLI that runs code on your machine scored 34.5% on injection. 2 out of 3 injection attempts succeed.
- DeepSeek V3.2: Scored 17.4% on injection, basically no resistance.
- Qwen 3.5 API vs local: Almost identical extraction (81.6% vs 81.7%) but the local version is worse on injection (46.9% vs 29.8%) and much worse on boundary integrity (59.8% vs 44.6%). Running locally doesn't make it less capable at blocking extraction but does make it more vulnerable to injection.
Why Injection Matters
Extraction means someone steals your system prompt - bad, but recoverable. Injection means someone hijacks what your agent does. If your agent has tool access, file system access, or can make API calls, a successful injection can lead to data exfiltration, file deletion, or worse. Right now the best model in the world only blocks 73% of injection attempts.
Full methodology and results are public at agentseal.org/benchmark. The test prompt is also published so anyone can reproduce the results.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Claude Code VS Code Extension Leaks Selection State Across Closed Files and New Sessions
A bug in Claude Code's VS Code extension caches file selection state even after the file is closed, exposing sensitive data (e.g., Supabase service-role keys) to a brand new CLI session. Full repro steps and GitHub issue #58886.

AI Agents Enable Solo Hackers to Breach Governments and Ransomware Campaigns
A solo operator using Claude Code and ChatGPT exfiltrated 150 GB from Mexican government agencies, including 195 million taxpayer records. Another attacker used Claude Code to run an end-to-end extortion campaign against 17 healthcare and emergency services organizations.

Open-source RAG attack and defense lab for local ChromaDB + LM Studio stacks
An open-source lab measures RAG knowledge base poisoning effectiveness on default local setups with ChromaDB and LM Studio, showing 95% success rate on undefended systems and evaluating practical defenses.

Open Source AI Tools Pose Security Risks Through 'Illusory Security Through Transparency'
A Reddit post warns about malware disguised as open-source AI agents and tools, where malicious code can be hidden in large codebases that users assume are safe because they're on GitHub. The post describes how 'vibe-coding' and autonomous AI agents condition users to run unknown programs without review.