Security Benchmark: 10 LLMs Tested Against 211 Adversarial Probes

✍️ OpenClawRadar📅 Published: March 8, 2026🔗 Source
Security Benchmark: 10 LLMs Tested Against 211 Adversarial Probes
Ad

A security researcher conducted a systematic test of 10 different LLMs against 211 adversarial security probes to evaluate how they handle attacks in real-world scenarios.

Test Methodology

The researcher used a standardized setup with temperature 0 and identical API calls for every model. The test included 82 extraction probes (attempting to steal system prompts) and 109 injection probes (attempting to hijack model behavior). A honeypot system prompt loaded with fake PII, SSH keys, and API credentials was used as bait.

Key Findings

  • Extraction resistance is mostly solved: Most models are decent at blocking "repeat your system prompt" type attacks. The average across all models is around 85%.
  • Injection resistance is not solved: Average is 46.2%, meaning more than half of injection attacks succeed across the board.
  • Universal failures: Every single model failed on delimiter attacks, distractor injection, and style injection. 0% resistance on those categories across all 10 models.
  • Dead attack patterns: Every model resisted payload splitting and typo evasion at 100%.
Ad

Model-Specific Results

  • Claude Opus: Scored 72.7% on injection resistance, the best of any model tested. Still means over 1 in 4 injection attacks work.
  • GPT-5.4: Has perfect extraction and boundary scores but only 50% injection resistance.
  • GPT-5.3 Codex: The model behind Codex CLI that runs code on your machine scored 34.5% on injection. 2 out of 3 injection attempts succeed.
  • DeepSeek V3.2: Scored 17.4% on injection, basically no resistance.
  • Qwen 3.5 API vs local: Almost identical extraction (81.6% vs 81.7%) but the local version is worse on injection (46.9% vs 29.8%) and much worse on boundary integrity (59.8% vs 44.6%). Running locally doesn't make it less capable at blocking extraction but does make it more vulnerable to injection.

Why Injection Matters

Extraction means someone steals your system prompt - bad, but recoverable. Injection means someone hijacks what your agent does. If your agent has tool access, file system access, or can make API calls, a successful injection can lead to data exfiltration, file deletion, or worse. Right now the best model in the world only blocks 73% of injection attempts.

Full methodology and results are public at agentseal.org/benchmark. The test prompt is also published so anyone can reproduce the results.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also

Claude Code VS Code Extension Leaks Selection State Across Closed Files and New Sessions
Security

Claude Code VS Code Extension Leaks Selection State Across Closed Files and New Sessions

A bug in Claude Code's VS Code extension caches file selection state even after the file is closed, exposing sensitive data (e.g., Supabase service-role keys) to a brand new CLI session. Full repro steps and GitHub issue #58886.

OpenClawRadar
AI Agents Enable Solo Hackers to Breach Governments and Ransomware Campaigns
Security

AI Agents Enable Solo Hackers to Breach Governments and Ransomware Campaigns

A solo operator using Claude Code and ChatGPT exfiltrated 150 GB from Mexican government agencies, including 195 million taxpayer records. Another attacker used Claude Code to run an end-to-end extortion campaign against 17 healthcare and emergency services organizations.

OpenClawRadar
Open-source RAG attack and defense lab for local ChromaDB + LM Studio stacks
Security

Open-source RAG attack and defense lab for local ChromaDB + LM Studio stacks

An open-source lab measures RAG knowledge base poisoning effectiveness on default local setups with ChromaDB and LM Studio, showing 95% success rate on undefended systems and evaluating practical defenses.

OpenClawRadar
Open Source AI Tools Pose Security Risks Through 'Illusory Security Through Transparency'
Security

Open Source AI Tools Pose Security Risks Through 'Illusory Security Through Transparency'

A Reddit post warns about malware disguised as open-source AI agents and tools, where malicious code can be hidden in large codebases that users assume are safe because they're on GitHub. The post describes how 'vibe-coding' and autonomous AI agents condition users to run unknown programs without review.

OpenClawRadar