10 LLMs Tested Against 211 Adversarial Probes: Security Benchmark Results

A security researcher conducted a systematic test of 10 different LLMs against 211 adversarial security probes to evaluate how they handle attacks in real-world scenarios.

Test Methodology

The researcher used a standardized setup with temperature 0 and identical API calls for every model. The test included 82 extraction probes (attempting to steal system prompts) and 109 injection probes (attempting to hijack model behavior). A honeypot system prompt loaded with fake PII, SSH keys, and API credentials was used as bait.

Key Findings

Extraction resistance is mostly solved: Most models are decent at blocking "repeat your system prompt" type attacks. The average across all models is around 85%.
Injection resistance is not solved: Average is 46.2%, meaning more than half of injection attacks succeed across the board.
Universal failures: Every single model failed on delimiter attacks, distractor injection, and style injection. 0% resistance on those categories across all 10 models.
Dead attack patterns: Every model resisted payload splitting and typo evasion at 100%.

Model-Specific Results

Claude Opus: Scored 72.7% on injection resistance, the best of any model tested. Still means over 1 in 4 injection attacks work.
GPT-5.4: Has perfect extraction and boundary scores but only 50% injection resistance.
GPT-5.3 Codex: The model behind Codex CLI that runs code on your machine scored 34.5% on injection. 2 out of 3 injection attempts succeed.
DeepSeek V3.2: Scored 17.4% on injection, basically no resistance.
Qwen 3.5 API vs local: Almost identical extraction (81.6% vs 81.7%) but the local version is worse on injection (46.9% vs 29.8%) and much worse on boundary integrity (59.8% vs 44.6%). Running locally doesn't make it less capable at blocking extraction but does make it more vulnerable to injection.

Why Injection Matters

Extraction means someone steals your system prompt - bad, but recoverable. Injection means someone hijacks what your agent does. If your agent has tool access, file system access, or can make API calls, a successful injection can lead to data exfiltration, file deletion, or worse. Right now the best model in the world only blocks 73% of injection attempts.

Full methodology and results are public at agentseal.org/benchmark. The test prompt is also published so anyone can reproduce the results.

📖 Read the full source: r/LocalLLaMA