PolyRange: Contamination-Resistant Offensive-AI Benchmark with LLM-Generated Targets

PolyRange v1.0 is an MIT-licensed, contamination-resistant offensive-AI benchmark for web security agents. Instead of relying on static targets that leak into training corpora, each PolyRange deployment is freshly generated by an LLM of the researcher's choice. This is meant to satisfy the 'newly constructed tasks' criterion that OpenAI, Anthropic, and UK AISI have publicly called for.
What PolyRange addresses
The author, CEO of Aether AI, argues that existing cyber-AI benchmarks fall into two lanes, neither of which measures what labs need. CTF-style benchmarks (DVWA, NYU CTF Bench, CyberGym, AutoPenBench) use static targets that contaminate future models, and bug-bounty-style benchmarks (XBOW) run against targets whose defensive infrastructure is undefined. PolyRange aims to bridge this gap with production-shape conditions, including active defenders.
Technical specifications
- 84 WSTG-derived classes spanning all 12 OWASP testing-guide categories
- Two defense tiers approximating active-defender conditions
- Real backends: Postgres dialects, real PHP for LFI, real shell for command injection, real Jinja2 for SSTI
- Agent-submits-flag oracle convention for scoring
- Single-command eval CLI
- Self-hostable on Fly.io or any Docker host
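The agent-submits-flag convention listed above can be illustrated with a minimal sketch. This is not PolyRange's actual code: the function names `new_flag` and `check_submission` and the `FLAG{...}` format are assumptions made for illustration. The idea is that each fresh deployment plants a random secret, and a run scores as a success only if the agent submits that exact value.

```python
import hmac
import secrets


def new_flag(prefix: str = "FLAG") -> str:
    """Mint a fresh random flag for one deployment (the format is an assumption)."""
    return f"{prefix}{{{secrets.token_hex(16)}}}"


def check_submission(expected: str, submitted: str) -> bool:
    """Return True only if the agent's submission matches the planted flag."""
    # Trim surrounding whitespace so a trailing newline does not fail a run,
    # then compare in constant time.
    return hmac.compare_digest(expected.encode(), submitted.strip().encode())
```

Because the flag is random per deployment, a correct submission is evidence that the agent actually read it from the exploited target rather than recalled it from training data.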
Because targets are regenerated per run via an LLM (the researcher chooses the generator model), there is no static artifact for future models to ingest. This addresses Anthropic's concern that 'this report will, itself, likely contribute to the problem.'
The benchmark uses a two-bucket entropy framing that separates exploit-recall axes from cosmetic/realism axes, two kinds of variation the author believes are conflated in the adjacent benchmark literature.
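One way to read the two-bucket framing is as separate entropy budgets, sketched below. The source does not say which axes belong to which bucket or how they are sampled, so everything here is an assumption: the counts 84 and 2 come from the specification list, placing defense tier in the exploit bucket is a guess, the cosmetic axes and their sizes are hypothetical, and axes are treated as uniform and independent.

```python
import math

# Counts taken from the specification list; bucket assignment is an assumption.
EXPLOIT_AXES = {"vulnerability_class": 84, "defense_tier": 2}
# Hypothetical cosmetic axes; the source does not enumerate them.
COSMETIC_AXES = {"site_theme": 16, "app_name": 256}


def bucket_bits(axes: dict[str, int]) -> float:
    """Bits of entropy if each axis is sampled uniformly and independently."""
    return sum(math.log2(n) for n in axes.values())


exploit_bits = bucket_bits(EXPLOIT_AXES)    # log2(84 * 2), about 7.39 bits
cosmetic_bits = bucket_bits(COSMETIC_AXES)  # log2(16 * 256) = 12 bits
```

Under these made-up numbers the cosmetic bucket carries more bits than the exploit bucket, which shows why a single combined diversity figure could overstate how much the exploit-relevant part of a target actually varies.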
A full empirical paper (with publishable-N results) depends on partnership funding, but the framework is available now.
📖 Read the full source: r/LocalLLaMA