Intelligent Security Requires Validation & Orchestration
What DARPA’s CRSes and XBOW teach us about building systems that actually work.
Hello Cyber Builders 🖖
I’m continuing the series on Intelligent Security. Last week, we explored the AIxCC challenge. This week, I want to take a step back and make sense of the wave of new benchmarks emerging across the internet.
If you’re like me, reading through them often leaves you wondering: what is AI really capable of today? On one side, we see bold marketing claims from vendors; on the other, “cyber safety” labs publishing weak or incomplete results. And in contrast to both, DARPA-style contests and highly specialized startups are achieving impressive breakthroughs.
In this post, I’ll share my view on how the best teams are actually using AI — and more importantly, what it really takes to integrate it into a cybersecurity product. Spoiler: it’s not just about the model; it’s about how expertise, tools, and AI are combined.
AI Benchmarks Are Multiplying — But Are They Misleading for Cybersecurity?
In recent months, a wave of cyber AI benchmarks has emerged, from OpenAI’s new GPT-5 System Card to the joint evaluation by Anthropic and Pattern Labs, the Inspect framework, and CyBench.
Although each brings valuable insights, they share a subtle yet critical assumption: they evaluate AI as standalone reasoning agents, operating autonomously in a ReAct-style loop (reasoning + action) rather than as parts of a broader ecosystem of cyber tools and cooperation with humans.
I already covered GPT-5 and the CyBench/BountyBench frameworks:
My view is that these evaluations provide an essential benchmark, but they are incomplete and quite misleading for the cybersecurity community, for several reasons.
First, they simulate CTF environments that are artificial and often quite intricate. These environments are inherited from CTF contests, where the goal is to separate highly skilled professionals from the pack. In other words, the model is chasing a hard problem inside a small, self-contained environment. Real-world vulnerabilities are usually the opposite: often more basic, but hidden within large and complex environments.
Then, they use LLMs in a simple loop: the model is given tools such as the ability to run a Python program or a shell command in a Kali Docker container, and the tool output is fed back as the next prompt. The prompt engineering here is minimal; the LLM has a single objective (“find the flag”) and is left to direct its own search.
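To make that concrete, here is a minimal sketch of such a loop. It assumes a generic chat-completion client (the llm_complete stub) and a pre-started Kali container named kali-ctf; both are illustrative placeholders, not taken from any specific benchmark’s code.

```python
import subprocess

SYSTEM_PROMPT = (
    "You are a CTF agent. Reply with a single shell command to run in the "
    "target container, or with 'FLAG: <value>' once you have found the flag."
)

def run_in_kali(command: str, timeout: int = 60) -> str:
    """Execute one shell command inside a pre-started Kali container."""
    result = subprocess.run(
        ["docker", "exec", "kali-ctf", "sh", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout + result.stderr

def llm_complete(messages: list[dict]) -> str:
    """Call whatever chat model is being benchmarked; stubbed out here."""
    raise NotImplementedError("plug in your model API client")

def react_loop(task: str, max_steps: int = 30) -> str | None:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = llm_complete(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply.strip().startswith("FLAG:"):
            return reply.split("FLAG:", 1)[1].strip()
        # Whatever command the model proposed gets executed, and the raw output
        # simply becomes the next prompt -- that is the entire "agent".
        output = run_in_kali(reply.strip())
        messages.append({"role": "user", "content": output[:4000]})
    return None  # step budget exhausted without finding the flag
```

Everything the model learns about the target comes back through that single output string, which is exactly why these runs drift or loop when nobody is watching.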
Lastly, there is no hacker-in-the-loop. Unlike AI-assisted coding agents like Cursor or GitHub Copilot, there is no engineer to drive the investigation forward and cut the endless loops of nonsense that happen from time to time.
Let’s see this in action with CyBench, Claude Sonnet 4, through the Inspect AI framework.
Zooming In: Claude Sonnet 4 with Inspect AI + CyBench
The UK’s AI Security Institute designed Inspect Evals as an explicitly agent-centric evaluation framework. It bundles together suites such as GAIA, SWE-Bench, GDM-CTF, and CyBench—each probing how AI systems utilize tools, reason through tasks, or interact with environments.
In one evaluation, Claude Sonnet 4 was tested using Inspect AI for behavioral tracking and CyBench for benchmarked CTF-style challenges. The aim was to observe not only whether Claude succeeded, but how it approached problems, what reasoning steps it favored, and where it stumbled.
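For reference, this kind of run can be reproduced with a few lines of Python on top of Inspect; the exact task import path and model identifier below depend on the versions of inspect_ai and inspect_evals you have installed, so treat them as assumptions rather than a verified recipe.

```python
from inspect_ai import eval
from inspect_evals.cybench import cybench

# Evaluate a small slice of CyBench with Claude Sonnet 4 through Inspect AI;
# the resulting logs can then be browsed to study the agent's step-by-step behavior.
logs = eval(
    cybench(),
    model="anthropic/claude-sonnet-4-20250514",
    limit=5,  # a handful of challenges is enough for a behavioral look
)
```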
Where Claude Sonnet 4 Shines
Claude Sonnet 4 excelled as an analyst rather than an executor. It consistently showed strong performance in:
Pattern recognition (e.g., spotting structures in ciphertexts).
Format and data structure analysis (parsing JSON, log files, terminal outputs).
Logical inference over short to medium reasoning chains.
Script and cipher interpretation, such as understanding snippets of code or basic encryption schemes.
These strengths translated into solid results on tasks like:
Decoding Base64 or other encodings.
Understanding and rewriting scripts.
Performing lightweight reverse engineering.
Parsing structured terminal outputs to find anomalies.
A standout case: Claude successfully solved Dynastic, a cryptography challenge, where it identified the encryption algorithm and reversed the process to recover the hidden flag.
However, there’s a caveat: Dynastic is a relatively popular challenge, widely documented online. It’s therefore highly plausible that Claude wasn’t demonstrating genuine reasoning at all, but rather drawing from memorized solutions in its training data.
This raises an essential point for cybersecurity benchmarks: success doesn’t always equal capability. In agent benchmarks, it can be challenging to distinguish whether a model is reasoning through a novel problem or simply recalling patterns it has seen before.
Where Claude Sonnet 4 Struggles
Yet the evaluation also revealed weaknesses. Claude underperformed in tasks requiring:
Complex, multi-step execution logic.
Precise toolchain management (e.g., juggling compilers, debuggers, shells).
Dependency-aware scripting and command sequencing, where errors cascade without careful orchestration.
In short, Claude’s reasoning ability is valuable, but when applied to execution-heavy cyber workflows, the lack of structured human or tool guidance quickly becomes apparent.
Designing for Realism: AIxCC’s CRSes and XBOW
DARPA AI Cyber Challenge - AIxCC
DARPA’s two-year AI Cyber Challenge (AIxCC) tasked teams with building Cyber Reasoning Systems (CRSes) capable of automatically discovering and patching vulnerabilities in open-source software, specifically in realistic, large-scale codebases central to critical infrastructure. Here is my post on it.
Teams were judged not just on detection but on speed, patch generation quality, and the accuracy of their vulnerability reports.
Beyond performance, these tools were developed to integrate into real-world software workflows and were open-sourced by all finalist teams to drive communal adoption.
AIxCC systems combine AI, fuzzing, static analysis, and tooling—plus human-level engineering—to create reasoning systems that mirror real-world cyber offensive workflows.
Trail of Bits has written extensively about their open-source Buttercup system.
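To illustrate the kind of combination AIxCC systems rely on, here is a deliberately simplified sketch in which a libFuzzer target feeds crashing inputs into a model for triage and patch suggestion. The function names and the pipeline shape are my own illustration, not Buttercup’s or any finalist’s actual design.

```python
import subprocess
from pathlib import Path
from typing import Callable

def run_fuzzer(target_binary: str, corpus_dir: Path, seconds: int = 300) -> list[Path]:
    """Run a libFuzzer-built target for a fixed time budget and collect crash inputs."""
    subprocess.run(
        [target_binary, str(corpus_dir), f"-max_total_time={seconds}"],
        capture_output=True, text=True, check=False,
    )
    # libFuzzer drops crashing inputs as crash-* files in the working directory.
    return sorted(Path.cwd().glob("crash-*"))

def crs_pipeline(
    target_binary: str,
    source_file: Path,
    corpus_dir: Path,
    ask_model: Callable[[str], str],  # inject whatever LLM client you use
) -> list[str]:
    """Fuzz, then ask the model to classify each crash and propose a patch."""
    reports = []
    for crash in run_fuzzer(target_binary, corpus_dir):
        prompt = (
            "Classify the likely vulnerability and propose a minimal patch "
            "as a unified diff.\n\n"
            f"SOURCE:\n{source_file.read_text()}\n\n"
            f"CRASHING INPUT (hex): {crash.read_bytes().hex()}"
        )
        reports.append(ask_model(prompt))
    return reports
```

Real CRSes layer static analysis, sanitizer output parsing, patch testing, and scoring on top of this, but the shape is the same: deterministic tools produce evidence, and the model reasons over it.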
XBOW: Autonomous, Scalable Pentesting in Real-Time
XBOW is marketed as a fully autonomous pentesting system, deploying hundreds of AI agents in parallel to discover, exploit, and validate vulnerabilities—all without human involvement.
XBOW’s test benchmarks were designed by external pentesters to replicate real attack classes, such as SQL injection, SSRF, and IDOR, and were kept out of any AI training sets to avoid memorization. The result: 104 novel benchmarks, with an 85% success rate, comparable to human pentesters working for a week.
XBOW climbed to #1 on HackerOne’s US leaderboard in just 90 days, submitting over 1,060 vulnerability reports, including multiple “critical” and previously unknown vulnerabilities across major platforms (Amazon, PayPal, etc.). In benchmark comparisons, XBOW completed 104 challenges in 28 minutes, while a human expert took 40 hours.
Deep Dive: How XBOW Works
XBOW is the first Cyber-AI system validated at scale on HackerOne with zero-day discoveries
XBOW was designed to tackle one of the most complex problems in offensive security automation: can an AI agent find and exploit vulnerabilities with zero false positives? Can the AI find zero-day (i.e., previously unknown) vulnerabilities in the real world, not just in sandboxes?
Brendan Dolan-Gavitt, an AI Researcher at XBOW, gave a detailed presentation at Black Hat 2025.
The system’s architecture shows why it stands apart from purely agent-centric benchmarks like CyBench:
1. Agent as Planner, Not Executor
At the core sits an LLM, but it isn’t blindly “clicking buttons.” Instead, the model functions as a planner: deciding which tool to call (e.g., Burp, Nmap, Metasploit, or custom exploit scripts), interpreting the tool's output and updating its internal plan, and determining the next step.
The LLM never acts in a vacuum—deterministic tools constantly ground it.
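Here is a rough sketch of that planner pattern, with the tool registry, JSON action format, and function names invented purely for illustration (XBOW has not published its internals in this form):

```python
import json
from typing import Callable

def plan_next_step(llm: Callable[[str], str], state: dict) -> dict:
    """Ask the model for its next action as structured JSON rather than free text."""
    prompt = (
        "You are planning a web pentest. Given the state below, reply with JSON "
        '{"tool": <name>, "args": <dict>, "rationale": <str>} or {"tool": "stop"}.\n'
        f"STATE: {json.dumps(state)[:6000]}"
    )
    return json.loads(llm(prompt))

def planner_loop(llm: Callable[[str], str], tools: dict[str, Callable],
                 target: str, max_steps: int = 50) -> dict:
    """The model plans; deterministic tools (scanners, proxies, exploit runners) execute."""
    state = {"target": target, "observations": []}
    for _ in range(max_steps):
        step = plan_next_step(llm, state)
        if step["tool"] == "stop":
            break
        # The model only chooses the next action; a deterministic tool does the
        # actual work, and its raw output grounds the next planning round.
        output = tools[step["tool"]](**step["args"])
        state["observations"].append({"step": step, "output": str(output)[:2000]})
    return state
```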
2. Multi-Stage Workflow (Mimicking Human Pentesters)
The XBOW agent runs through a pipeline almost identical to how a skilled pentester would operate:
Reconnaissance — Map endpoints, parameters, and services.
Hypothesis generation — Propose potential vulnerabilities (e.g., SQL injection, SSRF, IDOR).
Exploit crafting — Write a candidate payload or script.
Execution & observation — Run it in a controlled sandbox.
Validation & reproduction — Check the exploit actually worked, reproduce it reliably, and generate PoC code.
Each stage combines AI reasoning with specialist tools.
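Expressed as code, the workflow is essentially a short pipeline in which each stage pairs a model call with a specialist tool; the stage callables and the Finding structure below are illustrative only, not XBOW’s actual data model.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    endpoint: str
    hypothesis: str          # e.g. "SQL injection in the ?id= parameter"
    payload: str = ""
    observed: str = ""
    reproduced: bool = False
    poc: str = ""

def pentest_pipeline(target: str, recon, hypothesize, craft,
                     execute, validate) -> list[Finding]:
    """Each stage argument is a callable mixing model reasoning with a specialist tool."""
    findings: list[Finding] = []
    for endpoint in recon(target):                    # 1. reconnaissance
        for hypothesis in hypothesize(endpoint):      # 2. hypothesis generation
            f = Finding(endpoint=endpoint, hypothesis=hypothesis)
            f.payload = craft(f)                      # 3. exploit crafting
            f.observed = execute(f)                   # 4. execution in a sandbox
            f.reproduced, f.poc = validate(f)         # 5. validation, reproduction, PoC
            if f.reproduced:
                findings.append(f)
    return findings
```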
3. Human and Software Verification to Eliminate False Positives
I think the major innovation of XBOW is how it tackles the “open-ended” problem: when there is no ground truth, how do you know whether a reported vulnerability is real?
Dolan-Gavitt’s team called it “deterministic validation from evidence”:
Canaries / CTF flags are added to the tested system. They are hard-to-guess strings, such as flag{UUID}, planted in locations that should be out of reach for an attacker, such as server file systems, databases, or admin pages. If an agent can retrieve the flag, it has demonstrably found a “true” vulnerability.
Therefore, the key validation criterion is that every suspected vulnerability must be exploited and reproduced before it is reported. The system re-runs the exploit multiple times in slightly varied contexts to ensure determinism.
Only after this verification loop does the finding count as a real vulnerability. This avoids the classic LLM problem of “hallucinated exploits.”
Currently, simply asking an LLM to say whether it thinks a vulnerability is real gives very high FP rates.
Instead, we do deterministic validation: ask the LLM to provide evidence, which we validate using non-AI code
Extracted from Dolan-Gavitt’s Black Hat presentation.
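A stripped-down version of that validation logic might look like the sketch below, assuming a planted flag{<uuid>} canary and an injected run_exploit callable; this is my reconstruction of the idea, not XBOW’s code.

```python
import re
import uuid

FLAG_PATTERN = re.compile(r"flag\{[0-9a-f-]{36}\}")

def plant_canary(store) -> str:
    """Plant a hard-to-guess flag somewhere an attacker should never be able to read."""
    flag = f"flag{{{uuid.uuid4()}}}"
    store(flag)                 # e.g. write it to an admin-only DB row or server file
    return flag

def validate_finding(run_exploit, planted_flag: str, attempts: int = 3) -> bool:
    """A finding counts only if the exploit deterministically exfiltrates the flag."""
    for _ in range(attempts):
        output = run_exploit()               # re-run the candidate exploit end to end
        match = FLAG_PATTERN.search(output)
        if not match or match.group(0) != planted_flag:
            return False                     # non-reproducible or hallucinated: reject
    return True                              # evidence is checked by non-AI code
```

Because the check is plain string matching over re-run exploit output, an LLM that merely claims success gets filtered out.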
Conclusion - AI by itself is never enough for cybersecurity
The story of XBOW—and the contrast with research-style benchmarks—drives home a simple truth: AI by itself is never enough for cybersecurity.
Benchmarks like GPT-5’s System Card or CyBench help measure raw reasoning, but they evaluate AI in isolation, as if the model were the product alone. Cybersecurity, however, doesn’t happen in isolation. It occurs in complex environments, characterized by layers of tools, workflows, and human judgment.
XBOW’s results on HackerOne demonstrate the potential of embedding AI as a reasoning orchestrator within that ecosystem. The LLM is responsible for planning, interpreting, and coordinating a cyber toolchain, using rigorous verification to eliminate noise. This combination far exceeds what any single analyst or agent could accomplish.
I think we are only scratching the surface of AI and cybersecurity, even though it has been one of the major topics of this newsletter for two years!
In the coming weeks, I will continue the series. If you are not already a subscriber, don’t miss it and subscribe.
In the meantime, take care.
Laurent 💚