
283 points by summarity | 2 comments
hinterlands ◴[] No.44371233[source]
Xbow has really smart people working on it, so they're well-aware of the usual 30-second critiques that come up in this thread. For example, they take specific steps to eliminate false positives.

The #1 spot in the ranking is both more of a deal and less of a deal than it might appear. It's less of a deal in that HackerOne is an economic numbers game. There are countless programs you can sign up for, with varied difficulty levels and payouts. Most of them pay not a whole lot and don't attract top talent in the industry. Instead, they offer supplemental income to infosec-minded school-age kids in the developing world. So I wouldn't read this as "Xbow is the best bug hunter in the US". That's a bit of a marketing gimmick.

But it's also more of a deal than the ranking alone suggests, even though topping it isn't a particularly meaningful objective in itself. The problem is that there are a lot of low-hanging bugs that need squashing, and it's hard to allocate sufficient resources to that. Top infosec talent doesn't want to do it (and there's not enough of it). Consulting companies can do it, but they inevitably end up stretching themselves too thin, so the coverage ends up being hit-and-miss. There's a huge market for tools that can find easy bugs cheaply and without too many false positives.

I personally don't doubt that LLMs and related techniques are well-tailored for this task, completely independent of whether they can outperform leading experts. But there are skeptics, so I think this is an important real-world result.

replies(5): >>44371286 #>>44371525 #>>44372166 #>>44372632 #>>44375954 #
bgwalter ◴[] No.44372166[source]
Maybe that is because the article is chaotic (like any "AI" article) and does not really address the false positive issue in a well-presented manner? Or even at all?

Below people are reading the tea leaves to get any clue.

replies(1): >>44375966 #
1. moomin ◴[] No.44375966[source]
There are two whole paragraphs under a dedicated heading. I don’t think the problem is with the article here. Paragraphs reproduced below:

AI can be remarkably effective at discovering a broad range of vulnerabilities—but the real challenge isn’t always detection; it’s precision. Automation has long struggled with false positives, and nowhere is this more evident than in vulnerability scanning. Tools that flag dozens of irrelevant issues often create more work than they save. When AI enters the equation, the stakes grow even higher: models can generalize well, but verifying technical edge cases is a different game entirely.

To ensure accuracy, we developed the concept of validators, automated peer reviewers that confirm each vulnerability XBOW uncovers. Sometimes this process leverages a large language model; in other cases, we build custom programmatic checks. For example, to validate Cross-Site Scripting findings, a headless browser visits the target site to verify that the JavaScript payload was truly executed. (don’t miss Brendan Dolan-Gavitt’s BlackHat presentation on AI agents for Offsec)
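
For readers wondering what such a validator might look like concretely, here is a minimal sketch (purely illustrative, not XBOW's implementation; the target URL, the marker token, and the choice of Playwright are all assumptions):

    # Illustrative sketch only, not XBOW's code. Assumes Playwright is installed
    # (pip install playwright && playwright install chromium) and that the candidate
    # payload calls console.log() with a unique marker when it executes.
    from playwright.sync_api import sync_playwright

    MARKER = "xss-probe-1337"  # hypothetical token embedded in the injected payload

    def validate_xss(target_url: str) -> bool:
        """Confirm the finding only if the injected JavaScript actually ran."""
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            hits = []
            # Listening for console output distinguishes real execution from a
            # payload that is merely reflected (and escaped) in the page HTML.
            page.on("console", lambda msg: hits.append(MARKER in msg.text))
            page.goto(target_url, wait_until="networkidle")
            browser.close()
        return any(hits)

    # e.g. validate_xss("https://example.test/search?q=<script>console.log('xss-probe-1337')</script>")

The point of the console-marker check is that it proves execution rather than mere reflection of the payload, which is exactly the false-positive distinction the article is drawing.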

replies(1): >>44376227 #
2. eeeeeeehio ◴[] No.44376227[source]
This doesn't say anything about how many false positives they actually have. Yes, you can write other programs (that might even invoke another LLM!) to "check" the findings. That's a very obvious and reasonable thing to do. But all "vulnerability scanners", AI or not, must take steps to avoid false positives -- that doesn't tell us how well they actually work.

The glaring omission here is a discussion of how many bugs the XBOW team had to manually review in order to make ~1k "valid" submissions. They state:

> It was a unique privilege to wake up each morning and review creative new exploits.

How much of every morning was spent reviewing exploits? And what % of them turned out to be real bugs? These are the critical questions that (a) are unanswered by this post, and (b) determine the success of any product in this space, imo.