    283 points summarity | 15 comments
    1. hinterlands ◴[] No.44371233[source]
    Xbow has really smart people working on it, so they're well-aware of the usual 30-second critiques that come up in this thread. For example, they take specific steps to eliminate false positives.

    The #1 spot in the ranking is both more of a deal and less of a deal than it might appear. It's less of a deal in that HackerOne is an economic numbers game. There are countless programs you can sign up for, with varied difficulty levels and payouts. Most of them don't pay a whole lot and don't attract top talent in the industry. Instead, they offer supplemental income to infosec-minded school-age kids in the developing world. So I wouldn't read this as "Xbow is the best bug hunter in the US". That's a bit of a marketing gimmick.

    But this is also not a particularly meaningful objection. The problem is that there are a lot of low-hanging bugs that need squashing, and it's hard to allocate sufficient resources to that. Top infosec talent doesn't want to do it (and there's not enough of it). Consulting companies can do it, but they inevitably end up stretching themselves too thin, so the coverage ends up being hit-and-miss. There's a huge market for tools that can find easy bugs cheaply and without too many false positives.

    I personally don't doubt that LLMs and related techniques are well-tailored for this task, completely independent of whether they can outperform leading experts. But there are skeptics, so I think this is an important real-world result.

    replies(5): >>44371286 #>>44371525 #>>44372166 #>>44372632 #>>44375954 #
    2. absurdo ◴[] No.44371286[source]
    > so they're well-aware of the usual 30-second critiques that come up in this thread.

    Succinct description of HN. It’s a damn shame.

    3. normie3000 ◴[] No.44371525[source]
    > Top infosec talent doesn't want to do it (and there's not enough of it).

    What is the top talent spending its time on?

    replies(6): >>44371808 #>>44371813 #>>44373246 #>>44373714 #>>44375676 #>>44376223 #
    4. hinterlands ◴[] No.44371808[source]
    Vulnerability researchers? For public projects, there's a strong preference for prestige stuff: ecosystem-wide vulnerabilities, new attack techniques, attacking cool new tech (e.g., self-driving cars).

    To pay bills: often working for tier A tech companies on intellectually-stimulating projects, such as novel mitigations, proprietary automation, etc. Or doing lucrative consulting / freelance work. Generally not triaging Nessus results 9-to-5.

    5. tptacek ◴[] No.44371813[source]
    Specialized bug-hunting.
    6. bgwalter ◴[] No.44372166[source]
    Maybe that is because the article is chaotic (like any "AI" article) and does not really address the false-positive issue in a well-presented manner? Or even at all?

    Below, people are reading the tea leaves to get any clue.

    replies(1): >>44375966 #
    7. Sytten ◴[] No.44372632[source]
    100% agree with OP: to make a living in BBH you can't spend all day hunting on VDP programs that don't pay anything. That means those programs will have a lot of low-hanging fruit.

    I don't think LLMs replace humans; they do free up time for nicer tasks.

    replies(1): >>44377186 #
    8. UltraSane ◴[] No.44373246[source]
    The best paying bug bounties.
    9. atemerev ◴[] No.44373714[source]
    "A bolt cutter pays for itself starting from the second bike"
    10. mr_mitm ◴[] No.44375676[source]
    Working from 9 to 5 for a guaranteed salary that is not dependent on how many bugs you find before anybody else, and not having to argue your case or negotiate the bounty.
    11. moomin ◴[] No.44375954[source]
    Honestly I think this is extremely impressive, but it also raises what I call the “junior programmer” problem. Say XBOW gets good enough to hoover up basically all that money and can do it cost-effectively. What then happens to the pipeline of security researchers?
    12. moomin ◴[] No.44375966[source]
    There are two whole paragraphs under a dedicated heading. I don’t think the problem is with the article here. Paragraphs reproduced below:

    AI can be remarkably effective at discovering a broad range of vulnerabilities—but the real challenge isn’t always detection; it’s precision. Automation has long struggled with false positives, and nowhere is this more evident than in vulnerability scanning. Tools that flag dozens of irrelevant issues often create more work than they save. When AI enters the equation, the stakes grow even higher: models can generalize well, but verifying technical edge cases is a different game entirely.

    To ensure accuracy, we developed the concept of validators, automated peer reviewers that confirm each vulnerability XBOW uncovers. Sometimes this process leverages a large language model; in other cases, we build custom programmatic checks. For example, to validate Cross-Site Scripting findings, a headless browser visits the target site to verify that the JavaScript payload was truly executed. (don’t miss Brendan Dolan-Gavitt’s BlackHat presentation on AI agents for Offsec)
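
    The XSS validator idea in the quoted paragraph can be sketched roughly as follows. This is an illustrative sketch, not XBOW's actual code: the function names (`make_xss_probe`, `confirm_execution`) are invented, and the quoted article's headless-browser visit is simplified here into a list of captured console-log lines, which is an assumption about how execution evidence might be collected.

```python
import secrets

def make_xss_probe():
    # A unique random token ties any observed execution back to this
    # specific probe, rather than to some unrelated script on the page.
    token = secrets.token_hex(8)
    payload = f"<script>console.log('xss-exec-{token}')</script>"
    return token, payload

def confirm_execution(token, console_logs):
    # Seeing the payload merely reflected in the page HTML is a classic
    # false positive; confirmation requires evidence that the JavaScript
    # actually ran (here: the marker appearing in browser console output).
    marker = f"xss-exec-{token}"
    return any(marker in line for line in console_logs)
```

    The design point is the one the article makes: detection (the payload appears somewhere) and validation (the payload provably executed) are separate checks, and only the latter filters out false positives.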

    replies(1): >>44376227 #
    13. kalium-xyz ◴[] No.44376223[source]
    In my experience, they work on random personal projects 90% of the time.
    14. eeeeeeehio ◴[] No.44376227{3}[source]
    This doesn't say anything about how many false positives they actually have. Yes, you can write other programs (that might even invoke another LLM!) to "check" the findings. That's a very obvious and reasonable thing to do. But all "vulnerability scanners", AI or not, must take steps to avoid FPs -- that doesn't tell us how well they actually work.

    The glaring omission here is a discussion of how many bugs the XBOW team had to manually review in order to make ~1k "valid" submissions. They state:

    > It was a unique privilege to wake up each morning and review creative new exploits.

    How much of every morning was spent reviewing exploits? And what % of them turned out to be real bugs? These are the critical questions that (a) are unanswered by this post, and (b) determine the success of any product in this space imo.

    15. skeeter2020 ◴[] No.44377186[source]
    ...which is exactly what technology advancements in our field have done since its inception, vs. the "this changes everything for everybody forever" narrative that makes AI cheerleaders so exhausting.