Nepenthes is a tarpit to catch AI web crawlers

(zadzmo.org)

bflesch ◴[16 Jan 25 15:46 UTC] No.42726827[source]▶

>>42725147 (OP) #

Haha, this would be an amazing way to test the ChatGPT crawler reflective DDOS vulnerability [1] I published last week.

Basically, a single HTTP request to the ChatGPT API can trigger 5000 HTTP requests from the ChatGPT crawler to a website.

The vulnerability is/was thoroughly ignored by OpenAI/Microsoft/BugCrowd but I really wonder what would happen when ChatGPT crawler interacts with this tarpit several times per second. As ChatGPT crawler is using various Azure IP ranges I actually think the tarpit would crash first.

The vulnerability reporting experience with OpenAI / BugCrowd was really horrific. It's always difficult to get attention for DOS/DDOS vulnerabilities and companies always act like they are not a problem. But if their system goes dark and the CEO calls then suddenly they accept it as a security vulnerability.

I spent a week trying to reach OpenAI/Microsoft to get this fixed, but I gave up and just published the writeup.

For legal reasons, I don't recommend exploiting this vulnerability.

[1] https://github.com/bf/security-advisories/blob/main/2025-01-...

replies(12): >>42727288 #>>42727356 #>>42727528 #>>42727530 #>>42733203 #>>42733949 #>>42738239 #>>42742714 #>>42748667 #>>42777344 #>>42777350 #>>42792278 #

1. JohnMakin ◴[16 Jan 25 16:16 UTC] No.42727288[source]▶

>>42726827 #

Nice find, I think one of my sites actually got recently hit by something like this. And yea, this kind of thing should be trivially preventable if they cared at all.

replies(2): >>42727906 #>>42731618 #

2. dewey ◴[16 Jan 25 17:06 UTC] No.42727906[source]▶

>>42727288 (TP) #

> And yea, this kind of thing should be trivially preventable if they cared at all.

Most of the time when someone says something is "trivial" without knowing anything about the internals, it's never trivial.

As someone working close to the b2c side of a business, I can’t count the number of times I've heard that something should be trivial when it's something we've thought about for years.

replies(4): >>42728034 #>>42728078 #>>42728234 #>>42729816 #

3. grahamj ◴[16 Jan 25 17:14 UTC] No.42728034[source]▶

>>42727906 #

If you’re unable to throttle your own outgoing requests you shouldn’t be making any

replies(1): >>42728110 #

4. bflesch ◴[16 Jan 25 17:17 UTC] No.42728078[source]▶

>>42727906 #

The technical flaws are quite trivial to spot, if you have the relevant experience:

- urls[] parameter has no size limit

- urls[] parameter is not deduplicated (but their cache is deduplicating, so this security control was there at some point but is ineffective now)

- their requests to the same website / DNS / victim IP address rotate through all available Azure IPs, which puts them at risk of being blocked by other hosters. They should come from the same IP address. I noticed them changing to other Azure IP ranges several times, most likely because they got blocked/rate limited by Hetzner or other counterparties from which I was playing around with this vulnerability.
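
The first two points in the list above (no size limit, no deduplication) can be sketched as a small input check. This is only an illustration and not OpenAI's code: the function name `normalize_urls`, the cap `MAX_URLS`, and the choice to ignore URL fragments when comparing are all assumptions made for the example.

```python
from urllib.parse import urlsplit

MAX_URLS = 10  # hypothetical cap; the right number is a policy decision


def normalize_urls(urls, max_urls=MAX_URLS):
    """Deduplicate a list of URLs, then reject the list if it is still too long."""
    seen = set()
    unique = []
    for raw in urls:
        candidate = raw.strip()
        parts = urlsplit(candidate)
        if parts.scheme not in ("http", "https") or not parts.hostname:
            raise ValueError(f"unsupported URL: {candidate!r}")
        # urlsplit lowercases the scheme and hostname; the fragment is left out
        # of the key so that ".../a" and ".../a#x" count as the same URL.
        key = (parts.scheme, parts.hostname, parts.port, parts.path or "/", parts.query)
        if key in seen:
            continue
        seen.add(key)
        unique.append(candidate)
    if len(unique) > max_urls:
        raise ValueError(f"too many URLs: {len(unique)} > {max_urls}")
    return unique
```

Note that this does not limit how many distinct paths on one host a single request can name; that would need a separate per-host cap.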

But if their team is too limited to recognize security risks, there is nothing one can do. Maybe they were occupied last week with the office gossip around the sexual assault lawsuit against Sam Altman. Maybe they still had holidays or there was another, higher-risk security vulnerability.

Having interacted with several bug bounties in the past, it feels like OpenAI is not very mature in that regard. Also, why do they choose BugCrowd when HackerOne is much better in my experience?

replies(1): >>42728271 #

5. bflesch ◴[16 Jan 25 17:20 UTC] No.42728110{3}[source]▶

>>42728034 #

I assume it'll be hard for them to notice because it's all coming from Azure IP ranges. OpenAI has a very big credit card behind this Azure account, so this vulnerability might only be limited by Azure capacity.

I noticed they switched their crawler to new IP ranges several times, but unfortunately Microsoft CERT / Azure security team didn't respond to my reports.

If this vulnerability is exploited, it hits your server with MANY requests per second, straight from the heart of the Azure cloud.

replies(1): >>42728152 #

6. grahamj ◴[16 Jan 25 17:23 UTC] No.42728152{4}[source]▶

>>42728110 #

Note I said outgoing, as in the crawlers should be throttling themselves
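
One common way for a crawler to throttle its own outgoing requests is a token bucket per target host. The sketch below is a generic illustration, not a description of how any real crawler works; the class name `HostThrottle` and the default `rate` and `burst` values are made up for the example.

```python
import time
from collections import defaultdict


class HostThrottle:
    """Per-host token bucket: up to `burst` requests at once, refilled at `rate` per second."""

    def __init__(self, rate=1.0, burst=5, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self._tokens = defaultdict(lambda: float(burst))
        self._last = {}

    def allow(self, host):
        """Return True if a request to `host` may be sent now, consuming one token."""
        now = self.clock()
        last = self._last.get(host, now)
        # Refill according to elapsed time, never above the burst size.
        tokens = min(float(self.burst), self._tokens[host] + (now - last) * self.rate)
        self._last[host] = now
        if tokens >= 1.0:
            self._tokens[host] = tokens - 1.0
            return True
        self._tokens[host] = tokens
        return False
```

A crawler would call `allow(host)` before each fetch and queue or drop the request when it returns `False`, so one oversized job cannot turn into thousands of simultaneous hits on a single site.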

replies(1): >>42728310 #

7. ◴[16 Jan 25 17:30 UTC] No.42728234[source]▶

>>42727906 #

8. fc417fc802 ◴[16 Jan 25 17:34 UTC] No.42728271{3}[source]▶

>>42728078 #

> rotate through all available Azure IPs, ... They should come from the same IP address.

I would guess that this is intentional, intended to prevent IP level blocks from being effective. That way blocking them means blocking all of Azure. Too much collateral damage to be worth it.

replies(1): >>42737651 #

9. bflesch ◴[16 Jan 25 17:37 UTC] No.42728310{5}[source]▶

>>42728152 #

Sorry for misunderstanding your point.

I agree it should be throttled. Maybe they don't need to throttle because they don't care about cost.

Funny thing is that servers from AWS were trying to connect to my system when I played around with this - I assume OpenAI has not moved away from AWS yet.

Also, many different security scanners hit my IP after every burst of incoming requests from the ChatGPT crawler Azure IP ranges. Quite interesting to see that there are some proper network admins out there.

replies(2): >>42729758 #>>42729871 #

10. grahamj ◴[16 Jan 25 19:27 UTC] No.42729758{6}[source]▶

>>42728310 #

yeah it’s fun out on the wild internet! Thankfully I don’t manage anything crawlable anymore but even so the endpoint traffic is pretty entertaining sometimes.

What would keep me up at night if I was still more on the ops side is “computer use” AI that’s virtually indistinguishable from a human with a browser. How do you keep the junk away then?

11. jillyboel ◴[16 Jan 25 19:33 UTC] No.42729816[source]▶

>>42727906 #

now try to reply to the actual content instead of some generalizing grandstanding bullshit

12. jillyboel ◴[16 Jan 25 19:38 UTC] No.42729871{6}[source]▶

>>42728310 #

They need to throttle because otherwise they're simply a DDoS service. It's clear they don't give a fuck though, like any bigtech company. They'll spend millions on prosecuting anyone who dares to do what they perceive as a DoS attack against them, but they'll spit in your face and laugh at you if you even dare to claim they are DDoSing you.

13. zanderwohl ◴[16 Jan 25 22:16 UTC] No.42731618[source]▶

>>42727288 (TP) #

IDK, I feel that if you're doing 5000 HTTP calls to another website it's kind of good manners to fix that. But OpenAI has never cared about the public commons.

replies(2): >>42731695 #>>42739329 #

14. marginalia_nu ◴[16 Jan 25 22:24 UTC] No.42731695[source]▶

>>42731618 #

Yeah, even beyond common decency, there's pretty strong incentives to fix it, as it's a fantastic way of having your bot's fingerprint end up on Cloudflare's shitlist.

replies(1): >>42741909 #

15. jackcviers3 ◴[17 Jan 25 14:10 UTC] No.42737651{4}[source]▶

>>42728271 #

It is. There are third-party scraping services you can pay for that will do all of this for you, including dealing with IP blocks. You then make your request to the third-party scraper, receive the contents, and do with them whatever you need to do.

16. chefandy ◴[17 Jan 25 16:08 UTC] No.42739329[source]▶

>>42731618 #

Nobody in this space gives a fuck about anyone outside of the people paying for their top-tier services, and even then, they only care about them when their bill is due. They don't care about their regular users, don't care about the environment, don't care about the people that actually made the "data" they're re-selling... nobody.

17. bflesch ◴[17 Jan 25 18:52 UTC] No.42741909{3}[source]▶

>>42731695 #

Kinda disappointed by cloudflare - it feels they have quite basic logic only. Why would anomaly detection not capture these large payloads?

There was a zip-bomb-like attack a year ago where you could send one gigabyte of the letter "A", compressed with brotli into a very small payload, via cloudflare to backend servers, basically something like the old HTTP Transfer-Encoding (which has been discontinued).

Attacker --1kb--> Cloudflare --1GB--> backend server

Obviously the servers who received the extracted HTTP request from the cloudflare web proxies were getting killed but cloudflare didn't even accept it as a valid security problem.
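
A backend (or proxy) can defend against this kind of amplification by capping how much it is willing to decompress. The sketch below uses Python's standard `zlib` as a stand-in, because brotli is not in the standard library; the function name `bounded_decompress` and the 1 MiB limit are assumptions for the example, not anything Cloudflare is known to do.

```python
import zlib

MAX_DECOMPRESSED = 1024 * 1024  # hypothetical 1 MiB cap on the expanded body


def bounded_decompress(data, limit=MAX_DECOMPRESSED):
    """Decompress zlib data, refusing to produce more than `limit` bytes."""
    decompressor = zlib.decompressobj()
    # Ask for at most limit + 1 bytes: getting that extra byte proves the
    # payload is too large without ever expanding the whole thing.
    out = decompressor.decompress(data, limit + 1)
    if len(out) > limit:
        raise ValueError("decompressed body exceeds limit")
    if not decompressor.eof:
        raise ValueError("compressed stream is truncated")
    return out
```

With this check, a few kilobytes of compressed "A"s are rejected after about 1 MiB of output instead of being expanded to their full size in memory.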

AFAIK there was no magic AI security monitoring anomaly detection thing which blocked anything. Sometimes I'd love to see the old web application firewall warnings for single and double quotes just to see if the thing is still there. But maybe it's misconfiguration on side of cloudflare user because I can remember they at least had a WAF product in the past.

replies(1): >>42745097 #

18. benregenspan ◴[18 Jan 25 01:52 UTC] No.42745097{4}[source]▶

>>42741909 #

> But maybe it's misconfiguration on side of cloudflare user because I can remember they at least had a WAF product in the past

They still have a WAF product, though I don't think anything in the standard managed ruleset will fire just on quotes; the SQLi and XSS checks are a bit more sophisticated than that.

From personal experience, they will fire a lot if someone uses a WAF-protected CMS to write a post about SQL.
