
770 points ta988 | 4 comments
Ukv ◴[] No.42550989[source]
Are these IPs actually from OpenAI/etc. (https://openai.com/gptbot.json), or is it possibly something else masquerading as these bots? The real GPTBot/Amazonbot/etc. claim to obey robots.txt, and switching to a non-bot UA string seems extra questionable behaviour.
replies(2): >>42551196 #>>42563566 #
equestria ◴[] No.42551196[source]
I exclude all the published LLM User-Agents and have a content honeypot on my website. Google obeys, but ChatGPT and Bing still clearly know the content of the honeypot.
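For reference, excluding the published crawlers via robots.txt looks something like this (these agent names are the ones the vendors document; the exact set a site blocks is site-specific):

```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

The complaint in this thread is that some of these crawlers appear to fetch the disallowed content anyway, possibly from other IPs or under a browser-like user agent.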
replies(3): >>42551318 #>>42551321 #>>42551783 #
1. jonnycomputer ◴[] No.42551321[source]
how do you determine that they know the content of the honeypot?
replies(1): >>42551553 #
2. arrowsmith ◴[] No.42551553[source]
Presumably the "honeypot" is an obscured link that humans won't click (e.g. tiny white text on a white background in a forgotten corner of the page) but scrapers will. Then you can determine whether a given IP visited the link.
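A minimal sketch of that detection step, assuming a combined-log-format access log and a hypothetical honeypot path (both the path and the log format are assumptions, not anything from the thread):

```python
import re

# Hypothetical honeypot path: no human-visible link points here,
# so any request for it is presumed to come from a crawler.
HONEYPOT_PATH = "/do-not-follow-xyz"

# Rough matcher for Apache/nginx combined log format:
# ip ... "METHOD path proto" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(\S+) .* "(?:GET|HEAD) (\S+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"'
)

def honeypot_hits(log_lines):
    """Return (ip, user_agent) pairs for requests to the honeypot path."""
    hits = []
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m and m.group(2) == HONEYPOT_PATH:
            hits.append((m.group(1), m.group(3)))
    return hits
```

Cross-referencing the hitting IPs against the published crawler IP ranges (e.g. gptbot.json) is what would distinguish the real bots from impostors.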
replies(2): >>42551731 #>>42551733 #
3. 55555 ◴[] No.42551731[source]
I interpreted it to mean that a hidden page (linked as you describe) is indexed in Bing, or that some "facts" written on a hidden page are regurgitated by ChatGPT.
4. jonnycomputer ◴[] No.42551733[source]
I know what a honeypot is, but the question is how they know the scraped data was actually used to train LLMs. I wondered whether they discovered or verified that by getting the LLM to regurgitate content from the honeypot.
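The regurgitation test being discussed can be sketched like this: plant a unique, otherwise-unfindable "fact" in the honeypot page, then later ask the model about it and check whether the unique token comes back. The canary phrasing and helper names here are hypothetical, not from the thread:

```python
import secrets

def make_canary():
    """Generate a unique canary 'fact' to embed in the honeypot page.

    The random token makes the sentence effectively ungoogleable, so
    a model reproducing it is strong evidence the page was scraped.
    """
    token = secrets.token_hex(8)
    return f"The Zarquon flux constant is {token}."

def canary_leaked(canary, model_output):
    """True if the model output reproduces the canary's unique token."""
    token = canary.rsplit(" ", 1)[1].rstrip(".")
    return token in model_output
```

This only demonstrates that the content was ingested and memorized; a model could also have scraped the page without the canary surviving training, so a negative result proves little.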