
770 points ta988 | 10 comments
1. Ukv ◴[] No.42550989[source]
Are these IPs actually from OpenAI/etc. (https://openai.com/gptbot.json), or is it possibly something else masquerading as these bots? The real GPTBot/Amazonbot/etc. claim to obey robots.txt, and switching to a non-bot UA string seems extra questionable behaviour.
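A rough sketch of how one might check the first possibility: compare the IPs seen in the logs against the CIDR ranges the operator publishes. The prefixes below are reserved documentation ranges standing in for a real list, and `ip_in_ranges` is a hypothetical helper name. Nothing here assumes the actual layout of gptbot.json, so extracting the prefixes from it is left out.

```python
import ipaddress


def ip_in_ranges(ip, cidrs):
    """Return True if `ip` falls inside any of the CIDR blocks in `cidrs`."""
    addr = ipaddress.ip_address(ip)
    for cidr in cidrs:
        # strict=False tolerates entries that have host bits set.
        net = ipaddress.ip_network(cidr, strict=False)
        # Only compare like with like (IPv4 vs IPv6).
        if addr.version == net.version and addr in net:
            return True
    return False


# Placeholder ranges (RFC 5737 documentation prefixes), NOT real crawler ranges.
published = ["192.0.2.0/24", "198.51.100.0/24"]

for candidate in ["192.0.2.17", "203.0.113.5"]:
    print(candidate, ip_in_ranges(candidate, published))
```

A match only shows the request came from a published range. A miss on a request carrying the bot's UA string would suggest something else is masquerading as it.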
replies(2): >>42551196 #>>42563566 #
2. equestria ◴[] No.42551196[source]
I exclude all the published LLM User-Agents and have a content honeypot on my website. Google obeys, but ChatGPT and Bing still clearly know the content of the honeypot.
replies(3): >>42551318 #>>42551321 #>>42551783 #
3. Ukv ◴[] No.42551318[source]
Interesting - do you have a link?
replies(1): >>42551698 #
4. jonnycomputer ◴[] No.42551321[source]
How do you determine that they know the content of the honeypot?
replies(1): >>42551553 #
5. arrowsmith ◴[] No.42551553{3}[source]
Presumably the "honeypot" is an obscured link that humans won't click (e.g. tiny white text on a white background in a forgotten corner of the page) but scrapers will. Then you can determine whether a given IP visited the link.
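If that is the setup, finding who visited the link could look roughly like the sketch below. It assumes the server writes the common "combined" access log format, and both the `/hidden-bio.html` path and the `honeypot_visitors` name are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed "combined" log format:
# ip ident user [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def honeypot_visitors(lines, honeypot_path):
    """Map each IP that requested `honeypot_path` to the set of UA strings it sent."""
    hits = defaultdict(set)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed format
        # Ignore any query string when comparing paths.
        if m.group("path").split("?", 1)[0] == honeypot_path:
            hits[m.group("ip")].add(m.group("ua"))
    return dict(hits)


sample = [
    '192.0.2.17 - - [01/Jan/2025:00:00:01 +0000] "GET /hidden-bio.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '198.51.100.9 - - [01/Jan/2025:00:00:02 +0000] "GET /index.html HTTP/1.1" 200 2048 "-" "ExampleBot/1.0"',
]
print(honeypot_visitors(sample, "/hidden-bio.html"))
```

The resulting IPs can then be compared against whatever ranges a crawler operator publishes.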
replies(2): >>42551731 #>>42551733 #
6. equestria ◴[] No.42551698{3}[source]
Of course, but I'd rather not share it for obvious reasons. It is a nonsensical biography of a nonexistent person.
7. 55555 ◴[] No.42551731{4}[source]
I interpreted it to mean that a hidden page (linked as you describe) is indexed in Bing or that some "facts" written on a hidden page are regurgitated by ChatGPT.
8. jonnycomputer ◴[] No.42551733{4}[source]
I know what a honeypot is, but the question is how they know the scraped data was actually used to train LLMs. I wondered whether they discovered or verified that by getting the LLM to regurgitate content from the honeypot.
9. pogue ◴[] No.42551783[source]
What's the purpose of the honeypot? Poisoning the LLM or identifying user agents/IPs that shouldn't be seeing it?
10. anonnon ◴[] No.42563566[source]
I don't trust OpenAI, and I don't know why anyone else would at this point.