255 points ColinWright | 2 comments
1. throw_me_uwu ◴[] No.45780502[source]
> most likely trying to non-consensually collect content for training LLMs

No, it's just background internet scanning noise

replies(1): >>45780595 #
2. lucasluitjes ◴[] No.45780595[source]
This.

If you were writing a script to mass-scan the web for vulnerabilities, you would want to collect as many HTTP endpoints as possible. JS files, whether or not they're commented out, are a great way to find endpoints in modern web applications.
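
To illustrate why a comment makes no difference to that kind of scanner, here is a minimal sketch in Python. It is not taken from any real scanner: the regex, the `extract_endpoints` name, and the sample JS are all made up for illustration. It just pulls URL-like string literals out of the raw text and never parses the JS, so commented-out code is harvested like everything else.

```python
import re

# Hypothetical pattern (an assumption, not any real tool's rule): a quoted
# string literal that starts with http(s):// or with a root-relative "/".
ENDPOINT_RE = re.compile(r"""["'`]((?:https?://|/)[A-Za-z0-9_\-./?=&%:{}]+)["'`]""")


def extract_endpoints(js_source: str) -> list[str]:
    """Return unique URL-like string literals in order of first appearance."""
    seen: dict[str, None] = {}
    # The source is scanned as plain text, so comments are not skipped.
    for match in ENDPOINT_RE.finditer(js_source):
        seen.setdefault(match.group(1), None)
    return list(seen)


# Made-up sample: two of the four endpoints only appear inside comments.
sample = """
// fetch('/api/v1/users').then(render);
/* legacy: axios.post("/internal/admin/export", payload) */
const img = "/static/logo.png";
fetch("https://example.com/api/orders?id=1");
"""

print(extract_endpoints(sample))
```

A real scanner would presumably filter out static assets and probe whatever is left, but the collection step would look roughly like this.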

If you were writing a scraper to collect source code to train LLMs on, I doubt you would care as much about a commented-out JS file. I'm not sure you'd even want to train on random low-quality JS served by websites. Can anyone familiar with LLM training data collection comment on this?