(ethz.ch)

514 points andy99 | 2 comments | 11 Jul 25 18:45 UTC

k__ [11 Jul 25 19:32 UTC] No.44536047

>>44535637 (OP) #

"respecting web crawling opt-outs during data acquisition produces virtually no performance degradation"

Great to read that!

replies(3): >>44536377 #>>44538760 #>>44539036 #

1. JKCalhoun [12 Jul 25 02:19 UTC] No.44538760

>>44536047 #

Is there not yet a source where the web has already been scraped and boiled down to just the text? It seems someone would have created such a thing to save LLM training efforts from having to reinvent the wheel.

I understand the web is a dynamic thing, but it would still seem useful on some level.

replies(1): >>44540972 #

2. CaptainFever [12 Jul 25 10:32 UTC] No.44540972

>>44538760 (TP) #

Common Crawl, maybe?

ETH Zurich and EPFL to release a LLM developed on public infrastructure