ETH Zurich and EPFL to release an LLM developed on public infrastructure

(ethz.ch)

514 points andy99 | 1 comments | 11 Jul 25 18:45 UTC

k__ ◴[11 Jul 25 19:32 UTC] No.44536047[source]▶

>>44535637 (OP) #

"respecting web crawling opt-outs during data acquisition produces virtually no performance degradation"

Great to read that!

replies(3): >>44536377 #>>44538760 #>>44539036 #

Onavo ◴[11 Jul 25 20:12 UTC] No.44536377[source]▶

>>44536047 #

No performance degradation on training metrics except for the end user. At the end of the day users and website owners have completely orthogonal interests. Users want answers and content, website owners want attention so they can upsell/push ads. You can only serve one master.

replies(1): >>44536456 #

esafak ◴[11 Jul 25 20:22 UTC] No.44536456[source]▶

>>44536377 #

> Users want answers and content, website owners want attention so they can upsell/push ads. You can only serve one master

How are you going to serve users if web site owners decide to wall their content? You can't ignore one side of the market.

replies(1): >>44536786 #

Onavo ◴[11 Jul 25 21:06 UTC] No.44536786[source]▶

>>44536456 #

You don't. You bypass them with crawlers and don't reveal your training data. And this is exactly why open source models can't surpass open weight models.

replies(1): >>44537425 #

diggan ◴[11 Jul 25 22:30 UTC] No.44537425[source]▶

>>44536786 #

> And this is exactly why open source models can't surpass open weight models.

It is a fair point, but how strong a point it is remains to be seen. Some architectures are better than others, even with the same training data, so it's not impossible that we could at some point see innovative architectures beating current proprietary ones. It would probably be short-lived though, as the proprietary ones would obviously improve in their next release after that.

replies(1): >>44537860 #

jowea ◴[11 Jul 25 23:29 UTC] No.44537860[source]▶

>>44537425 #

How can open source models respectful of robots.txt possibly perform equally if they are missing information that the other models have access to?
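For context on what "respectful of robots.txt" means at crawl time, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleLLMBot` user agent, the URLs, and the `allowed_urls` helper are all made up for illustration; this is not the pipeline described in the linked article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a real crawler would fetch this from the site.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return only the URLs that robots_txt permits user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


urls = [
    "https://example.com/articles/1",
    "https://example.com/private/notes",
]

# The site opted out of ExampleLLMBot entirely, so nothing is kept.
print(allowed_urls(ROBOTS_TXT, "ExampleLLMBot", urls))

# Any other bot falls under the "*" rules and only loses /private/.
print(allowed_urls(ROBOTS_TXT, "OtherBot", urls))
```

A crawler that honors opt-outs drops the filtered pages before they ever reach the training set, which is the "missing information" in question.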

replies(3): >>44538826 #>>44539844 #>>44540399 #

1. Dylan16807 ◴[12 Jul 25 06:34 UTC] No.44539844[source]▶

>>44537860 #

Maybe the missing data makes it 3% worse but the architecture is 5% better. Or your respect for robots.txt gets you more funding and you gain a 4% advantage by training longer.

Don't focus too much on a single variable, especially when all the variables have diminishing returns.
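A quick sketch of that arithmetic, assuming the effects are independent and compose multiplicatively (that is an assumption for illustration, as are the percentages, which are the hypothetical ones from the comment above and not measurements):

```python
def combined_effect(*relative_changes: float) -> float:
    """Compose relative changes multiplicatively and return the net relative change."""
    factor = 1.0
    for change in relative_changes:
        factor *= 1.0 + change
    return factor - 1.0


# Missing data costs 3%, a better architecture gains 5%.
print(f"{combined_effect(-0.03, 0.05):+.2%}")

# Missing data costs 3%, longer training gains 4%.
print(f"{combined_effect(-0.03, 0.04):+.2%}")
```

Under these assumptions both scenarios come out net positive (0.97 × 1.05 = 1.0185 and 0.97 × 1.04 = 1.0088), so a small loss on one variable can be outweighed by a modest gain on another.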
