
549 points | andy99 | 1 comment
k__ ◴[] No.44536047[source]
"respecting web crawling opt-outs during data acquisition produces virtually no performance degradation"

Great to read that!
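
(For context, "opt-outs" here means robots.txt-style signals. A minimal sketch of a crawler honoring them with Python's standard urllib.robotparser; the user agent and URLs below are placeholders, not anyone's actual crawler:)

    import urllib.robotparser

    USER_AGENT = "ExampleTrainingBot"  # placeholder crawler identity

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://example.com/robots.txt")
    robots.read()  # fetch and parse the site's robots.txt

    url = "https://example.com/docs/page.html"
    if robots.can_fetch(USER_AGENT, url):
        print("allowed, crawling", url)
    else:
        print("opted out, skipping", url)  # respect the opt-out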

replies(3): >>44536377 #>>44538760 #>>44539036 #
Onavo ◴[] No.44536377[source]
No performance degradation on training metrics except for the end user. At the end of the day users and website owners have completely orthogonal interests. Users want answers and content, website owners want attention so they can upsell/push ads. You can only serve one master.
replies(1): >>44536456 #
esafak ◴[] No.44536456[source]
> Users want answers and content, website owners want attention so they can upsell/push ads. You can only serve one master

How are you going to serve users if website owners decide to wall their content? You can't ignore one side of the market.

replies(1): >>44536786 #
Onavo ◴[] No.44536786[source]
You don't. You bypass them with crawlers and don't reveal your training data. And this is exactly why open source models can't surpass open weight models.
replies(1): >>44537425 #
diggan ◴[] No.44537425[source]
> And this is exactly why open source models can't surpass open weight models.

It's a fair point, but how strong a point it is remains to be seen. Some architectures are better than others even with the same training data, so it's not impossible that we'll see some innovative architecture beat the current proprietary ones at some point. It would probably be short-lived though, as the proprietary ones would obviously improve in their next release after that.

replies(1): >>44537860 #
jowea ◴[] No.44537860[source]
How can open source models that respect robots.txt possibly perform equally well if they are missing information that the other models have access to?
replies(3): >>44538826 #>>44539844 #>>44540399 #
datameta ◴[] No.44538826[source]
How can we possibly find out without trying?
replies(1): >>44538893 #
jowea ◴[] No.44538893[source]
It is logically impossible for an LLM to know, for example, that fooExecute() takes two int arguments if the documentation is blocked by robots.txt and there are no examples of fooExecute() usage in the wild, don't you agree?
replies(2): >>44539576 #>>44540377 #
diggan ◴[] No.44540377[source]
I agree, but I also think it's less important. I don't want a big fat LLM that memorized every API out there, where the weights have to be updated as soon as an API changes. I like the current approach of Codex (and similar) where they can look up the APIs they need as they're doing the work instead, so the same weights keep working no matter how much the APIs change.
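
Roughly the loop I mean, as a hedged sketch (ask_model and fetch_docs are hypothetical stand-ins, not Codex's actual API):

    # Sketch: look the API up at run time instead of baking it into the weights.
    # ask_model() and fetch_docs() are hypothetical stand-ins, not a real SDK.

    def fetch_docs(symbol: str) -> str:
        """Pretend documentation lookup, e.g. a web search or local index."""
        docs = {"fooExecute": "fooExecute(a: int, b: int) -> int"}
        return docs.get(symbol, "no docs found")

    def ask_model(prompt: str) -> str:
        """Stand-in for an LLM call; a real agent would hit a model API here."""
        if "Docs for fooExecute" in prompt:
            return "result = fooExecute(1, 2)"    # signature came from the docs
        return "CALL_TOOL fetch_docs fooExecute"  # model asks for a lookup

    def agent_step(task: str) -> str:
        reply = ask_model(task)
        if reply.startswith("CALL_TOOL fetch_docs"):
            symbol = reply.split()[-1]
            # Feed the fresh docs back in, so stale weights don't matter.
            reply = ask_model(task + "\nDocs for " + symbol + ": " + fetch_docs(symbol))
        return reply

    print(agent_step("Write a call to fooExecute"))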