524 points andy99 | 1 comment
WeirderScience No.44536327
The open training data is a huge differentiator. Is this the first truly open dataset of this scale? Prior efforts like The Pile were valuable, but had limitations. Curious to see how reproducible the training is.
replies(2): >>44536400 #>>44537249 #
layer8 No.44536400
> The model will be fully open: source code and weights will be publicly available, and the training data will be transparent and reproducible

This leads me to believe that the training data won’t be made publicly available in full, but merely be “reproducible”. This might mean that they’ll provide references like a list of URLs of the pages they trained on, but not their contents.

replies(3): >>44536448 #>>44536623 #>>44536818 #
1. WeirderScience No.44536448
Yeah, I suspect you're right. Still, even a list of URLs for a frontier model (assuming it does turn out to be of that level) would be welcome over the current situation.