Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data

Lots of assumption here. First, that we will only be training on text data, if we take into considerations all the videos and audios shared I am quite sure we would have one or two orders of magnitude more of data. Second, that it even matter, there has been some early research showing that training on the right data improves prediction more than training on more data (which intuitively makes sense, training on papers and book is much more useful than training on youtube comments). Additionally, lots of the improvement in quality are because of RLHF, which is basically manual human labeling. And last, my guess is that improvements in architecture are what will unlock the next level of performance, not just scaling.