I got excited reading the article about releasing the training data, went to their HF account to look at the data (dolma3), and the first rows? Text scraped from porn websites!
replies(2):
That said, I like to think that if it were my dataset, I would have shuffled that part further down the list so it didn’t show up in the HF preview.