How many epochs did you train for? 100k hours is not a lot for an LLM; feels like the bitter lesson applies here.
replies(1):
It's also a tiny model by LLM standards, at 150M parameters. The goal wasn't really to reach state of the art but to show how vastly the performance of a single language model architecture can differ when you just change the tokenizer. A rough sketch of that setup is below.
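For illustration only, here's a minimal sketch of the general idea (hold the model architecture fixed, vary only the tokenizer), assuming a Hugging Face-style text setup; the tokenizer names and model sizes are hypothetical, not from this thread:

```python
# Illustrative sketch (hypothetical setup, not the authors' code):
# build the same small GPT-2-style architecture twice, changing only
# the tokenizer, then train each under an identical recipe and compare.
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

def build_model(tokenizer_name: str) -> GPT2LMHeadModel:
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    config = GPT2Config(
        vocab_size=len(tok),                 # the only run-to-run difference
        n_embd=768, n_layer=12, n_head=12,   # architecture held fixed
    )
    return GPT2LMHeadModel(config)

model_a = build_model("gpt2")               # BPE tokenizer
model_b = build_model("bert-base-uncased")  # WordPiece tokenizer
```

Since the embedding and output layers scale with vocabulary size, the parameter counts differ slightly between runs even with the backbone fixed, which is worth reporting alongside any comparison.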