
425 points | karimf | 1 comment
orena No.45664511
How many epochs did you train for? 100k hours is not a lot for an LLM; feels like the bitter lesson.
replies(1): >>45665933 #
vvolhejn No.45665933
I trained for 1M steps (batch size 64, block size 2048), which is enough for the model to more or less converge.
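
For a sense of scale, the token count implied by those numbers can be worked out directly. A minimal sketch is below; the tokens-per-hour rate is a hypothetical placeholder, since the thread doesn't state the tokenizer's frame rate, so the epoch estimate is illustrative only.

    # Back-of-the-envelope: tokens seen during training vs. dataset size.
    steps = 1_000_000        # training steps, from the comment above
    batch_size = 64
    block_size = 2048        # tokens per sequence
    tokens_seen = steps * batch_size * block_size
    print(f"tokens seen: {tokens_seen:.2e}")   # ~1.31e11 tokens

    # Epochs depend on how many tokens 100k hours of data turns into,
    # which depends on the tokenizer's frame rate. The value below is
    # an ASSUMPTION (~12.5 tokens/s), not a number from the thread.
    tokens_per_hour = 45_000
    dataset_tokens = 100_000 * tokens_per_hour
    print(f"approx. epochs: {tokens_seen / dataset_tokens:.1f}")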

It's also a tiny model by LLM standards, with 150M parameters. The goal wasn't really to reach state of the art but to show how the performance of a single language model architecture can be vastly different when you just change the tokenizer.
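
A rough sketch of what "same architecture, different tokenizer" means in practice is below; the config fields, vocabulary sizes, and tokenizer names are illustrative assumptions, not details from the post.

    # Sketch: one model architecture, several tokenizers. Only vocab_size
    # (and hence the embedding/output matrices) differs per run; every
    # other hyperparameter is held fixed. All names/sizes are illustrative.
    from dataclasses import dataclass, replace

    @dataclass
    class ModelConfig:
        n_layer: int = 12
        n_head: int = 12
        d_model: int = 768        # roughly the 150M-parameter regime
        block_size: int = 2048
        vocab_size: int = 0       # set per tokenizer

    vocab_sizes = {               # hypothetical tokenizers under comparison
        "bpe_32k": 32_000,
        "bytes": 256,
        "audio_codec": 2_048,
    }

    base = ModelConfig()
    configs = {name: replace(base, vocab_size=v)
               for name, v in vocab_sizes.items()}
    # Each config then gets the same training run (1M steps, batch size 64),
    # so any performance gap is attributable to the tokenizer alone.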

replies(1): >>45666063 #
singularfutur No.45666063
To get close to state of the art, how many parameters would be needed with your approach?