Definitely weird results, but I feel there are too many variables to learn much from them. A couple of things:
1. The author mentioned that tokenization causes something as minuscule as a " " at the end of the input to shatter the model's capabilities. Is it possible that other slight formatting changes in the input could instead raise capabilities? (Rough tokenization sketch below.)
2. Temperature was 0.7 for all models. What if it wasn't? Isn't there a chance one or more models would perform significantly better with a higher or lower temperature? (See the temperature-sweep sketch below.)
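
For what it's worth, here's roughly what I mean by point 1. This is just a minimal sketch; the tiktoken library and the cl100k_base encoding are my own assumptions, not necessarily what the author used, but it shows how a trailing space changes the token sequence the model actually sees:

```python
import tiktoken

# Assumption: an OpenAI-style BPE tokenizer; the author's models may tokenize differently.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["What is 2+2?", "What is 2+2? "]:
    tokens = enc.encode(text)
    # The trailing " " ends up as its own token (or merges differently),
    # so the model is conditioned on a sequence it rarely saw during training.
    print(repr(text), "->", tokens)
```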
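And for point 2, the kind of sweep I'd want to see before comparing models at a single fixed temperature. This uses the OpenAI Python client purely as a stand-in; the model name and prompt are placeholders, not the author's setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompt = "Solve: 17 * 24 = ?"  # placeholder task, not from the post

for temperature in [0.0, 0.3, 0.7, 1.0]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    # Compare answers across temperatures to check whether 0.7 was a fair choice.
    print(temperature, "->", response.choices[0].message.content)
```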
Maybe I just don't understand this stuff very well, but it feels like this post is only 10% of the work needed to get any real meaning out of it...
replies(1):