How do all these models compare to each other?
Are there any metrics that can tell me how much better or worse LLaMA is than GPT-3?
What does it even mean to be better?