>European versions of ARC
But this is an image-like benchmark. Has anyone looked at the article about the EU-ARC, what is the difference? Why can't you measure it on a regular one?
I glanced through it, didn't find it right away, but judging by their tokenizer, they are learning from scratch. In general, I don't like this approach for the task at hand. For large languages, there are already good models that they don't want to compare with. And for low-resource languages, it is very important to take more languages from this language group, which are not necessarily part of the EU
replies(2):