> But in recent months I've spoken to other YC founders doing AI application startups [...] in different industries, on different problem sets.
Maybe they should collectively create a benchmark called "YC founders". Gather various test cases. Never make it public. And use it to evaluate newly released models.