Interesting idea; this benchmark maps fairly closely to the kinds of output I typically ask LLMs to generate for me day-to-day.
/vote: Your prompt will be answered by four random, anonymous models. You pick the one you prefer and crown the winner, tournament-style.
/leaderboard: See which models are currently winning, as determined by voter preferences (a toy sketch of how votes could roll up into rankings follows after this list).
/play: Iterate quickly by seeing four models respond to the same input and pressing space to regenerate the results you don’t lock in.
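For the curious, here is a simplified sketch of how pairwise picks like these could be tallied into a leaderboard. This is not our production ranking code; the win-rate approach and the Vote shape below are assumptions purely for illustration.

    // Hypothetical sketch only: illustrates one simple way votes could
    // roll up into rankings (win rate per model), not the site's actual method.
    type Vote = { winner: string; losers: string[] };

    function computeLeaderboard(votes: Vote[]): { model: string; winRate: number }[] {
      const wins = new Map<string, number>();
      const appearances = new Map<string, number>();

      for (const { winner, losers } of votes) {
        // Every model shown in the matchup counts as one appearance.
        for (const model of [winner, ...losers]) {
          appearances.set(model, (appearances.get(model) ?? 0) + 1);
        }
        wins.set(winner, (wins.get(winner) ?? 0) + 1);
      }

      return [...appearances.keys()]
        .map((model) => ({
          model,
          winRate: (wins.get(model) ?? 0) / appearances.get(model)!,
        }))
        .sort((a, b) => b.winRate - a.winRate);
    }

    // Example: two rounds, four anonymous models each, one winner per round.
    const votes: Vote[] = [
      { winner: "model-a", losers: ["model-b", "model-c", "model-d"] },
      { winner: "model-c", losers: ["model-a", "model-b", "model-d"] },
    ];
    console.log(computeLeaderboard(votes));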
We were especially impressed by the quality of DeepSeek and Grok, and by the variance between categories (judging by the results so far, OpenAI is very good at game dev but seems to struggle everywhere else).
We’ve learned a lot, and are curious to hear your comments and questions. Excited to make this better!