
Discussion (7 Comments)
Comprehensive evaluation results at https://gertlabs.com/rankings
>Comprehensive evaluation results at https://gertlabs.com/rankings
But if you go to the linked site, it seems like the only thing that's part of the evaluation is how well the models play various games? I suppose that counts as "reasoning", but I don't see how coding ability is tested.
Coding is what we test for most heavily. Testing this via a game format (instead of correct/incorrect answers) lets us score code objectively, scale to smarter models, and directly compare performance across models. When we built the first iteration last year, I was surprised by how well it mapped to my subjective experience of using models for coding. Games really are great for measuring intelligence.
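The actual gertlabs harness isn't shown in this thread, so the following is only a hypothetical sketch of the general idea being described: instead of grading code as right or wrong, run each model-written agent in a game and score it by performance, which yields an objective number that two models can be compared on directly. The game, agent names, and scoring rule here are all illustrative assumptions, not the site's real benchmark.

```python
import random

def play_guessing_game(agent, lo=1, hi=100, max_turns=10, seed=0):
    """Run one game: the agent must find a hidden number.

    `agent(lo, hi, feedback)` returns a guess; `feedback` is
    'low', 'high', or None on the first turn. Returns the turn
    count on success, or None if the budget is exhausted.
    """
    rng = random.Random(seed)
    secret = rng.randint(lo, hi)
    feedback = None
    cur_lo, cur_hi = lo, hi
    for turn in range(1, max_turns + 1):
        guess = agent(cur_lo, cur_hi, feedback)
        if guess == secret:
            return turn
        feedback = 'low' if guess < secret else 'high'
        if guess < secret:
            cur_lo = guess + 1
        else:
            cur_hi = guess - 1
    return None

def score(agent, n_games=50, max_turns=10):
    """Objective score: fraction of games solved within the turn budget."""
    solved = sum(
        play_guessing_game(agent, max_turns=max_turns, seed=s) is not None
        for s in range(n_games)
    )
    return solved / n_games

# Two hypothetical "model-written" agents, compared head-to-head.
def binary_agent(lo, hi, feedback):
    return (lo + hi) // 2  # binary search: at most 7 guesses for 1..100

def linear_agent(lo, hi, feedback):
    return lo  # naive strategy: scan upward from the low bound

print(score(binary_agent))
print(score(linear_agent))
```

The point of the format is visible even in this toy: the binary-search agent solves every game within the budget while the linear one rarely does, and neither score depends on a human judging the code's style or "correctness", only on how the code performs.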