Discussion (7 Comments)

gertlabs•about 3 hours ago
GLM-5V-Turbo is a model I wanted to like due to its speed and API reliability, but it didn't perform well in our coding and reasoning testing. More recent open-source models have made it obsolete. GLM 5.1 is so many light years ahead of it on everything except speed that I'm not sure why it's still being served.

Comprehensive evaluation results at https://gertlabs.com/rankings

gruez•about 1 hour ago
>but it didn't perform well in our coding and reasoning testing

>Comprehensive evaluation results at https://gertlabs.com/rankings

But if you go to the linked site, it seems like the only thing that's part of the evaluation is how well the models play various games? I suppose that counts as "reasoning", but I don't see how coding ability is tested.

gertlabs•about 1 hour ago
"Games" is loosely defined here, as we run the bench across hundreds of unique environments. For some, the models write code to play a game, either one-shot or via a harness where they can iterate and use tools. For others, they play directly, making a decision on each game tick. Some are real-time, giving the models a harness where they can write code handlers or submit decisions to interact with the environment directly.

Coding is what we test for most heavily. Testing this via a game format (instead of correct/incorrect answers) allows us to score code objectively, scale to smarter models, and directly compare performance to other models. When we built the first iteration last year, I was surprised by how well it mapped to subjective experience with using models for coding. Games really are great for measuring intelligence.
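The tick-based setup described above can be sketched roughly like this. This is a minimal toy harness for illustration only; the class names, the "agent as callable" interface, and the scoring rule are all invented here and do not reflect gertlabs' actual implementation:

```python
# Toy sketch of a tick-based game-evaluation harness. All names and the
# scoring rule are hypothetical, not gertlabs' real code.
from dataclasses import dataclass
from typing import Callable


@dataclass
class GameState:
    tick: int = 0
    score: int = 0
    done: bool = False


# An "agent" is any callable mapping the current state to an action string;
# a model-backed agent would call an LLM API here instead.
Agent = Callable[[GameState], str]


def run_episode(agent: Agent, max_ticks: int = 100) -> int:
    """Advance the game one tick at a time, asking the agent for a
    decision on each tick, and return the final score."""
    state = GameState()
    while not state.done and state.tick < max_ticks:
        action = agent(state)
        # Toy environment rule: "act" earns 2 points, anything else 1.
        state.score += 2 if action == "act" else 1
        state.tick += 1
    return state.score


# A trivial scripted agent standing in for a model-backed one.
def greedy_agent(state: GameState) -> str:
    return "act"
```

The appeal of this shape is that the score is fully objective: the harness never judges the agent's code or reasoning directly, only the outcomes it produces in the environment.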

BugsJustFindMe•about 1 hour ago
This may be a strange request, but is it at all possible to include Cursor's Composer models in your tests?

gertlabs•29 minutes ago
I am curious about the model, but for the most part, we have access to the same models that you do and only test models with standalone API releases.

XYen0n•about 2 hours ago
GLM-5.1 does not support image input.

scotty79•about 2 hours ago
I think the point is to use them both, with GLM 5.1 delegating vision tasks to GLM-5V-Turbo.
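That delegation pattern can be sketched as a simple dispatcher. Everything here is hypothetical: the function names and the dispatch-on-input-type rule are placeholders, not a real z.ai API, and the model calls are stubbed out:

```python
# Hypothetical sketch of delegating vision work from a text-only model to a
# vision-capable one. The dispatch rule and names are invented placeholders.
from typing import Union


def describe_image(image_bytes: bytes) -> str:
    # Stand-in for a call to a vision-capable model (e.g. GLM-5V-Turbo).
    return f"<caption for {len(image_bytes)} bytes of image data>"


def answer(query: Union[str, bytes]) -> str:
    if isinstance(query, bytes):
        # Delegate vision work, then reason over the resulting caption
        # with the text model (stand-in for GLM 5.1).
        caption = describe_image(query)
        return f"reasoning over: {caption}"
    return f"reasoning over: {query}"
```

The trade-off of this pattern is that the text model only ever sees a lossy caption, so anything the vision model omits is invisible to the downstream reasoning step.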

muddi900•about 2 hours ago
z.ai will use quantized models in off hours. Buyer beware.

_aavaa_•about 2 hours ago
Do you have proof of this?

desireco42•28 minutes ago
I hear a lot of people complaining, but I'm on their Max plan, I never hit limits, use it non-stop, and overall it has been a fantastic experience.

yogthos•about 2 hours ago
I have a subscription and have not seen any difference in performance during on/off hours. What exactly are you basing this on?