Discussion (143 Comments) on Hacker News
All it does is pull a JSON from their main table page and parse out the fields I care about (coding).
There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.
Current partial output
run it like so
official repo where it lives: https://github.com/day50-dev/aa-eval-email
some key takeaways:
* open models are on about a 4-7 month lag right now depending on how you want to measure it
* if this keeps up, you might see an open-weights model doing claude fable 5 level work before the new year.
if people sign up for the free mailing list (that just does this) I'll go and put it back on ... emails when new model evals drop - it was pretty useful.
- GPT 5.5 is consistently the best, an opinion that gets me constant downvotes here from the Anthropic Marketeer strike force...
- China is going to eat the US's lunch on AI
- What have European universities and companies been doing? It's as if, in a parallel past/future, Nikola Tesla and Edison had created flying Cyberpunk machines while European researchers were getting together to request EU funds to investigate how to breed faster horses.
- If Zuckerberg could be fired, after spending a total of $235 billion on AI and having NOTHING to show for it... should he be fired?
Mistral is clearly currently not competing for Frontier Model. Whether this is due to a lack of VC Funds or a lack of technical ability or the former arising from the latter would be interesting to know.
The top models are from startups. Among the FAANG, only Google managed to get a Frontier model, and they literally invented the architecture and have more money than they can possibly spend to throw at the problem. Facebook shows that even ungodly amounts of money don't get you there though.
So why did no EU-based startups succeed while two US startups did? I agree that that's a very important question the EU should ask. The Internet revolution was driven by US companies, and now AI will be as well, with Chinese Open Weights mixed in. The EU consistently cannot turn its considerable economic output into fast-moving tech firms.
And Zuck hasn't spent that much on AI yet. Half of that is projected spending for 2026.
As to whether it's all for nothing, Q1 2026 revenue was up 33% over Q1 last year, driven largely by...better AI-driven ad targeting. So the spending doesn't seem that crazy to me.
Also what are they building it for? I'd think it's to serve ads better or something like that. Maybe Muse Spark fits facebook's needs perfectly...
But if that's your thing, here you go: https://github.com/day50-dev/aa-eval-email/commit/1853be6461...
add an argument (any argument) and it will be sorted as you specified. It just works as a toggle flipping the order ... so literally any string will do.
The original link has been updated accordingly with the new code.
That aside, this is a good script you're running. Thanks.
Set up a fresh new large monitor. Open CLI. Run command. Watch output at the bottom of your screen. Keep watching the bottom of your screen for the rest of the day.
Sure you can tile windows and it helps but come on. Just have the command/input section in the bottom and the “output” on top. Keep the command bit on the bottom.
I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.
Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient.
Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.
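To make the token-to-cost conversion concrete, here is a minimal sketch that turns an average output-token count into a per-request cost. The function name `request_cost` is hypothetical. The token counts are the ones quoted above, but both prices are made-up placeholders, not real rates for either model, so only the shape of the comparison carries over.

```bash
# request_cost TOKENS USD_PER_MILLION
# Prints the cost in USD of TOKENS output tokens at the given price.
request_cost() {
  awk -v tokens="$1" -v price="$2" \
    'BEGIN { printf "%.4f\n", tokens * price / 1000000 }'
}

# Token counts come from the comment above; prices are hypothetical placeholders.
request_cost 16000 30.00   # GPT 5.5 xhigh at a made-up $30 per 1M output tokens
request_cost 42000 3.00    # GLM 5.2 at a made-up $3 per 1M output tokens
```

With these placeholder prices the model that spends more tokens is still cheaper per request (0.1260 vs 0.4800), which is the point being made; plug in the real rates from your provider to see whether it holds for you.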
If you want reasonable token usage, you need to run GLM 5.2 at High. There is little drop in quality from Max to High (for most tasks), and it cuts token usage by 2 to 2.5x. GLM 5.2 Max is really something you only need for complex tasks.
In essence, GLM 5.2 is Opus 4.8's little brother, at a way, WAY cheaper price.
There has been really no training on Opus models going on, really, none I tell you! /sarcasm
Sarcasm, considering the source of their own training data?
To the point where I stop it and simply tell it to “start writing code, you can work it out as you go along”
Seems writer's block also affects LLMs
Just output the code and we’ll work through it!
I feel similarly about having codex review claude’s plans. I don’t think I’ve ever seen it catch a major issue. It just points out things that would have inevitably been addressed during implementation anyway.
Another thing I tell Claude to do is to not guess, but look at documentation, it messes up a lot less, might use some tokens reading docs, but at least it has a higher success rate code wise.
It's clear it was the vibe coding model: like no other model before, it fully turned you into its assistant instead of the other way around.
Per AA, while K2.7 Code is roughly on par w/ K2.6 in terms of intelligence, it uses half the output tokens to get there.
[1] https://z.ai/blog/glm-5.2
GLM5.2 ends up being far more expensive than I thought it would be when I tried it on openrouter. I ground through $5 USD worth of tokens quite quickly.
And this was high, not max.
https://artificialanalysis.ai/agents/coding-agents?coding-ag...
I thought I was "holding it wrong" until DeepSWE came along -- personally it seems to match my own experiences pretty well. Really makes me wonder how legitimate some of the internet noise is about open models. There's surely some use cases for them, not everything needs the absolute frontier (GPT5.5 on low is awesome), but if you want to be near the frontier everyone needs to be honest about the fact that we're only talking about Opus, Fable, GPT5.5.
openai, google and anthropic subscriptions are not available with privacy.
https://deepswe.datacurve.ai/
Fable 5 is cool and all, but we have not yet seen GPT-5.6.
It's easily 4x the cost of DeepSeek V4 but I didn't actually feel the results were that much better. I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.
Having better luck with MiniMax M3, from a cost/benefit ratio.
With a good harness, that's my favorite model for any personal project. I use Opus 4.8 at work because i don't have to pay for it and of course I love it, but DeepSeek is like 80% there for one tenth of the price.
GPT can find fault in everything and anything including its own work.
This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models.
Looking at openrouter [1], some of the cheaper offerings are for quantized models. Not sure how much intelligence is lost in quantization. And they are not 3 times cheaper. Where did you find 3x lower prices for APIs? I am considering skipping open router and using them directly for that price.
edit:
I see, croft [2] 8bit for $0.50/$0.08/$2.20
[1]: https://openrouter.ai/z-ai/glm-5.2
[2]: https://ai.nahcrof.com/pricing
I do not have GLM 5.2 numbers because the whole default max setting is overkill. But GLM 5.1 numbers had it at 12x cheaper than API rates, and about 2.5x more tokens vs Z.ai's own subscription service.
Yes, it's FP8, but let's be honest, do we know for sure that even Z.ai runs at FP16? I learned a long time ago with Claude and Codex how much cheating happens on model levels, even from the big boys.
(there's a table which shows comparison between vendors)
Also, it seems there's a general one as well (for all kimi models?): https://github.com/MoonshotAI/Kimi-Vendor-Verifier
Wasn't this released like 2 days ago? Everyone is still evaluating and playing around with it, and things like the submission are just starting to come out. Give it some days at least before jumping to conclusions, ideally weeks.
I've tried a number of these, and the learning curve is very steep compared to "install Claude Code and pay $100/mo". There is no way saving me $50/month matters compared to figuring that out.
https://docs.z.ai/devpack/tool/claude
Here's my setup. I add this to my .bashrc
export ZAI_API_KEY="your_key_here"
alias claudez='ANTHROPIC_AUTH_TOKEN="$ZAI_API_KEY" ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic" ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]" ANTHROPIC_DEFAULT_SONNET_MODEL="glm-4.7" ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.7" claude'
Then I just run claudez
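A function variant of the alias above, as a sketch: it uses the same environment variables and values, but refuses to start when the key is missing and keeps the overrides scoped to a single invocation. The guard and its error message are additions for illustration, not part of the original setup.

```bash
# Same variables and values as the alias; the ZAI_API_KEY check is an addition.
claudez() {
  if [ -z "${ZAI_API_KEY:-}" ]; then
    echo "claudez: ZAI_API_KEY is not set" >&2
    return 1
  fi
  # The assignments below apply only to this one claude invocation.
  ANTHROPIC_AUTH_TOKEN="$ZAI_API_KEY" \
  ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic" \
  ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]" \
  ANTHROPIC_DEFAULT_SONNET_MODEL="glm-4.7" \
  ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.7" \
  claude "$@"
}
```

Use it in place of the alias, not alongside it, since bash expands an alias named claudez before it looks up the function.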
pro tip the same thing works with deepseek https://api-docs.deepseek.com/guides/anthropic_api
Even more pro tip: Claude Code can set this up for you haha
Unless this were a massive differentiator, people aren't going to be "talking about it" the way GP suggests!
There's ZCode (https://zcode.z.ai). Which is like the Codex App.
That's as "easy" as it is for non-devs that you're complaining about.
https://github.com/QuantiusBenignus/Zshelf/discussions/2
Not accounting for hardware, of course :)
Nvidia GPUs are much more efficient than Apple hardware for inference (and training).
Not accounting hardware in my costs, since I didn’t buy my hardware for running models. Running models is just something it can do in addition to what I got it for.
link?
> Why
imho everything but opus produces unusable code (fable was even better...), eg gpt5.5 seems to write the absolute worst code that still technically solves the problem; tbh I'd be totally willing to trade "raw intelligence" for "code taste"
more labs need to figure out whatever anthropic did to destroy everybody else on frontiercode bench
That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x.
In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS.
Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.
Even the local models I run on my Mac are getting surprisingly good at that now.
Honestly it's good enough that I feel comfortable recommending a Z.AI sub + a $20/mo OpenAI sub for all but the most AI pilled multi-orchestrators, or the die hard Claude fans. GLM writing + GPT reviewing/debugging feels pretty unlimited and minimally worse than just doing everything in GPT with the $200/mo plan.
Discovered today that they set reasoning effort to max by default. So that’s probably why
Review the commits with both Claude and GPT 5.5 Xhigh. You can see that Fable is still sloppy(er) compared to GPT. You can test it the other way around as well (drive the dev with GPT and review with GPT and Claude). You get the same result. Claude has an edge, though, and that’s in building more beautiful user interfaces.
It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.
Not everyone is willing (or even legally able) to send their trade secrets to OpenAI or Anthropic
We’re approaching a world where running a primer frontier model is possible on a workstation, probably will have something under $30k that looks like a desktop for Nvidia’s next generation. It sounds expensive, until you look at your Anthropic bill.
The unit economics for open models are similar to cloud computing. You can save a ton on expenses by buying the hardware, but it requires a lot of in-house expertise, and you get the most value if you keep the system operating around the clock. The big kink is that open models are usually 2 quarters behind the frontier, and your competitors are probably trying to get access to Mythos.
But prices are changing rapidly, and not for the better
Your usage will peak during certain timezone work hours (even if you are a huge multinational company, most of your engineers/users tend to be from only a few locations), so then you have a bunch of GPUs doing nothing the rest of the day. Especially with latency-sensitive stuff, this is a decades-old tradeoff problem; it's not unique to LLMs.
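The utilization tradeoff can be put into rough numbers. This is a minimal sketch with a hypothetical function name and a made-up amortized hardware cost of 400 cents per hour; it only shows how idle time inflates the price of each hour the hardware is actually busy.

```bash
# effective_hourly_cost COST_CENTS_PER_HOUR UTILIZATION_PERCENT
# Prints the cost in cents of each hour the hardware actually does work.
# Uses integer arithmetic, so results are rounded down to whole cents.
effective_hourly_cost() {
  local cost_cents=$1 utilization=$2
  if [ "$utilization" -le 0 ] || [ "$utilization" -gt 100 ]; then
    echo "utilization must be between 1 and 100" >&2
    return 1
  fi
  echo $(( cost_cents * 100 / utilization ))
}

# 400 cents/hour is a hypothetical amortized cost, not a real quote.
effective_hourly_cost 400 100   # 400: busy around the clock
effective_hourly_cost 400 25    # 1600: busy about 6 hours a day
```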
Would need to be a pretty determined medium biz
I'm pretty skeptical, especially given typical utilization patterns. Do you have numbers, or is this just vibes?
Years.
Even Microsoft said they don't have enough for Github and need to call Amazon.
Getting even a few at decent prices is hard. Unless the shortages ease...
am i missing something?
We have no proof in either direction; it's not like we have access to their financial numbers in detail.
And the pricing itself muddies the water, as input tokens that are already in the KV cache are practically free for the provider, whereas other tokens are expensive. So they could still make money overall thanks to people having multi-turn conversation (and as such, paying multiple times for the same token), but lose money on actual compute done.
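A rough sketch of the "paying multiple times for the same token" effect, assuming every turn resends the full history and adds a fixed number of new tokens (a simplification that folds the model's replies into that per-turn figure). The function name is hypothetical and no real provider's pricing is modeled.

```bash
# billed_input_tokens TOKENS_PER_TURN TURNS
# Total input tokens billed across a conversation when every turn
# resends the whole history so far.
billed_input_tokens() {
  local per_turn=$1 turns=$2
  # Turn k sends k * per_turn tokens, so the total is per_turn * turns*(turns+1)/2.
  echo $(( per_turn * turns * (turns + 1) / 2 ))
}

billed_input_tokens 1000 10   # 55000 billed, though only 10000 tokens are distinct
```

Under these assumptions, most of the billed input in a long conversation is history the provider may already hold in cache, which is why the per-token price alone says little about margins.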
> there are lots of third party hosting services that will still run at breakeven/profit.
How can you be sure that they are making profit directly from token price, and are not billing at marginal cost (i.e. electricity price, without counting the cost of the GPUs) and aiming to make a profit later on from the valuable training data that they are collecting in the process?
[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...
This means, that models are losing more and more general and domain-specific knowledge.
Look at those graphs on Artificial Analysis, GLM-5.1 still performs similarly or better:
AA-Omniscience Accuracy: https://i.snipboard.io/5DYmpx.jpg
IFBench: https://i.snipboard.io/74kg0R.jpg
I still feel like models are not getting any smarter for a few months already, they just changed their training to be focused more on some areas than others, so shifting the intelligence from one place to another, not necessarily increasing the overall intelligence or "AGI" score.
It also means that if they actually trained with vision, they'd be on par with Anthropic models as vision seems to improve model performance across the board even for non-vision tasks.
With open-weights LLMs, it is affordable to use many different models, each for whatever it is best at.
Moreover, for analyzing "UIs, photos, screenshots, etc." there are small models that can be run locally on smartphones or laptops, e.g. IBM granite-vision-4.1-4B, certain Google Gemma 4 variants and certain Qwen variants, whose output you can use as input for a big LLM, in order to accomplish some more complex task.
Excited to see if this turns out to be an Open Weight Opus 4.5 or better.
I've had models that benched poorly but performed great. And I constantly see models at near the top of AA, which are terrible.
There doesn't necessarily seem to be a lot of overlap between benchmarks and real world usage. (Let alone common sense!)
As far as they go, though, these harder benchmarks match my experience more closely:
https://deepswe.datacurve.ai/
and https://cognition.ai/blog/frontier-code
Where we see "top" models drop way down in score when given longer tasks.
That being said, I've had a reasonably pleasant time with GLM-5.2 so far. (And have had an OK time with DeepSeek as well.)
By the time I'm done testing all the Chinese models, they'll be obsolete :)
I haven't extensively used 5.2 yet, but it seems a lot better.
Qwen 3.6 27b is already pretty good, but it should be possible to get a better option now that runs on the same hardware, right?
GLM-5.2 is already close to Opus-4.7 level:
https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...
QWEN3.6 27b is pretty good, but i can still notice some spots where it's not as good as the frontier models.
I work on mid-sized projects currently (200k to 1M lines of code).
That's the one benchmark that allows LLMs to answer "I don't know" and punishes them for trying to bullshit their way through the questions
DeepSeek V4 has been quite amazing in our workloads and it operates at a fraction of the cost. I have not tried GLM 5.2 but it seems that it hits a sweet spot.
Their servers are melting though - getting more timeouts etc
I feel like I threw 15 dollars in the sea. I'm getting rate limited after 3-4 prompts. You get way less value than just paying 25 dollars for Claude or OpenAI models.
My workflow is usually:
- read file. I want to achieve X, how do? Do not implement anything.
- I would do a, b and c
- sketch a brief implementation of your suggestion
- <code> (not writing files yet)
- instead of your approach x, wouldn't it make sense to instead do z? What would that look like?
- <code>
- nice, implement this
- starts writing files, run tests, etc.
You'll see that it quickly gives up. Thing is, they seem to count cached hits as if they were the non-cached tokens.
I won't be subscribing again, that's for sure. I am not paying iPhone money for a Xiaomi.
That is unfortunate...
To everyone on Hacker News: I am curious what agent harness you are using it with.
Previously I was using opencode, and then I switched to Opencode + obra/superpowers and writing custom skill.md files for it. Things take more time and I have to intervene more, but I have found the result to work better.
Now I have also started using oh-my-pi and found it to be faster than Opencode.
I am unsure how much of the difference is real and how much is placebo, but what is your opinion on the best agent harness for GLM 5.2?
Somebody wrote [1]: "I am never touching Minimax or GLM again. Their APIs had constant outages and I had to restart my runs multiple times — after burning money on the runs that failed midway." and I 100% agree.
The model might be good, but if the API is so bad, it's effectively useless.
[1]: https://kasra.blog/blog/i-spent-1500-seeing-if-llms-could-ha...