Discussion (106 Comments) | Original on HackerNews
By definition, a single prompt won't capture the complexity of a software project. Ergo, what you'll get is a series of assumptions made by the model based on preexisting code in its training corpus.
I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.
I'd rather see performance in agent loops against human-defined objectives, where it can be verified to stick to defined guardrails and continue without drift until its objectives are complete.
I'd also like to see it identify bugs and potential performance increases by identifying existing code and suggesting refactors based on context it can pick up about the particular use case you are trying to create.
These are way more valuable metrics than "hey build X"
Right, model intelligence defines the scope of things they can one shot
I also suspect that users naturally calibrate to a model's useful scope, gradually getting positive/negative feedback and gradually making their requests bigger/smaller than before
Similar to how ML was all the hype about 12 years ago and then it submerged again for a couple of years.
On another, being able to reliably tackle minor tasks with no handholding is very valuable in itself. Sometimes implementation details are important, but often, the most important thing is to Get It Done.
Guardrails/conventions should be enforced in linters, formatters, static analysis tooling; not specs/prompts.
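As a rough illustration of that point, a convention can be checked deterministically instead of being requested in a prompt. The sketch below uses only Python's standard `ast` module. The rule itself (every function needs a docstring) and the name `find_missing_docstrings` are made up for illustration and do not come from the thread.

```python
import ast


def find_missing_docstrings(source: str) -> list[str]:
    """Return the names of functions in `source` that have no docstring."""
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        # Hypothetical house rule: every (async) function must be documented.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                missing.append(node.name)
    return missing


# Example input standing in for model-generated code.
sample = '''
def documented():
    """Explains itself."""
    return 1

def undocumented():
    return 2
'''

violations = find_missing_docstrings(sample)
print(violations)
```

A check like this can run in CI or as a pre-commit hook, so the agent's output is rejected mechanically regardless of how the prompt was worded.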
Additionally, with "Hey build X" nobody is happy with the methodology and people rightfully complain about the setup.
Using your suggestion the methodology would require a lot of presumptions & arguments regarding why you choose it and think it relevant to people.
Either people would not "get" it quickly enough or would disagree/not be interested in the setup because it's not how they use LLMs.
The agentic engineering paradigm is just a narrative trend pushed by AI companies to get people to 10x their token consumption per prompt. It plays into people's laziness and addiction to dopamine too, causing addict-like behavior in people that fall prey to this trend.
If I do that, I'm literally slower than just doing the change without sufficiently specifying it to the model.
I can see how a junior dev or generally someone that's not particularly knowledgeable about the language or framework they're working with may benefit from such usage, but for experienced people there is very little value in that approach.
I say this because I've just had to face this decision this month with Copilot introducing usage-based billing. I attempted to scale back my usage, first with non-Opus models: output essentially became discardable as it continually hallucinated non-existent fields in API responses, etc. Then by scoping the changes smaller and smaller, until I ultimately gave up and reduced usage to just generating tests.
"Well obviously you provided better follow-up prompts to the one that came out better."
Also, nothing about human-provided plan files and guardrails precludes the one-shot benchmark test. Heavens, I almost said "real coding," but in "real agentic program creation" you'd obviously be doing multi-turn interaction with the agent, but how can you provide a fair test when the model's output n determines your n+1 response?
The business guy would say "hey build me this and that" and would get _something_ to show off.
An engineer will have a long conversation with an LLM about the exact requirements, tech stack, tradeoffs. He would understand what is built, how it is built, and refine on the fly until he gets something sensible.
It won't be as fast as "build this", but the result will be much better and more maintainable.
For the engineering workflow, you don't need Fable. Any model equivalent to or better than Sonnet 4.6 would do. Yes, sometimes it will hallucinate, sometimes it'll be wrong, but it's our job as engineers to correct it and have full ownership of the result.
Instruction following has been nailed down for years, and while there are of course metrics that continue to improve as the frontier advances (for example, the ability to continue following the original instructions even as context grows), you can't really get that much better at performing a list of instructions as-written if the instructions are precise enough that there's no wiggle room for interpretation (which seems to be what you are describing).
For example, one of the things that got me the most excited for Fable 5 was its ability to work for over eight hours straight on a single instruction and seemingly faithfully the entire time. That was something I observed personally after trying out the same workflow that runs for maybe two or three hours with Opus and then still needs followups. Fable needed no followups. That's a game changer for me compared to the prior state of the art.
That kind of stuff is going to end up being the most beneficial to people who are touching the edges of their knowledge or even exploring completely new areas. And that type of work is exactly the kind of work that makes agentic coding so powerful, even as much as it gets harder to judge the quality of the work when you lack the skills yourself. It's a good thing that the quality increases across the board, even for skilled practitioners.
For example, even people who know how to write inference engines or how matmul kernels work or how to optimize model architecture can't always predict just the sheer breadth of things agents can try to improve performance, and sometimes you get over some wall and reach a completely different optimum that you just wouldn't have reached in any reasonable amount of time by applying traditional knowledge even if you're an expert in the field.
That kind of stuff is amazing. And that's exactly the kind of stuff that one-shot prompting is testing for. It's kind of like testing for the model's "innovation", as much of an oxymoron as that is.
Since Opus 4.6 I've seen later Anthropic models being more and more capable on one hand, but also less useful on multi turn open tasks.
It feels like with each model they are more and more prone to go "their own way" and jump into the implementation as soon as they can.
I can't help but blame it on benchmarks and fine-tuning around prompt-to-solution work.
Running a single one-shot prompt is not a benchmark, nor is it representative of any sort of real-world usage.
Most agent usage is collaborative, so you need to test things like reliability (when I delegate a task, does it complete it without, for example, making up test results) and steerability (does it obey my instructions or does it just do what it thinks is best).
I think the test you describe (collaborative, task delegation, task completion, TTD, steerability) is a great format for a future test that I will definitely try out.
Appreciate you sharing the results of your tests though!
Thinking about it, I would say that the majority of agentic work I do, by a long shot, is subagents which are launched from the main session, using a prompt of its choosing. Those could be considered short versions of these fully autonomous tasks.
You make a very strong claim at the end that the hype is mostly real, and making it clear to what extent your claim holds should help the reader.
- it takes its sweet time to get code rolling, not the fastest model by any means
- it strays a lot during discovery/planning but then corrects
- it's not steering friendly, as it hallucinates things that it doesn't follow later on
- its output is quite good
A sample use case: I was optimizing rendering on a Swift+Zig codebase. It choked on 5k data entries.
GLM 5.2 spent 20 minutes building the benchmarks and getting data out, which frustrated me, so I blocked non-editing tool access and went AFK. After approx. 30 minutes I found that it had used the already-made benchmarks and some "conclusions" to optimize 3 choke points. The output pointed out that it couldn't validate its suspicions and asked for more data.
The implementation worked well; it was idiomatic and non-intrusive. I would even say that it was more idiomatic than GPT 5.5's efforts on the same repo.
I would opt into using it more, BUT GPT usually completes the same requests 5x faster.
GLM 5.2 was the spark for preparing and running inside isolated containers with JJ workspaces (so that multiple can be run in parallel).
Can someone explain to me where that time usage is coming from if not from the model operation itself?
Are the individual tool calls more complex and take more time to complete? Or is the rate of tok/s lower because the model does more compute per token?
In addition to that, some of the open weights models like GLM 5.2 or DeepSeek v4 Pro tend to be MUCH slower when generating tokens, which contributes to the perceived slowness. Although I wouldn't call models like GLM 5.2 slow by any means, e.g. it is currently one of the fastest models inside Notion today.
Yes, in terms of API pricing, GLM 5.2 outperforms the competition. But the only people that use API billing for their coding work are large corporations, where these highly subsidized subscriptions are being phased out.
At the same time, none of these companies will use a Chinese coding subscription for their employees.
For individuals and smaller teams, Z.ai's coding subscription is outperformed by Anthropic and OpenAI. You probably get around the same usage with Claude, but Codex definitely offers more usage for the amount you pay.
We can have a debate how much Z.ai closed the gap to GPT5.5 and Opus 4.8, but if I can freely decide between them in a world where they all cost the same, I simply wouldn't choose GLM.
So the important question becomes: how good will the offering from Z.ai get with GLM 5.3 or 6, and how much will OpenAI and Anthropic cripple their current offerings in the near future?
Employees and students used to coding with thousands of dollars worth of tokens (on a 20/100 dollar plan) will push enterprise to spend.
Having a Chinese model that is competitive won't displace this enterprise spend. But an open model hosted in the US/EU might.
The existence of GLM 5.2 puts a ceiling on how much OpenAI/Anthropic can charge for API Access.
I haven't been keeping up on hardware costs for state of the art LLM inference, but this remark made me ask myself how many readers of the article would actually be able to run this model on hardware they own. How much would it cost to acquire such a setup?
Off topic, but does anyone else instantly pick up on LLMisms like this? It seems like all the models have converged on this style of writing, and improvements aren't really changing it.
For people who follow open LLMs, none of these were quiet and all were the most interesting open model release for a few days/weeks. In one or two months, it will be some other model again. Now I do appreciate the real rapid improvements in open models. But there's also a ton of hype and fast-fashion around all of this.
From my Opus vs DS 4 Pro personal benchmarks, 16 different real-life work tasks, DS 4 has performed as well as Opus 4.8 high overall but with a few drawbacks:
- on the 16 tasks, one needed several prompts to be steered back into the topic
- its review capabilities seem much worse
- DS4 had the cleanly better solution in 3 cases out of 16, with Opus "only" doing cleanly better 2 times out of 16. But still, I want to emphasize, it is the worst-case scenarios that imho matter the most, not the best ones, and on that front Opus outperformed.
That being said, I spent less than $2 on the API over 4 days of work, which is more or less what I would've spent with Anthropic APIs for less than one task.
My only feedback, I guess, is that it's not really clear about the price.
Would the 21.92 be the API pricing I guess?
Cost $5.39 (real billed) ~$21.92 (estimate, list pricing)
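For what it's worth, the gap between those two figures can be worked out directly. This is a rough sketch that takes the quoted numbers at face value; the variable names are mine, and it says nothing about where the difference comes from.

```python
# Figures quoted from the article's cost line.
billed = 5.39          # "real billed", in USD
list_estimate = 21.92  # "estimate, list pricing", in USD

# How many times larger the list-price estimate is than the billed amount.
ratio = list_estimate / billed

# Fraction of the list-price estimate that was not actually billed.
discount = 1 - billed / list_estimate

print(f"list/billed ratio: {ratio:.2f}x")
print(f"effective discount vs list: {discount:.0%}")
```

So the billed amount is roughly a quarter of the list-price estimate, which is why it matters which of the two numbers a comparison uses.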
Pro tip: You could use a multi-modal model to verify images as a subagent spawned by GLM 5.2, to get around this issue.
Opus is the most expensive model under pay-as-you-go pricing, but IMO a fair comparison should include subscription prices as well. For example, when one has a $100 Claude Max plan and uses it up through the month, it might not be more expensive than GLM, or at least not 5x.
And z.ai themselves also have subscriptions.
https://z.ai/subscribe
I’m currently trying to figure out whether a downgrade from Max 5x to Pro in combination with one of those would save me money and if so, how much.
Edit: seems like Anthropic Pro + GLM Pro (Yearly) would let me almost halve my costs of Anthropic Max 5x. Only concerns are about GLM 5.2 not having vision support and also being kinda slower and also not being as good as Opus.
I think it's most fair to compare the plain token pricing that is used by everyone.
As a consumer, yes, it's totally fair. All that matters to me is the price I pay at the pump, not whether that price is "real" or not.
Anthropic have claimed they expect their first profitable quarter this year -- they may have bigger margins on their raw API than you realise.
The real time 3d fluid dynamics appear to be the tricky part, I wish I still had opus access, would love to see if it can do it.
I am not sure where this is going to lead us but it is fun to watch.
At the end of the day, the time earned is more important than the cost for big players.
The ability to spawn 10 Claude agents and rush a project to outcompete someone is more important for big businesses, imo. Also, the small details that GLM missed would take significantly more time to iron out, considering it already took double the time.
I do hope other (open weight) models catch up, but to act like they are anywhere close for me is a bit disingenuous.
Also, every single lab does RL on benchmarks, which is why Opus 4.6 was the last truly great assistant, after it, all models tend to drift into implementation asap.