Discussion (579 Comments)
A lot of people here stated that this is a ridiculous metric, but no one seems to remember that it was introduced in the initial GPT-4 report ("Sparks of Artificial General Intelligence: Early experiments with GPT-4" [1]) by Microsoft about 3 years ago. Shortly after that it was parroted by a network of booster accounts and became a thing every clueless AI hype peddler does to "test" models.
100% marketing, 0% science.
[1] https://arxiv.org/pdf/2303.12712
[0] https://simonwillison.net/tags/pelican-riding-a-bicycle/?pag...
[1] I'm sure there is because of Simon's fame
The point is that the prompt has a subtle ambiguity - "how is the old man going over the river?". My sense is that most humans would quickly imagine a conventional bridge with a road on it leading over a river, with the river in an area developed enough to allow a bridge going over it.
So the implication I draw is these things can find/generate stuff that roughly satisfies the conditions (and are getting better at this) but they still fail to add the assumptions that people would draw.
So my conclusion is that LLMs are getting better and better at "what they" but there are going to be places where they fail to satisfy human common assumptions.
I have mixed feelings about this. I agree with the default assumptions you have as to "what people would draw", however what do you want from this cognitive automation?
Do you want, "what most people would do" or do you want "something creative, an outlier, that still satisfies conditions" ?
Solvers are generally really good at bending your rules, but in a context where you want that. An outlaw rule-bending maniac is not what I want from a helpful agent.
[^0]: https://www.youtube.com/watch?v=mrmqRoRDrFg
Whether I am building hardened engineering systems, or discussing cooking methods, or discussing sensitive health concerns, or navigating complex psychological and interpersonal issues, the model will inevitably have to make some assumptions about context I haven’t provided. I want to know that those assumptions are grounded in reality.
For what it’s worth, a slack-line over a river in front of a medieval town is too anachronistic to be interesting, let alone the idea of an old man riding a bicycle well enough over a slack-line. That is output that was not grounded in a solid world model, regardless of how “creative” it was.
But that's also why we invented formal languages like math and programming. Because there's a lot of times where we don't want ambiguity. Law is basically mankind's greatest attempt at making natural language unambiguous and it doesn't take a genius to realize that that's a shitshow and never going to happen. At the end of the day, to make natural language even relatively low in ambiguity requires a metric fuck ton more words than it would take to express via a formal language (which are also overly pedantic and verbose)
So the problem is that the AI doesn't share those expected decompression strategies. Sure, many humans won't either, but developing a shared language is essential for properly communicating with others. We've all worked with someone who feels like they're speaking a different language. It's exhausting, right?
They definitely get something barebones up and running, but it's far from a fully fledged application.
I did write some stuff myself, just to learn how the Enigma encryption machine worked. But professionally, I stopped coding in November.
Now, I'm using Claude or Codex (GPT-5.5) for frontend and backend and it just gets it right first time more often than not. I've been making use of things like LSPs, Context7 and CLAUDE.md (global and per-repo) and it just stops doing the dumb LLM things that I hate.
AI just changed how I edit code - I still see coworkers (senior developers) failing with Claude/Codex and get stuck when there are trivial solutions if you understand the full problem space. Right now AI is just a productivity tool.
Writing the actual code is a significant part of that, but the codebase is so complex that even Opus 4.7 and GPT-5.5 struggle with it without being fed a *lot* of context and constraints. And even then, they need a *lot* of steering due to making bad decisions that only someone with an intimate knowledge of the theory behind our software is able to catch.
I can only assume that people who think coding agents can completely replace an actual developer mostly deal with trivial software regarding both scope and the type of customers they serve (individuals instead of big companies in industry).
Most good developers are not employed just because they can code well.
What is over is: fizzbuzz and trivial CS algorithm regurgitation as a gate.
What you're saying is like "how do you justify your salary as a NASA engineer when anyone can use Simulink and generate the code?"
It is extremely ignorant.
Coinbase is paying the price for that for every UX glitch, after the CEO was gleeful about HR personnel shipping production code
It will almost never converge on the general solution that will pass tests you haven't given it yet.
This is why AI is sooo good at Javascript and related slop. A solution that "kinda works" is good enough 9 times out of 10 and if some tests fail well ... YOLO and the web page will probably render anyway.
Contrast that to using Scheme or Lisp where AI will have trouble simply keeping the parentheses balanced.
I still must hand hold it every day, as it always does things wrong. Especially after it got seriously nerfed in March.
Note: experiences vary a lot depending on the programming language used, and projects. And the experience of the person coding.
'Nail Guns' used to be heavy, required heavy power cords, they were extremely expensive. When they got lighter, cheaper, battery pack ... at some point, they blend seamlessly into the roofers process, and multiply dramatically the work that can be done. Marginal improvements beyond that may not yield the same 'unlocks' because the threshold has been crossed.
Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases. I go back and forth between them on this document until we're all happy.
For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered. This becomes input to next phase.
I do check the documents, and what they're doing. I also check the tests, some more thorough. And some spot checks on the code to see if I like the structure.
I have mainly used Claude for coding and Codex for design and code review after phases. I ask both to check test coverage after phases.
Managed to implement some tools and libraries without writing a single line of code this way, which have been very beneficial to us.
Since it's so async I can work on other stuff while they plod along.
I think it's not universal though. But stuff that can be tested easily and which you have a firm grasp of what you want to achieve, but not necessarily exactly how, that I've been impressed with.
But yes, I did think that it sorta felt like being a team lead for some eager programmers.
While both Claude Code and Codex are capable harnesses, I definitely think there's a lot more to be gained from the harnesses. Quite a few of the times I needed to nudge the steering wheel it was things that a separate agent with the right prompt could have picked up on.
> For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered.
> I do check the documents, and what they're doing. I also check the tests, some more thorough.
Sounds like programming, but with extra steps.
When I said I check the documents, the initial design document was the only one I really took a hard look at. The intermediary ones I just skimmed, looking for red flags or something I had forgotten to tell them. Those documents served as a basis for their work, and as a record of what was done.
Overall I spent perhaps a few hours on each project, over the course of a few days. I'd check in every half hour or whenever I had time, tell Claude "Great, let's do the next deliverable", or GPT "We're done with phase 4, please do a detailed code review, reference the design document and documentation of previous phases". Then I'd leave them cooking.
I didn't use it often, but when it was needed it was needed.
I’m building something using LLMs to scrape websites/socials for unstructured event data from combined text/images and the only way I’ve managed to get 100% consistent results for a reasonable cost is to break the task down into very small pieces that reduce the scope of mistakes significantly.
At present, for reasonably complex tasks, Codex/Claude will happily code you into an expensive corner.
(Even when they're getting the planning part right, I do also recommend checking the LLM-generated unit tests, because in my experience some of those are "regex the source code" not "execute functions and check outputs").
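To make the distinction concrete, here is a minimal sketch of the two test styles. Everything in it is hypothetical (the `parse_price` function, the deliberately broken variant, and the helper names are made up for illustration, not taken from any commenter's project). The "regex the source code" test keeps passing on the broken version, while the test that executes the function catches it.

```python
import re

# Hypothetical function under test, kept as source text so that both
# test styles can be pointed at a correct and a broken version.
SOURCE = '''
def parse_price(text):
    return float(text.replace("$", "").replace(",", ""))
'''

# Same code with a deliberate bug: the result has the wrong sign.
BUGGY_SOURCE = SOURCE.replace("float(", "-float(")


def load(source):
    """Execute the source text and return the parse_price function it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["parse_price"]


def regex_style_test(source):
    """Weak test: only checks what the source text looks like."""
    return re.search(r"def parse_price\(", source) is not None and "replace" in source


def behaviour_style_test(source):
    """Stronger test: runs the function and compares its actual output."""
    parse_price = load(source)
    return parse_price("$1,234.50") == 1234.5
```

Under these assumptions, `regex_style_test(BUGGY_SOURCE)` still returns `True`, while `behaviour_style_test(BUGGY_SOURCE)` returns `False`.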
GPT 5.5 is a significant improvement over GPT 5.4 but I wouldn't call it an inflection.
I think the smart zone stays within the first 100k tokens, no matter if the context window is 240k or 1 million.
I divide the work to fit within that 100k and use subagent for the tasks.
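One way to picture that splitting is a greedy packer that groups task descriptions into batches under a token budget, one batch per subagent. This is only a sketch under stated assumptions: the 4-characters-per-token estimate is a crude rule of thumb, not a real tokenizer, and the function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. A real setup would use
    # the tokenizer of whichever model is being driven.
    return max(1, len(text) // 4)


def split_into_batches(tasks: list[str], budget: int = 100_000) -> list[list[str]]:
    """Greedily pack tasks, in order, into batches that stay under the budget.

    A single task larger than the budget still gets a batch of its own.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for task in tasks:
        cost = estimate_tokens(task)
        # Start a new batch when adding this task would exceed the budget.
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(task)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Each resulting batch would then be handed to a separate subagent with a fresh context.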
Once I work out the kinks, I’ll be able to further automate it.
Would have taken 10-100x as long for me to build it without AI and the AI version is probably better.
But yeah, I have enough knowledge to know what prompts are needed and to figure out those “oh, I think it’s running slow or failing because of xyz” moments, and then prompt further to improve it based on what I think it should do instead.
And I know where to make slight changes without burning my allotments.
Gemini Pro on the other hand can be quite a pleasant experience.
What changed, I think, was the context harvesting capability of the models. For most programmers, debugging and figuring out how something works were the time-consuming part; the fix was usually trivial. And now models can do in seconds what took a developer an hour or more.
If right now we created a smart grep that just gathers everything relevant to a piece of code, and outlawed LLMs, we would not regress to the previous level. Developers needed this context as much as LLMs do to do their job.
When people claim LLMs just don't work for them, the first question is whether they're using the latest model or not, and if not, dismissing the poster.
The thing is that that same question was being asked a year ago, and even a year before that, but with the models that lead to a dismissal today.
Just make the experiment yourself, wait 6 months, say LLMs just aren't working for the software engineering that you do, and people will dismiss you if you say that you use Opus 4.5 and not the latest model Claude MegaMind 8.8 pro max gigathinking. Despite this model being touted as the inflection point in this article.
But a lot of people excited about new generations (including me, now) are not seeing it as a dichotomy but rather a spectrum where models are getting better, and indeed once a year, or even every 6 months at times, there comes a sudden jump which feels like an inflection point from what came before. Practically, it's a tool like any other: you evaluate it based on whether it's worth the effort and cost for the benefit you get from it, and if it is and it has a good DX, you use it. If the calculation doesn't work for you, it doesn't.
For me, it has gone from a novelty, to good for some kind of quick manual search, to "I guess it can debug some kinds of errors at times in very specific conditions", to "hey, I think I am getting a bit addicted to the IDE autocomplete they provide, even if I don't use them for anything intelligent; it's becoming indispensable now, but only this part", to "it's good for areas I lack expertise in", to "agentic sucks, I will stick with discussing algorithms and architecture with it on greenfield projects", to "holy shit, it can do agentic decently well now, but I am skeptical about giving it access in more than limited cases", to now, where I am getting close to letting it run free on my device in the not so distant future, I guess.
Some of these were big jumps, and at each point I was skeptical of growth. Every time, I thought the growth would slow down now: from the days of 2k context windows to millions now, from basic chat completion to working on complex adaptive systems, game theoretic modelling, heuristics and constraint modelling and other things I throw at it. I am still needed in the loop; it can be so smart at times and then will do something so stupid, but the frequency of stupidity is rapidly decreasing. I am still needed, I don't think it could accomplish alone all that it has done for me. But I do at times at night remain awake reflecting on my self worth for the potential day when I don't add that value. When I have a harder time keeping up.
Also, had someone told me even in 2019 that in 2026 we could have NLP models do what they do today, I would have dismissed it all as sci-fi, and here I am waking up in awe of the world we live in and how quickly we adapt.
Just take a look at this comment on a different topic, which lists all the prerequisites for those AI models to work well, from the perspective of someone who has bought into the hype: https://news.ycombinator.com/item?id=48157235
If this is everything needed for an LLM to generate acceptable code, what is even the point of them?
There have been plenty of small issues like tables not having the columns aligned, or the game menu being a bit offset, or one graph being a placeholder instead of connected to the actual value. And of course I've had to instruct it on all the flavour I want.
But honestly, for a simulation strategy game, especially without doing the "proper" setup from the start, it's been _very_ good.
At any point you need to have agents review, verify and test the other agents' output and iterate until the output is perfect.
And also, have good e2e tests.
IMO, if you don't spend at least a few tens of millions tokens per day, you aren't doing it properly.
Ever since November 2025, the so-called "inflection point", I've been wondering for whom coding agents became "really good".
All I observe is that they got better at tool calls and at answering questions about big codebases, especially if the question has a vague pattern to search for, and they're super useful for that! For generating production code, even with a lot of steering and babysitting?
Absolutely not. Not quite there, not even close, in my experience.
But we should stop talking in 1s and 0s, especially with marketing hype trains. There exists a gradient of capabilities that agents have, which really depends on the intricacies of the codebase you're working on. I think everyone has yet to discover how to better apply these tools in their day to day work.
But that totally collides with the current narrative, that flattens out our work to be always the same and that can be automated easily in each case, it's not!
That's why the debate is so polarized imo, there isn't a shared experience
For example, I've had the opposite experience of yours, generating very high quality work using Claude (such as https://github.com/kstenerud/yoloai). Just in dealing with all the bugs and idiosyncrasies in the technologies I'm using, the agent has been a godsend in discovering and cataloguing them so that the implementation phase doesn't keep tripping over them: https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...
And the agents keep getting better all the time. Even in the past month I've noticed a considerable jump in its ability to anticipate issues and correctly infer implications as we build out research, design, architecture and planning docs. By the time it comes to coding, it's mostly a mechanical process that can be passed off to sonnet with a negligible defect rate.
Like, no it doesn't seem like very high quality work... It just seems like a vibe coded tool.
Edit: yes it's wrapping Claude. It's BREAKING the TUI. Not sure what people aren't getting here...
The problem with being such a naysayer is that you're entirely disconnected from what's going on. You haven't tried an agent like Claude Code and experienced it for yourself, so you don't recognise what it looks like when it's in front of you.
FWIW, I agree with Philip here. I don't think this screams "high quality" to me. I'm also not trying to take a shit on your project. Nothing screams "terrible" to me, but yeah, it does look a bit sloppy. There's no polish to it. It looks like someone that grades on "it works" and that's fine. But it also isn't everyone's cup of tea. Where the sloppiness comes in is like what Philip said. First thing I saw was the gif and well... I think Claude Code is sloppy. But this is also a great example of how and where LLMs visibly fail. Creating a box in text is pretty simple. There's tons of tools to do it. And the LLM 100% knows about characters like ⌜⌝⌞⌟⎜, it just doesn't use them and doesn't care. The code itself also looks very LLM generated.
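For reference, a text box really is only a few lines. The sketch below uses the standard Unicode box-drawing characters (┌ ┐ └ ┘ │ ─) rather than the corner characters named in the comment, and the function name `draw_box` is made up for illustration. It pads by `len()`, so it assumes every character is one terminal cell wide, which does not hold for wide CJK characters or emoji.

```python
def draw_box(lines: list[str]) -> str:
    """Wrap the given lines in a box made of Unicode box-drawing characters."""
    # Width of the widest line; an empty list yields an empty box.
    width = max((len(line) for line in lines), default=0)
    top = "┌" + "─" * (width + 2) + "┐"
    bottom = "└" + "─" * (width + 2) + "┘"
    # Pad every line to the same width so the right edge lines up.
    body = ["│ " + line.ljust(width) + " │" for line in lines]
    return "\n".join([top, *body, bottom])


if __name__ == "__main__":
    print(draw_box(["hello", "terminal"]))
```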
It's fine and I don't think you have any reason to be ashamed of it, but I also wouldn't go around boasting that it is an example of high quality work either. And FWIW, I can't think of a single heavily LLM assisted codebase where I don't have similar feelings. I've seen stuff with more polish, but yeah, they feel off.
This is a space I feel weird in. I love the terminal. I love that there's a lot of new TUIs. But it also feels very weird because it is extremely clear that a lot of these new TUIs were written by people (or machines) that don't really have a lot of experience in the terminal itself. There's a real shared language among people like me who live in the cli. There's a reason people like me can pick up a new tool and guess certain flags and certain ways to use them. It's because of a shared design language that we know of, and we end up writing that way because we know it reduces the cognitive load on our peers. But the LLMs? They don't have that shared experience. I think this is true for a lot of stuff, not just TUIs or bash tools. Things just smell... off...
That's not what this product is; merely a tool it uses.
> .row > div > div, .alert
This is fairly simple CSS, not multi-threaded systems development. A bar low enough that you could trip over it. I catch this kind of stuff all the time (literally every run), but only because I read every line. Most of it wouldn't be the end of the world for any particular task, but would eventually result in a complete mess.
I think the people doing the heaviest breathing around the elimination of programmers either aren't very good at programming, or they're not paying close attention. Or they're hyping their book.
LLMs have traditionally had problems with visual rendering (the good ol' pelican on the bicycle test). I wonder if this is more of the same?
Yeah, absolutely. People think you're picking on, like, code formatting and no, dawg, your code doesn't do what you think it does, or it only handles the happiest of happy paths.
I do find it funny when people get mad about you critiquing their AI project. You didn't even write it, dude.
Amazing how the LLM is godly with things I don’t understand, and falls over completely when it works in my domain… I wonder why that is /s
As I commented on another thread
> If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!
Reverse engineering a proprietary protocol from a binary executable.
I heard about people finding security vulnerabilities in compiled code with the combination of Claude Mythos wired up to a disassembler like NSA's Ghidra. Someone here mentioned that GPT 5.5 "extra high" is just as capable, I had a problem to solve, spare token quota for the week, so... I gave it a go.
My problem was that I'm working with a product that uses a legacy 1990s style network appliance output log format that is proprietary, undocumented, and has no publicly available decoders other than an app by the same vendor, and that app has fundamental limitations. (I.e.: it's nothing like Splunk or Elastic.)
Codex with a Ghidra MCP bridge figured it all out: the framing, bit and byte packing, endian order, field names, data types, etc. It made me a neat little protocol parser in a modern language that I can use to spit out something sane like NDJSON or OTLP protobufs.
There is no way I could have reverse engineered this myself from compiled C++ code and/or packet captures! The format isn't self-describing and is incredibly dense (similar to NetFlow). In a hex viewer it looks like line noise!
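To give a feel for what such a parser involves, here is a minimal sketch of the framing, byte-packing and endianness steps for a made-up record layout. The format is entirely hypothetical and is not the vendor's: it assumes a 2-byte big-endian length prefix followed by an 11-byte payload holding a 32-bit timestamp, a 4-byte IPv4 address, a 16-bit port and a 1-byte severity. The function names are illustrative too.

```python
import json
import struct

# Hypothetical payload layout, big-endian, no padding (11 bytes total):
# uint32 timestamp, 4 raw bytes of IPv4 address, uint16 port, uint8 severity.
RECORD = struct.Struct(">I4sHB")
LENGTH_PREFIX = struct.Struct(">H")


def parse_frames(data: bytes):
    """Yield one dict per complete, recognised frame in the byte stream."""
    offset = 0
    while offset + LENGTH_PREFIX.size <= len(data):
        (length,) = LENGTH_PREFIX.unpack_from(data, offset)
        offset += LENGTH_PREFIX.size
        payload = data[offset:offset + length]
        if len(payload) < length:
            break  # truncated frame at the end of the capture
        offset += length
        if length != RECORD.size:
            continue  # unknown record type in this sketch: skip it
        ts, addr, port, severity = RECORD.unpack(payload)
        yield {
            "ts": ts,
            "src": ".".join(str(octet) for octet in addr),
            "port": port,
            "severity": severity,
        }


def to_ndjson(data: bytes) -> str:
    """Render every parsed record as one JSON object per line."""
    return "\n".join(json.dumps(rec, sort_keys=True) for rec in parse_frames(data))
```

A real decoder for a dense, undocumented format would need many record types and careful validation; the point here is only the shape of the loop: read a length, slice the payload, unpack it with an explicit byte order, and emit a self-describing record.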
> For generating production code even with a lot of steering and baby sitting? Absolutely not, not quite there not even close in my experience.
As I said, this is an example of using AI successfully to produce a high quality product (one that I use every day).
But to your point: I am solving hard problems that people really have. You just don't see those because I haven't mentioned them publicly yet. And they won't be released or talked about until they're ready.
I use Claude all the time, it is immensely helpful. It is also very nuanced and requires a high level of expertise in a specific domain to produce quality work. Even then, that takes time and effort. Anyone saying otherwise, quite frankly, doesn’t know what they’re doing.
Not just when using tools, also when using humans. The frame of reference of what is considered 'production code' differs immensely between organizations, teams and people. The code I get from LLM's is usually much better than what I get from my peers. Maybe not one shot, but after some steering it gets there.
It also isn't lazy. When generating test cases for relatively simple pieces of code, it usually tests pretty much every path and doesn't stop right at the 80% code coverage quality gate.
I can imagine if you're at the level of Linus or something, you might conclude differently, but most people aren't there at all.
I think it’s really down to this. Nobody can agree on what counts as production-quality code. I remember joining a company with what I think (hope) most of us would call horrible quality code. It was an absolute mess, barely compiled with hundreds of warnings, and had uncountable number of bugs. They didn’t even have a bug tracker so nobody even knew how many they had.
But the people working there already were so proud of it! None of them had ever worked for another company so they had no idea how bad their code was in comparison with the rest of the software industry (which itself is a very low bar). I told the founder we had a huge code quality problem and he looked at me like I had horns growing out of my head.
When someone says their LLM is producing “production-quality” code, actually look at it and see. Arguing about it on HN is pointless because everyone’s quality bar is different.
You'll also notice that Linus doesn't poo-poo AI at all. His only gripe is with people using it wrong, such as flooding security lists with drive-by security reports after pointing their agent to the code and saying "find me some VULNS!!1!1!!"
Then you should seriously question who you're working for imo.
> It also isn't lazy.
It is indeed lazy in my experience, as in being overly zealous when creating useless test cases and ignoring the important ones. I don't want it to test a sum; I want a test that can "guarantee" me that a further change doesn't break existing code. And producing this level of quality in tests is HARD, and requires a lot of steering with agents. This culture of test code coverage is just wrong: the best code base I worked with had code coverage only on the small percentage of code that matters, and the rest was covered by static type checking and integration tests
Granted, I'm mostly working in small-to-medium codebases, 20k-30k LOC incl test suite. I wonder if that's a factor in my positive experience. Curious to hear your thoughts.
I see patterns and solutions emerging from hand coding, not the other way around. I can't start with a prompt, unless again I have the feeling that the task can be one-shot with minimum effort and context.
Starting with a prompt, or in plan mode, is not how I trained as an engineer. I cannot foresee what something should be/look like until I explore it myself with code I can relate to, that I'm connected with and that I fully understand. For example, my muscle memory suggests a specific data structure only after I see some code patterns emerging. Hard to explain, hopefully it makes sense.
If I ask the agent to do that initial exploring, even with a tremendous amount of instructions, guidelines etc., it usually starts down a path I wouldn't have started with. What I tried in such cases is to stop it, correct it and generate again, only to end up with more prompt words than lines of code. This is true for every visual task I'm working on (I program non-web UIs). Let alone doing it via spec files. If it's something I don't care about, yeah sure, maybe a little tool for entering/editing data, but alas it always defaults to slop web apps, and I get it, I mean most of the training set is web apps
Probably where the mismatch is in this discussion. The measure of what is quality code is all over the place. For some, some form of "good enough" is quality. And for others, metrics like terseness, readability, vacuous amounts of comments, cleverness, various fuzzy measures of "idiomatic", etc, make "quality code" much more of a moving target.
In general I tend to agree with you: if you're talking about a codebase you are deeply familiar with, the value-add from having agents write the code probably ranges from very small to negative in most cases.
On the other hand if you're trying to make changes in systems you are not familiar with, LLMs are a huge speed boost to folks with enough experience to sniff out what would be a bad path essentially via socratic method to the agent.
Obviously there are no silver bullets and no substitute for judgment. I will say though, I'll trade off ugly local code for good data models and interfaces any day of the week, and there is definitely an archetype of engineer that is very precious about code without good judgment on where it matters and where it doesn't.
Irl, (a) different people's ways of working with ai are a million little islands and (b) bottlenecks vary enormously by coder and codebase/task.
Also... I think our era has an intrinsic bias that change=progress, productivity, etc.
Take the "networked computing revolution" of 1990-2000. These computers did land on every desk and every pocket. They are administration powerhouses. Excellent for all manner of administration tasks.
But... what this netted out to is "change." We send a lot more emails than we did letters. We communicate a ton. Secretaries went extinct. But "administration" grew.
A university faculty typically has more admins. Companies hire more accountants, HR, project managers, etc.
Maybe administration was never really a bottleneck.
Code has a lot of this. Everyone has a road map, wishlist, etc. It appears as though "code capacity" is the bottleneck. But maybe most of those companies can't really generate much more value from more software.
Anecdotally, it seems that many mid-tier shops are migrating/ modernizing their stack, and suchlike.
I haven't heard of many belting out features, and increasing prices or sales.
Most bottlenecks are upstream of another bottleneck. Few are a "dam."
My most recent pet project is a transpiler from Wasm to Go, and I find it incredibly impressive that recent models (I've used Sonnet, Opus and Gemini, far more successfully than GPT) are able to just pick up the project and work at all these levels:
- Go code that implements the transpiler (parsing Wasm, building an AST)
- Go code that gets generated by serializing the AST to a .go file
- Go code that manipulates the AST (to optimize it), and its effect on the generated code
- Go code that's grafted to the generated code (to implement more advanced opcodes) and how to interact with it from the AST
- C code that gets compiled to Wasm, then translated to Go, then called by Go
- Go code that gets called by this C code to implement a C stdlib
- WAT and WAST files that are used to implement the Wasm spec tests
I find this impressive because I have to think hard about all these levels, and I feel many programmers would have a problem with this.
And it's very often way easier for me to just write: "I want to generate this code, build me the AST that does it", than go "count parentheses" in the Go code (I do have some LISP experience; it's still easier).
Feel free to scrutinize/criticize the code. Not vibe coded, but plenty of GenAI help.
https://github.com/ncruces/wasm2go
Its 'production' code because its a small browser game which has very small to 0 requirements on security and being perfect but high requirements on 'ever even doing this' and 'fun'.
The code it generated hat 0 compiletime errors. I was able to descripe 10 things to do in one task and it just jugged along solving all of them.
This doesn't need to become so much better to be useful. Its already very useful for a lot ofuse cases like researchers which have to verify the math anyway but are not good in writing code for filtering their testdata, converting them and running it.
Small websites, fun projects, helper tools etc.
But while we speak, in the background stuff is still happening left and right. More compute, better algorithm, more RL etc.
We could already be at 95% at 'ai will take your coding job' without knowing because these 5% are so relevant.
And no spelling errors either!
Also,
> Really? What duplication did you actually find? I count a few small ones in buildMounts and ReadPrompt, maybe 20 lines or so, but hardly anything worthy of such an epithet
>> embedding-shape 1 hour ago | root | parent | next [–]
>>The duplication I'm seeing isn't just "same text repeated" but structural duplication. Doing a quick 5 minute look again just to give you some pointers; runtime.MountSpec construction in buildMounts, Workdir vs aux-dir mount-mode handling, repeated one-off mount append blocks, overlay detection and so on, the list goes on. Just those should account for 200+ lines.
If you don't see any errors or problems, is it because there aren't any problems to see, or because they take a trained eye to spot?
This is nonsense. I'm not a SWE but a CEO, if that were true I'd be firing without a hitch. And yet this is not the activity we see. Why is that? Perhaps merely writing code is not the entire job.
Your Product Manager is not a coding job. Your Product Owner is not a coding job.
vibe-kanban exists; you could already do a proper experiment letting your PO maintain a vibe-kanban board with proper requirements and see how an agent progresses.
But 5% is often enough to be what breaks it. Doesn't help much when your PM, PO or CEO or CTO have no clue about coding harnesses, coding agents, coding platforms, LLMs etc.
2 years ago when I prompted something, it had compile time errors left and right. Took me 3-10 iterations to even get it running.
Now it's one-shotting a lot. Including websites, refactorings, etc.
The question is what is missing? How far are we from it handling huge code bases vs. smaller ones? How far are we from it comprehending the whole architecture, so that it doesn't try to put a service in the wrong place just because the context is too small?
Mythos is 10 Trillion, that might be already pushing it.
95% might not be enough for someone, in the sense of "yeah, I can't do the 95% and I can't do the 5% either; either the AI can do 100% or I still need Kevin with his knowledge, even if it's just for the last 5%"
That has been my experience too. The days when I'm very focused, being extra deliberate and constantly questioning/examining/challenging things, the results are much better. Autopilot days just go through in a daze and the outcome is objectively worse. This has made me much more hands-on and pushed me towards models which are actually not that "clever" like codex at effort=low but fast. Given that I'm doing the meat of the thinking, might as well not be slowed down by the model and lose the flow.
I know I have struggled to keep up, and fall into the trap of approving things (either commands or recommendations) without taking the time to really process and think about them.
It's a bit like the age old problem of "it's super easy to ask questions, and can be super hard to answer many of them". So the economy of the conversation gets out of whack fast.
I think getting a decent setup with a fast feedback loop for the agent combined with context (in repo markdown)+memories goes a long way.
After having Claude Code "remember" my preferences and tools, it's more efficient.
It has a tendency to copy existing patterns so a good AGENTS.md with best practices and architectural goals goes a long way to prevent it from duplicating patterns you're trying to get rid of.
I think this may depend on the sorts of work you do. For those of us who mostly live in web using established frameworks, that's about when I came to conclude they could do everything and do it well.
I can have opencode discover third party APIs and generate fully working solutions that are well integrated into an existing long-lived codebase. I still review the MRs by hand but I only ever discover spec errors or style issues, not defects in the code itself. This was a big change from ~summer 2025.
This is a really well defined space though with strong conventions. If you're doing something more interesting YMMV.
Well... I don't know what you expect but so far I'd like all my colleagues to write code at the level of what I get from codex.
The problem is our CEOs' fear of the future, which pushes them to peculiar decisions that objectively make no sense (cf. the infamous discussion of the Microsoft employee on GitHub who couldn't force their agent to do the proper thing).
It's not the first time I've witnessed this kind of discrepancy and probably not the last; I just learned to adapt to it.
Now I just use deepseek. It isn't any dumber, and it costs way less.
You can dig up my past comments semi-arguing with simonw where I said AI just isn't good enough yet, but lately I've been using Codex mostly just to review existing Godot/GDScript code: https://github.com/InvadingOctopus/comedot
and now I'd say that in this day and age one would have to be dumb to not use AI in SOME way :)
It's helped me catch a lot of bugs that would have taken me a long time to even notice on my own. I guess it helps that the project is modular enough where most files can be considered standalone, with just 1-2 dependencies and well-commented already, so the AI can look at each file on its own one at a time. You can see the AGENTS.md I use on that repo.
Most of my productivity in the last 3 or so months has been thanks to AI, though none of the code there is AI generated. I even bought a MacBook Neo just to use as an "AI thin client" while on travel, even though I already had a beefy MacBook Pro M2 Max that I just keep at home/hotel as a desktop now. Codex's recent remote control features have made it more useful for the moments when I get a cool idea while out at a cafe or on a walk.
I don't just copy-paste the AI's output, because it's often inefficient anyway (like creating redundant variables/functions), but I find its findings useful for manually cleaning up my shit. Maybe their training data is not that good with GDScript yet which is a bit of a jank language anyway.
So my core code is wholly made by meat, but I do have fun now and then telling Codex to make experimental games using only the library of modular components I have written so far, to test my framework and also the AI's abilities. This kind of work seems like a surprisingly good match for AI: It just has to put existing blocks together, that already have well-defined interfaces/contracts etc.
I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.
Claude on the other hand, terrible: https://i.imgur.com/jYawPDY.png
Grok is OK for general stuff, never tried it for coding.
Gemini's UI/UX and lack of privacy and the AI itself is so terrible I tried it just maybe 2 times ever...and it refused to work on Google's own Flights website and reverse image search! (it told me to do it myself)
Deepseek refused to talk about Taiwan or Tiananmen Square so I'm not sure if I can trust it for anything else lol
I've recently tried codex, and I have it set to plan mode with 5.5 and I'm hitting the limits on a single task on a "medium" sized codebase.
Then I have a script that summarises that I usually run before pushing or at end of day.
Works quite well for both improving my code and the code ai wrote.
As I said we have a plenty of different envs, codebases, requirements. Things are complex.
You're posing it like I tried just one time. It's been hundreds of hours of tries and I just found out what works best for me, like everyone should do. My original post above isn't that hard to understand.
Let me stress this out again:
> That's why the debate is so polarized imo, there isn't a shared experience
My question is not so much about sharing a cherry picked example, but the question was more like "have you tried in earnest to make it work". That's the part that wasn't clear from your original post. But you say you have, and you weren't impressed. Fair enough. I'm not trying to convince you otherwise, but I encourage people to give the tools a fair chance before throwing up their hands and deciding it's meh.
Having said all that, you're right there isn't a shared experience.
An idiosyncrasy of humanity is that the dumbest individuals tend to also be the loudest.
F1 mechanic pops the hood of a mass-market Toyota Corolla and doesn't understand why everyone says it's really good.
A lot of us are out here building websites or phone apps.
Not to say that these things can't also be taken very seriously from first-principles, but I think that's rare.
I think it more reliably does IaC with established patterns especially when it can do a dry run.
Python is pretty decent but usually you need good prompting and a little bit of steering to prevent slop. The slop usually works tho
Codex w/ gpt-5.5 seems faster but maybe just a bit below Opus 4.7 quality.
I gave Opus access to a repl (pyrasite-ng) in a running Python process and it managed to find an 8 year old "memory leak"--a module level cache with no eviction. It did that using GC module and exploring the heap. I was pretty happy with that outcome. It would have been quite challenging for me to find myself without at least a few weeks of deep diving into memory leak hunting docs/resources.
Is there anyone in the industry noted for their skill, quality, and taste, e.g. Jonathan Blow, who is impressed and thinks the AI is really good? I haven't seen any. In my personal circle, the best devs I know are either micromanaging or shunning AI; none of them think the agents are capable or really good. The mediocre devs I know are largely on board. This applies both online and off.
Couple this with the fact that no AI focused project has come out, not a single one, that meets a high quality bar with nontrivial complexity.
I am an AI quality sceptic. They can be useful if you don't care for quality, but I never don't care for quality. I live for quality.
I agree, but you contradicted yourself just one line above.
> For generating production code even with a lot of steering and baby sitting? Absolutely not
Moreover this is further in contradiction with several facts:
1. the majority of this industry has always been composed of mediocre/bad developers, often unable to write a fizz buzz
2. the majority of work in this industry is implementing mundane CRUDs to move and transform trivial data across the organization's stakeholders and/or customer or third parties
3. there's lots of stellar and respected engineers leveraging the tools on a regular basis even on problems that are far from trivial and outputting quality code much faster than they would've done otherwise. Mitchell Hashimoto has blogged about it in his work on Ghostty, Sanfilippo has blogged about it in his work on Redis and so did plenty of others. I know several open source stellar developers who benefitted greatly from these tools, yet you think it cannot improve the quality and output of the most mundane tasks out there?
> I agree, but you contradicted yourself just one line above.
>> > For generating production code even with a lot of steering and baby sitting? Absolutely not
With this last sentence I obviously meant "in my experience"; it's not that hard. I don't buy it: your facts are highly biased towards web development. It's a common mistake here on HN to think that's the totality of the industry; luckily it's not.
There's many more, from Flask to Docker, from Ruby to FastAPI or Tanstack. LLVM has integrated AI-generated PRs, so did Swift and Mojo. Sasha Levin has pushed into Linux Nvidia-related kernel changes that were authored by LLMs in 6.15. You can be certain there's an order of magnitude more where people don't admit or tag their PRs as AI generated or co-generated.
In fact I am quite confident that projects and developers that are not leveraging the tools are increasingly rare. There's really no reason in 2026 to write a non-trivial PR and not ask an AI tool for a cheap review.
The industry is changing, I don't really like the trends I'm seeing, but to state that LLMs cannot and are not writing production code, very often quality ones, (especially when used, setup and overviewed properly) is plain denial.
Your anecdotal experience isn't relevant, especially when applied to the largest parts of the industry, composed of mediocre developers working on terrible codebases.
For someone that just dabbled in coding prior, it went from AI building 80%, and struggling through to finish the 20% when trying to build an app/website.
Now it's like 97% and struggling with the last 3%. Yes, it'll look rough around the edges when evaluated by a senior dev, but being able to build MVP level things to completion with ease helps you stay engaged and motivated to continue and learn.
Who needs to generate a dumb demo of a 97% done crud app? We had code generators for those. Every time I read claims like that and ask people to explain further, I discover it's people who were not productive before who are generating the so-called "MVP level things to completion with ease".
If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!
LLMs can effectively validate your business idea
To give a specific example, 12 months ago I had a client pay me to make a Chrome plugin that changed the rows in his Shopify Products page to display Quantity and SKU.
These days you'd just one-shot it in Claude.
I've been lucky enough to work at places with majority intelligent engineers with similar tastes on quality to my own... but it seems to be that's not the norm or the case everywhere.
and it's the 90% that's most vocal. Sturgeon and D-K seem to go hand-in-hand.
If these people had a burning desire to build things prior to LLMs and couldn’t put in the effort to learn to build them (which is also fun!) then why would they ever put the effort into anything to understand it and make it good??
The answer is "for lots of people, but not you".
You're doing a vague impression of being fair and even-handed, arguing for non-polarization, but underlying everything you're saying is an obvious attitude of polarizing superiority: That _your_ personal experience with AI is the real truth. That _your_ codebase is more intricate and more challenging than what other people are doing. That everyone else is being led by a "marketing hype train".
> Absolutely not, not quite there not even close in my experience.
I obviously mean in my experience, not the real truth.
> That everyone else is being led by a "marketing hype
That is obvious instead, and I later say there are no 0s or 1s; every job has its intricacies
Yes, there are ways to convert raster images to SVG for use in training data but it's not a good use of anyone's time.
Mistral seems to be the exception. Their new model from a few weeks ago is worse than self-hosted Gemma.
- Memory market cornering which mitigated the adoption of local AI despite great open models being released.
- Fast penetration of IP exfiltrating tools in companies world-wide.
- Developers producing more code than they can read.
- Autonomous agents killing Open Source by siphoning the attention economy
- Autonomous agents destroyed online communities (including HN)
- Autonomous agents being used in warfare (targeting, propaganda...)
- Widespread vulnerabilities discovered, Widespread supply chain attacks.
- Increasing inequality, fracture in perception, Green indicators, Grim realities.
Medicine has done amazing things in my lifetime.
See you in 10-30 years when people are still dying of the same shit as today like oesophageal cancer and glioblastoma.
Maybe in the next century but by that time you and me both will be under the ground, and no, Amodei's doubling of human lifespan simply won't happen.
Ideally we'll come out of the AI hype cycle having learned better practices.
This is a good thing
This is a bad thing.
Wait, what? What is that?
> - Fast penetration of IP exfiltrating tools in companies world-wide.
That goes on the benefit side, I believe.
> - Autonomous agents killing Open Source by siphoning the attention economy
Anything attention economy disappearing is a "good riddance" to me.
I believe they are just saying that RAM prices went crazy
Hopefully she rejects all this out of hand, but if she doesn't it'll mean that none of our trainees get the benefit of her experience, who she is as a person, and what she has to pass onto them.
We have 6 monthly reviews as instructors where we are told the same thing. "How could you use AI for your teaching?"
They don't even feel the need to justify why this would be desirable, or is needed at all. It's just pure bandwagonning. Unbelievably, most of my coworkers are extremely positive about AI, although none of them have told me they use it for anything besides preparing their lessons for them — they just use it instead of having to think, or spend time preparing...the only important thing they do at work.
It makes no sense to me.
I have to consciously avoid using AI for more cognitive tasks, though. It would be very tempting to have Claude, ChatGPT, or Gemini summarize, classify, and grade the students’ assignments, write individual feedback, prepare my lesson plans, etc. However, I know that my engagement with the material and with the students would suffer. I also want to show the students that they are learning together with me and with each other, not with bots.
I am semiretired and have a light teaching load that gives me plenty of time to prepare for class. I can see that full-time teachers might find it hard to resist the lure of offloading their thinking to AI.
That gives me a starting point. Of course, I modify it. Maybe I bounce back and forth to the AI for further refinements and suggestions, but ultimately I have to be happy with the result.
When prepping the individual lessons, the biggest time saver is coming up with examples to illustrate particular points. I could do this alone, but sometimes that involves staring at a blank screen for a while. It is faster to ask the AI for suggestions, pick the one I like, and refine it further myself.
AI is a tool. Use it appropriately.
Yes, but no room is made for people who see no use for it. There is a forced-consensus that this technology is useful, which I have to combat against at work.
We teach in a very different environment, but your use sounds typical of my colleagues. "I ask it for suggestions and pick one", but nobody seems to wonder about what is lost when we shrink the horizon of what we will teach to the most likely outputs from a chatbot, one of which we will use.
Maybe this makes more sense in other fields. I have to prepare people to work in the shipping industry, in extremely dangerous roles where they will be operating heavy machinery, steering ships, driving cranes etc. The fact is that AI knows next to nothing about this field because an AI cannot experience handling a ship in rough weather, has never secured a boat to a ship's side with the rain and wind in its face.
Yet, when people are brought in to instruct our trainees, they are told to "tell AI what you want and pick one of the suggestions", in the best case, or just give over everything to the AI in the worst case. And nobody seems to be able to explain why this is a better way of working than sitting with a pen and paper, brainstorming some ideas for a lesson based on your real experiences, and then delivering it. The only justification I'm ever given is your one, "I pick from a list so I am really still in control", "it's quicker and I don't have to think as hard or as long", "it's better at making slides or writing good-sounding (to management and auditors) lesson plans". No-one ever seems to justify it by saying it is genuinely a better experience for the trainees.
- pre GPT-5.4: very limited use; some smart people got some mileage out of the models, but it always required serious work and a very suitable problem. Of course the models could solve homework problems, but that felt more like a downside to us who teach.
- since GPT-5.4 (Mar 2026): the "wow" release; suddenly answering MathOverflow-level problems that have previously been stumping experts. Still prone to hallucinations, but smart enough to use the built-in Python skill to verify its claims on small examples when possible. Probably a lot better at formula-heavy math than at the abstract "philosophical" kind.
- GPT-5.5: gave me a fascinating, significantly nontrivial and highly instructive "proof from the book" on an MO-hard problem that I'm in the process of writing up. Might have been luck and good prompting, though. Didn't really feel like a qualitative leap from 5.4, but I take quantitative any time. Still requires suitable problems, but it's much harder to rule out suitability from the get-go.
Claude and Gemini have been also-rans the whole time and still are. I use Claude for secretary-like tasks; occasionally it finds an easy proof too, but usually because I've missed something obvious.
Oh, and GPT, and to a lesser extent Claude, are great at hunting errors in maths. Probably 90% of my prompts so far have been for proofreading my writings.
The average office worker is amazed at Copilot (not in the IDE - but the app bundled with Windows), and they mostly copy paste material into their enterprise provided ChatGPT / Gemini, and get tips from Facebook / Instagram on their top 5 best prompts for work productivity
Showing them agents that automate work at scale is a very magical experience
I use it a lot now for knocking up grafana charts etc. It’s not so much that the LLM is feeding the numbers through. You can still use real tools to analyse and summarise the numbers, it’s just much quicker at driving them.
As ever with data analysis, two things will continue to be true. Real insights come from spotting something that looks off and digging into it deeper. Secondly, it’s really easy to connect data in a misleading way.
I’ve had a Claude analysis handed to me this morning including a summary list of actions we’re going to take next which falls into this very trap.
The insights you’ll get from your data will only be as deep as the curiosity of the person at the helm.
I find it’s easier to version control and diff the .md artefacts, those remain my authoritative source.
If you are a bit technical, reveal.js is actually really nice for this. I one-shotted a PDF export for it that uses a headless browser. I've used that a few times now.
What works well for me is to take an existing presentation and then some raw input and generate a new presentation in the same style as the old one from the raw input. After that, I can go in and tweak individual slides.
Another thing I did recently was take somebody's existing pitch deck and fix it with a one line prompt: "this deck is a bit meh, pimp it!" that worked unreasonably well. I like using shitty prompts like that. Codex often manages to do the right thing if you don't overthink your prompts.
Classic deck of somebody that used way too much text and only bullets. It did a great job on that presenting the content in a more simple and better structured way. Pulling out key facts and highlighting those, simplifying text, etc. Doing that manually would have taken hours.
The important part is the presentation matching your presenting cadence, which is something LLM generated presentations never get right. I don't have a problem with people generating presentations, but most of the time they just end up reading whatever is on the screen when presenting.
Personal: my wife tutors in her native language to non-native primary and high school kids. They are all using these tools now to generate fresh content for practice based on school lesson plans. The kids are improving much more quickly now than they were just a few months ago.
Thanks!
Once I was going to send some figures to leadership so I checked the queries myself and not only had it done it correctly, but it had also included a lot of sanity checks with other places in the database which as a human I doubt I’d have had the time or inclination to do.
Even for modelling work it can be good to check your ETL queries, or write one itself and then check it etc.
We have whatever AI is in Teams transcribe every meeting, and it's scarily good at it. It's also extremely good at summarizing or finding things from previous meetings when tasked. One disadvantage in this is that I can see how stupid I sound in writing. I'll go "yeah, hmm, yeah, that's, yeah", but it really is pretty good.
I assume we're going to see a massive increase in AI with this Cowork inside the Microsoft client. We actually have a better tool available through a LibreChat where you can create and configure your own agents with the same filesystem access to your one drive, and a lot more tools and models than just Claude. Almost nobody has been capable of figuring out how to use it though, so they've been using the regular office365 copilot and it sucks so bad that a lot of people stopped believing in AI.
It's ironic that Microsoft fumbling the ball on AI, but being very good at enterprise customers (especially non-IT), means that they'll likely be the company which is going to sell us AI tools that people will actually use. I have no idea why it's so hard for people to pick up the LibreChat tool we're given access to through our equity fund. It's quite literally a copy of ChatGPT where you can point-and-click configure an agent, but we're seeing that even employees who use a lot of ChatGPT privately don't use this tool professionally. Meanwhile everyone has been capable of using the Microsoft thing (that I personally think is less user friendly since you will need to add your configuration files to every prompt).
That's because M365 is integrated with the whole Office/Exchange environment, especially in terms of security policies, etc. MS also guarantee that the data are private; this is very important for many companies both from the IP protection perspective and the liability to expose some users/customers data (think of GDPR regulations in Europe).
I don't know who is behind LibreChat, probably some good and friendly folks, but when it comes to privacy/security Microsoft has much more to lose and if shit happens it is easier to sue them than some random VC-financed company from the USA.
That’s in a finance shop. I’d imagine it’s different in programming shops where handing people Claude code is a bit more plausible
Some of these are now contributors.
I also have a friend (beware, N=1 study) with zero prior programming knowledge that has released his first app.
A year and a few jobs ago I was genuinely up against a wall I could not see breaking through, not if I wanted to ever sleep again. Hundreds of completely bespoke customers. Hideous archaic tooling. Two of us. It was bad times. So I started paying for Claude - desperation move, to try and vibe my way out. Honestly, it's been a little bit like having superpowers.
Not just code generation, which has been great, but gaining knowledge and understanding with incredible velocity - sort of like how RSS felt back in the day, or when Google stopped being worthless in the very end of the 20th C. When Wikipedia started.
So where am I now? Well, I ditched the hell job (I didn't really drink the koolaid of their "Enterprise Solution" anyway), and got a regular day job in my core competency. I guess I do a lot of what is called "vibe coding", all kinds of utilities, what I call my "extracurriculars". A graph view for Asciidoc in VSC to show includes, xrefs, partial includes. Graph view for everything actually - it's surprisingly insightful for PDM and config management. Analysis tools for sensor faults based on Python open source astronomy tools. All sorts of converters and aggregators and cleaners for a devil's piss bucket of enterprise systems. A bazillion new MapTools macros for gaming, making complex RPG systems nearly pushbutton. A little harvest of local LLM systems doing all sorts of things, like my "Reviewinator" for copy edit. I could type the rest of the day and wouldn't come close to the end of the list.
So, pretty amazing. Very interesting systems with what must be some N-dimensional geometry underlying, maybe a signal to an underlying principle of emergence. Who knows?
In the long term, it's going to be Enterprise Software that eats the big losses from these systems. For all sorts of reasons, but mostly because Enterprise is where software goes to die. It's all bespoke to hell, it's all ancient, no one is working there because they want to. So a domain expert, with AI assist and a little know how, is probably going to whip up a superior set of tools in a short enough time to make it really worthwhile. Watch that space: SAP, Siemens, Teamcenter, SalesForce. Watch their consulting revenue.
Sales will be another big user of agent automations, for better or worse. Poor usage by Google to craft emails and slides for us is why the suits are getting an Anthropic sub. Stay human in the loop my friends!
I'm not sure that's true anymore considering how popular Simon's blog is
> I think this mainly demonstrates that the pelican on the bicycle has firmly exceeded its limits as a useful benchmark.
As acknowledged in the article.
https://gemini.google.com/share/55e250c99693
> Why this test? Because pelicans are hard to draw, bicycles are hard to draw, pelicans can’t ride bicycles... and there’s zero chance any AI lab would train a model for such a ridiculous task.
At this juncture I'm left wondering why competing AI labs wouldn't train for this now well known "test".
I've heard the same has happened with common benchmarks (they've ingested solutions into training data)
https://grok.com/imagine/post/8d1eab88-737f-4d46-ba92-9b6502...
Interesting that it does better at making the pelican pedal in the video generation than in image generation.
The length of the pedals keeps changing, and you'll notice that neither of the pedals actually rotates around the hub: consistent with your point about the center of gravity being too far back, the circle the pedals are making is also shifted back too far.
But there’s a lot of panicking, fear-mongering and all sorts of nonsense around this whole subject.
The thing is the creative economy is all about people’s attention and pocketbooks, it doesn’t need to be great just good enough.
When advertising agencies for example see that their copywriter can go from idea to concept with a video generator instead of engaging an animator, they’ll simply cut the middleman who used to create that animation for them and use the tool instead, even if the content isn’t as good (though the quality of this one is really pretty good, there are obvious problems). They’ll happily accept mediocrity to save money.
People will still create adverts but quality and creativity will go down and a lot of jobs are going to be suddenly displaced.
I suppose it is more the latter, and it's the artistic people who create stuff who will suffer. The ones coming up with ideas, but who previously couldn't create because they lacked skill, might win thanks to AI.
Coming up with ideas is easy, creating and putting in the effort is hard (until we had AI).
Probably the value of created stuff will go down rapidly because there will be so much of it.
And looking at the trajectory of the animation industry, I don't think increases in productivity will be used to raise the quality of the animation if the alternative is to just pay fewer animators
But yes, for anyone who does this for a living there will be obvious deficiencies, esp when you try to do something truly novel, intentional and interesting and don’t quite want what it produces.
But in this area they have made quite a lot of progress.
The half-full view is that the models are so good at finding vulns that if you plug them into your build-pipeline then the amount of new vulns introduced will go down towards zero.
The half-empty view is that we're now producing more junior-level code with less review, so everything will have more vulns; it's also cheaper and easier to find them, so prepare for chaos.
Short term there is sure to be chaos either way as the models are clearly good enough to find all the old bugs, and not everyone has the resources or will to try to stay ahead of the curve like Mozilla is trying to do with their Mythos access https://blog.mozilla.org/en/firefox/ai-security-zero-day-vul...
A threat actor with access to a better model or more money to burn on tokens may yet find more. Some of them have deep pockets, and not nearly every project will get the Glasswing treatment of free Mythos tokens.
Systemically this usually favours the offence, as they could scan my app once every 6 months whereas I'd need to do it on weekly releases.
We're most likely entering a year or two of rapid vulnerability discovery, patching, as well as reducing and minimizing system footprints just to survive the onslaught of strange vulnerabilities from e.g. ancient and widely unused kernel modules.
I met a few people at PyCon this week who have been part of Glasswing (they're just starting to be allowed to talk about it) and it really does drive down the cost of finding vulnerabilities.
I've been collecting notes on that here: https://simonwillison.net/tags/ai-security-research/
Well, a combination of that and believing that replication of test data is a good measure of progress.
At the same time failure proves little because most humans also could not manually create a correct SVG of a pelican riding a bicycle.
What is it exactly that such a test is testing?
In which situation would you measure the "competence" of a human being by asking them to write an SVG of a pelican riding a bicycle?
+ Developers are more productive, but are you all leaving work at 3pm and enjoying a newfound sense of work-life balance?
+ Companies are investing heavily in AI, yet I'm paying more for the same thing. Jamie Dimon still pays me 0% on my checking despite spending billions on AI.
It may be that simply adopting AI isn't enough. Could new startups that are born-in-AI buck this trend? I wonder what Clayton Christensen would say if he were still around.
EDIT: Schmidt's booed commencement speech was probably one of the most out-of-touch speeches (outside of a tech interview) I've heard.
I think that's probably why everyone in this thread has such different experiences - someone whose workflow is mostly asking a model for code and pasting it in would have seen modest improvement and would reasonably wonder what the fuss is about whereas someone who was already running agents on 20-step loops would have felt a much bigger shift, because the thing that used to kill those runs was the failure at step 12 cascading into garbage by step 20, and that got a lot better.
The local model story Simon kind of glosses over is interesting for the same reason - a 20GB model drawing a decent pelican on a laptop is a cute data point in isolation. The thing worth noticing is that a competent local model inside a good harness now gets you closer to frontier performance than running the frontier model without a harness does.
Then the nerf, and the massive uplift in tokens for 4.7, a model which I find lazy and prone to hallucinate.
It's probably time to try GPT-5.5. Like many I'm pretty heavily invested in the Anthropic ecosystem at this point, which I suppose gives another strong reason to make the switch.
On the other hand, this year I’ve been in the habit of using codex as a bug finder / audit layer, where it shines, and I can tell you, Opus makes a lot of mistakes, and as we all know struggles with laziness — and has gotten good at encoding that laziness into the codebase (// Per instructions, pass this test by default) where it can live for a long time. So, Opus had spoiled me, but more with its ability to sketch holistically than its ability to put out perfect codebases.
Upshot - it was good to switch horses for a while, as you mention. Slightly different skill sets there. And I still reach for claude especially for initial design. But right now the daily driver is 5.5 / xhigh fast mode, and it’s very capable.
ChatGPT 5.5 seems capable, although a bit stingy with “thinking” compared to earlier models, and I never run into session limits.
Even operations and GTM are all at "professional" level (which I think is vaguely equivalent to 5x).
> there’s zero chance any AI lab would train a model for such a ridiculous task
Well, I think this guy's tests have got enough visibility that I wouldn't be surprised if some AI models are trained on it specifically...
I'd like to remind everyone here that people on this forum used to actually code truly remarkable and pointless stuff like this, with zero LLMs, using nothing but their brains and motivation from who the heck knows where from.
It's like most people just watching a 'starting nba player' (not superstar, but just starting player) vs one that sits on the bench.
If you were to just watch them play, work out, shoot - you'd never notice the difference.
Put them head to head and it's 98-54 and you start to see the patterns.
It's pretty interesting actually, someone tell me what the 'science' for this is, I'm sure there is some kind of information theory at work here.
Software has innumerable kinds of problems at varying level of complexity and so it provides the perfect testbed for seeing how far models can go in practice.
Should add: you're very right to hint that the harness, tooling, and models tuned to both the harness and the kinds of things people do on the harness, as well as some other things, do make an enormous difference.
By and large, SOTA Codex/Claude Code are substantially better - at least for now. That may change.
Personal opinion: we need to focus more on efficiency instead of how large or complex a model can get, as that model creeps into more resource requirements. If the goal is to cost a billion dollars to operate then we've really lost the idea of what models are supposed to be achieving.
I've certainly had things that Opus fixed using some kind of work around that GPT-5.5 actually solved.
And the difference between the Sonnet/Gemini/DeepSeek tier to the Opus/GPT-5.5 tier is immediately obvious.
The big benefit of automating this for so long is that I have lots of data. I analyzed it and found that I can change the models out without much of a change in the output quality.
For one-off tasks, where there is no harness and you're just YOLOing with the TUI, yes, big difference. You need a harness.
The pipeline controls the quality far more than the model, empirically.
Because you have to adjust the harness to your problem space and provide that so you can say it is high-quality.
Many people will stop that discussion at the claude code vs. codex vs. opencode level and then merge that with discussing model performance.
And that is also why "Generate an SVG of a pelican riding a bicycle" is still a benchmark worth discussing. Because at least it is a defined problem space.
I feel like if anything people started to realise the significant limitations of LLMs when you try to use them as ‘agents’ which was the big direction LLM companies tried to push recently.
Best use of LLMs so far IMO is finding vulnerabilities (with human help) and pattern matching in other domains. For generating code and prose they are still mediocre and somewhat unreliable and for use as personal assistant agents I wouldn’t trust them.
So what’s happening with openclaw, the biggest experiment in agentic, vibe coded by the agents themselves? The thing that was so hot a few months ago.
https://github.com/openclaw/openclaw/pulse?period=daily
279 commits to main from 77 authors in the last 24 hours.
Why is there so much churn and how could you trust it with your data? These are the changes from ONE day!
If these are useful changes, surely it’d be superhuman by now given months of this pace.
What are people using this for?
Well... Now I can be that client. And let AI deal with my incomplete, always changing requirements. And get it done anyway.
The thing headline numbers ("AI made me 3x faster") hide is which 30% of the work the AI sped up and which 70% didn't move. For a solo dev the survivable bet got smaller, and that's the real change, not raw productivity. AI made certain projects worth attempting at all that wouldn't have been viable six months earlier.
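The 30%/70% split above is essentially Amdahl's law, and a quick sketch shows how much it deflates the headline. The function name `overall_speedup` is hypothetical, and the inputs (30% of the work, sped up 3x) are just the figures from the comment, not measured data.

```python
def overall_speedup(accelerated_fraction: float, factor: float) -> float:
    """Amdahl's law: whole-job speedup when only part of the work gets faster."""
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / factor)

# 30% of the work becomes 3x faster, the other 70% does not move.
print(round(overall_speedup(0.30, 3.0), 2))
```

With those inputs the whole job gets about 1.25x faster, and even an infinitely fast AI on that 30% would cap out near 1.43x, because the untouched 70% dominates.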
Waiting for the next event at this point. Hoping that "inference becomes cheap" when Groq hardware gets delivered.
Personally, the more time I spend working with coding agents the less worried I am for my career. Getting the best results out of them is really hard. They amplify existing skills and experience, so the more experience you have the better.
I wonder why there is such a mad dash to trump up the capabilities of coding agents. And why such loose terminology and lack of rigor? I thought programmers were supposed to be rational people (har har!)
I have a theory: if they were good at writing automated tests, they would have been developers instead of QA engineers.
Not saying that there aren't any high quality QA engineers, I worked with some. But LLMs raised the bar in a way that most QA engineers can't reach.
The best QA people I've worked with didn't write much code at all. You'd give them a new system and they'd find all of the bugs, testing obscure edge-cases that you'd never thought of.
In my limited experience they write test cases, test each story, do regression testing, verify bugs from customers. All by hand.
At my current job I don't want to miss them.
What are you going to tell them? Suddenly you're earning what they're earning for sitting at a desk every day?
Non SWEs (salespeople, clerks, secretaries, assistants, taxi drivers, writers, 3D modelers, artists, designers) are of course going the same way. Unless they are protected (unionized or such), why would they have sympathy for SWEs? People of our ilk are the ones causing this (to them and to ourselves). What I will tell them is to not repeat our mistake, organize and protest.
AI reduces the cost of producing software (and other intellectual tasks), which greatly improves the viability of more and more ambitious projects. As far as we know the number of problems software (and humanity) can solve is unbounded.
It feels like the market has shifted in SWE yet again to heavily prioritize a new set of skills, of which those in the top quartile are desired more than ever
Fundamentally, steering LLMs requires the same structured, logical thought process that is required to write code, regardless of abstraction level. Unlike what HN would have you believe this is not a skill that is equally distributed across the population.
But given the rapid pace at which this technology is evolving, "steering" may very well be ceded to the clankers. LLM agents are fantastic at logical reasoning & any inefficiencies relative to human experts can be circumvented by sheer compute.
Implying another country has a better model? I'm being pokey here because I'm very curious! I know Gemma is efficient, but I also remember Qwen and Kiwi being referred to as optimized. The difference being that Gemma is using fewer tokens, but maybe Qwen/Kiwi's quality is higher? I don't know.
Is there a video or audio of this talk?
Opus 4.5 hit that point in November.
They were able to one-shot famous games (like Asteroids or Pong), I suspect because they had been trained on multiple versions of those games. So like producing Harry Potter, with the right prompt it was able to produce a license-stripped version of code it had seen. I tried another arcade game, Frogger, and it failed really badly and took a lot longer; I never got it working.
The whole exercise left me feeling they have a long way to go, I don’t see how anyone could think they would replace SWE unless they didn’t look at the code produced, even now.
“You’re going to make frogger in javascript. I want a complete clone of functionality for level 1, with amazing 80s era pixel art sprites. I’m super lazy, so you’re going to have to test everything, right from the start. Pick a test harness, write the tests, including tests for having amazing graphics, gameplay, input, UI, sounds, etc, and write a full workplan, then work through that workplan, in parallel where you can. The workplan should emphasize getting a stripped down version up immediately and have workstreams for all the major requirements after that. Add a final test that assesses how fun the game is by reviewing a real video of a test run. Loop on that final test until you can’t improve things any more.”
Should produce something playable with no further input. As you say, I’m not sure it would produce a codebase we’d want to look at or work on. But, I’d be surprised if this weren’t successful.
Should be the pelican bounced off.
It is getting very good at producing code that compiles - at the algorithmic level.
This is definitely noteworthy - and the AI is crossing a critical 'productivity threshold'.
But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
Everyone has tried to get AI to 'Draw The Thing They Want' and you notice immediately how it's almost impossible to 'adjust the image' along the vector you want - because ... and this is key:
-> the AI doesn't really understand what a Duck is, its components, or fully how it made the duck <-
It just knows how to 'incant' the duck.
This becomes very clear when you try to get the AI to write proper documentation - it fails so miserably, even with direct guidance.
This is really strong evidence of how poorly the AI is generalizing, and that it is not 'understanding' rather it's 'synthesizing' from patterns.
We already kind of knew that - but we have not yet built an intuition for that until now.
Only now can we see 'how amazing the pattern synthesis' is - it's almost magic, and yet how it falls off a cliff otherwise
This has deep implications for the 'road ahead' and the kinds of things we're going to be able to do with AI.
In short: the AI is 'Wizard Level Code Helper, Researcher, and Worker' - but it very clearly lacks capabilities even one level of abstraction above the code itself.
LLMs were first trained by 'text' and now ... they are 'trained by our compilers'. Basically g++, javac, tsc are the 'Verifiable Human Rewards' in the post-training and reinforcement learning - and the AI is getting extremely good at producing 'code that compiles', but that's definitely an indirection from 'code that does what we want'.
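The gap between 'code that compiles' and 'code that does what we want' can be made concrete with a toy sketch. This is not how any lab actually trains models: `compiles_reward` and `behaviour_reward` are hypothetical names, and Python's built-in `compile()` stands in for g++/javac/tsc.

```python
def compiles_reward(source: str) -> float:
    """1.0 if the source parses and byte-compiles, else 0.0."""
    try:
        compile(source, "<candidate>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0

def behaviour_reward(source: str) -> float:
    """1.0 only if the candidate's add() returns the intended answer."""
    namespace = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        return 1.0 if namespace["add"](2, 3) == 5 else 0.0
    except Exception:
        return 0.0

# A candidate that is syntactically fine but subtracts instead of adding.
candidate = "def add(a, b):\n    return a - b\n"

print(compiles_reward(candidate), behaviour_reward(candidate))
```

The candidate earns full marks from the compile-only check and zero from the behavioural one, which is the indirection being described: a signal that is cheap to verify is not the same as the thing you wanted.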
It's astonishing that it took us all this time to internalize and start to discover what I think will be in hindsight a very obvious 'threshold' of its capabilities.
We are constantly 'amazed' at the work that it can do, and therefore over-project its capabilities.
I have no doubt that even with these limitations - the AI will unlock a lot more as it gets better - and - that it will 'creep up' the layers of abstraction of its understanding.
But I strongly believe that the AI is going to get much 'wider' (pattern matching dominance) before it gets 'higher' (intrinsic understanding) - and - that this may be a fundamental limitation.
This may be 'the LeCun' insight - when he talks about the limitations of LLMs in detail - I believe this is that insight writ large.
Even the term AI - or certainly 'AGI' may be a misleading metaphor - were we to have always called it 'Stochastic Algorithms' or something along those lines, it's possible that our intuition would be framed a bit better.
The most interesting thing is how it is definitely amazing, world changing, novel and powerful in some ways - and obviously useless in others at the same time. That's the 'threshold' we need to better understand.
That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.
The model actually has to understand what parts of a pelican and bicycle come together in something like an anatomically plausible way. That's a higher level of abstraction than something like passing the same prompt to Stable Diffusion etc
(The new Nano Banana/GPT Image 2.0 models are different though - they have significant world knowledge baked in)
No, it's not because it's seen 'anatomy' for Pelicans, Animals - even how it's represented in Animals.
If you try to get the AI to actually decompose it and start to 'draw pelicans' in very obscure ways, it will immediately fail.
Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.
Precisely because it does not understand those things.
FYI it's a slightly unfair case because it does not have a 'world model' yet, which will actually solve that problem, but even then not through very much abstracting.
We're a long way away - but in the meantime, there's lots to unpack.
Proof by existence?
https://gist.github.com/nlothian/50241d34a654fcf0caa280d4475...
Looks pretty good to me. ChatGPT in "Thinking" mode.
Edit: I've added the Opus version on the same link.
https://chatgpt.com/share/e/6a0bf28b-e198-8012-9a88-c777d965...
When it was new, sure. Right now, models can be trained on that because everybody uses it as a benchmark.
The model does not 'understand' comprehensively the relationship between anatomy, dimensionality, etc.
humm
Hmmm......
Does that suggest the uplift was only for things that are easily verifiable like code?
The hope was that good RLVR on relatively contrived datasets (like benchmarks) would be generalized to good software taste, which has somewhat succeeded but also the models fail in horrible ways still
And the hope beyond that is that good skills in fundamental problem solving tasks (coding, math) would generalize to tasks beyond math and code, which did happen but less so
For other domains I am not sure, but I've heard from people like Cal Newport that the rate of improvement outside of code and math is not as impressive.
It would support your point about the performance of 20GB local models.
Hmm, given how small the nerd community is and how often I met that task either on Hacker News, or on various Substacks, I am not so sure that the AI labs would ignore it completely.
The size of the codebase doesn't matter anymore. In fact, I am finding that the larger the codebase the better the performance. Starting from scratch with vague ambition is not the same as solving a specific stack trace over a mountain of decade-old code. The latter performs better and is also more exciting for the business. It would seem more callers = more constraints to verify against.
For the last 3 months I've felt like I've been dropping GPS-guided bombs from orbit. No one can tell the difference between AI-authored and my handwritten code, other than via the implication of the radically increased daily work volume. There's definitely AI in there, but it's like a homogeneous cybernetic blend of my work and the computer's. I own all of it, can explain all of it immediately, but I only wrote maybe 10% of it by hand.
The development team should be mostly "solved" by now with regard to the AI transformation. If you are still at Home Depot picking out your proverbial hammer, it's time to start heading for the self checkout. The rest of the business is where the real money and headlines will be made at this point. AI writing code is ancient news now. Custom harnesses that business people can use to automate workflows will print a lot more money. Bringing some bacon to the rest of the business may also help to preserve your career path in these uncertain times.
Remember what Jobs said about the customer. A lot of times, people don’t know what they want until you show it to them. Most people wouldn't have believed the iPhone was even remotely possible until the moment it was publicly revealed and made available for purchase. I am finding the same effect in the business with AI. What it can actually do when well engineered and applied to the domain will usually outperform the expectations of its users by a wide margin. All these fears about alignment, hallucinations, cost, ethics, the environment, my ego/career, etc., seem to melt away like some kind of luxurious chocolate once the performance becomes clear to the executive staff. I was able to convince the board with an unsolicited, 5 minute demo I didn't even personally deliver. I've never seen these people sign contracts so quickly.
I haven't looked into any sort of "agent" mode, just because I don't yet quite trust the AI not to do something dumb. Also, I don't use M365, where Copilot is integrated, so I suppose I would have to set it up myself.
Is the only choice to pay for the "max" plans?
Or just read so much about it that you bs your way through an interview and then use the company's resources?
Simon, I'm curious too how much you invest each month researching all the latest and greatest AI tech?
They're on par with Claude and Codex imo - when you still design architecture and know what the output should be. Claude and GPT 5.5 need less guidance with vibe coding, but we're not yet at a point where that's sustainable anyway even with those models.
Once I felt I had some confidence on what the spend rate would be, I bought $20 USD worth of credits and would occasionally point my editor at a cheap paid model for some real-time questions.
I've still only spent less than $2 in credits so far, as often a free model can answer my question fast enough.
I have not yet tried agentic coding, but at least with OpenRouter API keys it's trivial to cost-cap keys so you can pay for lower latency and still cap your spending.
Z.AI, Moonshot.AI, Xiaomi, Minimax, Alibaba all have coding plans that allow heavy usage of GLM 5.1, Kimi k2.6, Minimax M2.7, Qwen 3.6 Plus, Xiaomi MiMo v2.5 Pro for cheap.
Pair those coding plans with the harness of choice including Claude Code and you are good to go.
Also, Nvidia is offering free access to top models through NIM - but you have 40 RPM limits. https://blog.kilo.ai/p/nvidia-nim-kilo-code-free-kimi-k25
"Coding agents got really good - here, a bunch of non-relevant slop-pictures of pelicans riding bikes as a key benchmark AND a couple of hardly relevant edge-case demo-projects of mine to prove it right! "
Come on man, where is the AI writing all the code in 6 months? We're close to June and Amodei's latest statement from January does not look like it will be fulfilled over the next few weeks, does it now?
"and then you have to get a mac mini, and then, and then"
smile and nod, it pays weekly
Wake me up when we have an agent with constant learning and changing weights that I can have personally, not some LLM that can always fall prone to jailbreak and context injection attacks.
You think most of this stuff here is organic? Oh boy..
It can either help you conquer the world if you were already doing that anyway or it can make you spend your life in a cave before throwing you into a fucking volcano.
It's a winner-takes-all karma prize for being first to post the article.
This causes a rush of people to post.
HN has a mechanism by which duplicate submissions count as upvotes toward the first submission.
This is a positive feedback loop for the desire to be first, which increases duplicate submissions and in turn the karma reward.
This effect means that good blogs stay well upvoted. This isn't altogether a bad thing, but it does mean some blogs require a string of poorly received posts before that effect wears off and people no longer rush to be first.
One way to fix this would be to attribute all karma to user simonw himself (and do similar where attribution to an HN user is known).
When he resurfaced in my feeds as an AI commentator it took me quite a long while to join the dots that he was the same person!
We detached this comment from https://news.ycombinator.com/item?id=48189072 and marked it off topic.
As time progresses one now has a yard stick to measure against progress. No more excuses - show me the money baby.
How are these even graded? Qwen3.6-35B-A3B gets high marks for a pelican with a gaping hole in its bill?
edit: Just noticed its feet are disconnected from its legs as well (but right on the pedals!). Pardon my French but that's Chinese af.