

Discussion (579 Comments)Read Original on HackerNews

romaniv1 day ago
> there’s zero chance any AI lab would train a model for such a ridiculous task.

A lot of people here stated that this is a ridiculous metric, but no one seems to remember that it was introduced in the initial GPT-4 report ("Sparks of Artificial General Intelligence: Early experiments with GPT-4" [1]) by Microsoft about 3 years ago. Shortly after that it was parroted by a network of booster accounts and became a thing every clueless AI hype peddler does to "test" models.

100% marketing, 0% science.

[1] https://arxiv.org/pdf/2303.12712

godelski1 day ago
For those curious, Simon's first public usage of it is Oct 25th, 2024[0]. While I'm not aware of any specific "pelican riding a bicycle" prompts being tested in a paper[1], the GPT paper did several SVG and tikz tests and the actual image is rather arbitrary. You wouldn't want to optimize for a singular image but also if you're doing halfway decent training a pelican riding a bicycle shouldn't be too hard to draw, and well... you can see several good examples if you look through different pages on [0].

[0] https://simonwillison.net/tags/pelican-riding-a-bicycle/?pag...

[1] I'm sure there is because of Simon's fame

joe_the_userabout 23 hours ago
My own informal test when generative AI came out has been "a picture of an old man riding a bicycle over a river". I just ran it for chatgpt with the standard model I have (5.5). It shows the old man on an old bicycle with the bicycle on a slack line and the slack line extending over the river with a medieval village in the background.

The point is that the prompt has a subtle ambiguity - "how is the old man going over the river?". My sense is that most humans would quickly imagine a conventional bridge with a road on it leading over a river, with the river being in an area developed enough to allow a bridge going over it.

So the implication I draw is that these things can find/generate stuff that roughly satisfies the conditions (and are getting better at this) but they still fail to add the assumptions that people would draw.

So my conclusion is that LLMs are getting better and better at "what they do" but there are going to be places where they fail to satisfy common human assumptions.

_carbyau_about 21 hours ago
> but they still fail add the assumptions that people would draw.

I have mixed feelings about this. I agree with the default assumptions you have as to "what people would draw", however what do you want from this cognitive automation?

Do you want, "what most people would do" or do you want "something creative, an outlier, that still satisfies conditions" ?

dietr1chabout 6 hours ago
Reminds me of that dad teaching their kids programming by preparing a PB sandwich [^0].

Solvers are generally really good at bending your rules, but in a context where you want that. An outlaw rule-bending maniac is not what I want from a helpful agent.

[^0]: https://www.youtube.com/watch?v=mrmqRoRDrFg

portmanteurabout 18 hours ago
I would want to know the LLM has a reliable and realistic World Model underneath all of the next token prediction.

Whether I am building hardened engineering systems, or discussing cooking methods, or discussing sensitive health concerns, or navigating complex psychological and interpersonal issues, the model will inevitably have to make some assumptions about context I haven’t provided. I want to know that those assumptions are grounded in reality.

For what it’s worth, a slack-line over a river in front of a medieval town is too anachronistic to be interesting, let alone the idea of an old man riding a bicycle well enough over a slack-line. That is output that was not grounded in a solid world model, regardless of how “creative” it was.

godelskiabout 18 hours ago
I think the point is that language is compressed. There's a lot conveyed in very little. Yes, it is ambiguous, but that's exactly the feature that makes natural language useful. It's also why it is so much easier to speak with your friends than it is with some random person in your town, you've learned how to compress and decompress each other's language better.

But that's also why we invented formal languages like math and programming. Because there's a lot of times where we don't want ambiguity. Law is basically mankind's greatest attempt at making natural language unambiguous and it doesn't take a genius to realize that that's a shitshow and never going to happen. At the end of the day, to make natural language even relatively low in ambiguity requires a metric fuck ton more words than it would take to express via a formal language (which are also overly pedantic and verbose)

So the problem is that the AI doesn't share those expected decompression strategies. Sure, many humans won't either, but developing a shared language is essential for properly communicating with others. We've all worked with someone who feels like they're speaking a different language. It's exhausting, right?

joe_the_userabout 14 hours ago
Well, if Rene Magritte or some similar artist produces a man riding a bicycle over a tightrope, he's being creative because he knows what people expect from "a man riding a bicycle over a river", but I think the machine doesn't know the normal expectations and so it's not being creative, just failing. A splatter sheet from an industrial painting operation may look like a Jackson Pollock print. The hired painters might even notice this after their shift. But if the process that produces this is just painting tractors, it's not creative either.
Insanity1 day ago
I wonder how much the 'inflection point' is a thing vs marketing. I'm sure the models got somewhat better, but even now when I'm trying to 'vibe code' a game with the latest models (combination of Codex w/ gpt5.5 and gpt5.3-codex), they really do struggle.

They definitely get something barebones up and running, but it's far from a fully fledged application.

kvakkefly1 day ago
I remember this very clearly myself. Before opus 4.5, I was doing a lot of hand holding and was coding a lot myself, but I have not written code since that day more or less.

I did write some stuff myself just to learn how the enigma encryption machine worked, so wrote myself to learn. But professionally, I stopped coding in November.

krzyk1 day ago
It is sad. I like programming; if I couldn't do it and had to write text instead (which I do hate, I'm not a writer), it would make for quite a sad world.
bloppe1 day ago
A pattern I've settled into is to write code but leave a TODO for every narrow thing I want the LLM to do for me. Then just tell the agent to fix the todos. It's often faster and easier to give "instructions" this way
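
One way this TODO-driven workflow could be scripted, as a minimal sketch only: the `TODO(llm):` marker, the `collect_todos` and `build_prompt` helpers and the prompt wording are all assumptions made for illustration, not something the comment specifies. The idea is just to gather the narrow instructions left in the code and hand them to an agent as one task list.

```python
import re
from dataclasses import dataclass

# Hypothetical marker convention: "TODO(llm): <instruction>" inside a "#" comment.
TODO_PATTERN = re.compile(r"#\s*TODO\(llm\):\s*(?P<text>.+)")


@dataclass
class Todo:
    line_no: int  # 1-based line number of the marker
    text: str     # the instruction written after the marker


def collect_todos(source: str) -> list[Todo]:
    """Return every TODO(llm) marker found in the given source text."""
    todos = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        match = TODO_PATTERN.search(line)
        if match:
            todos.append(Todo(line_no, match.group("text").strip()))
    return todos


def build_prompt(filename: str, todos: list[Todo]) -> str:
    """Format the collected markers as a short task list for an agent."""
    lines = [f"Resolve the following TODOs in {filename}:"]
    lines += [f"- line {t.line_no}: {t.text}" for t in todos]
    return "\n".join(lines)
```

Plain `# TODO:` comments without the `(llm)` tag are deliberately ignored here, so notes meant for humans stay out of the agent's task list.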
yen2231 day ago
Nothing stopping you from doing that in a post-LLM world
satvikpendem1 day ago
Of course you can always program by hand, no one is stopping you.
AussieWog931 day ago
Exact same experience here. Prior to Opus 4.5 I'd sometimes use AI for some frontend webdev stuff (I am a C/C++/Python programmer; my HTML/CSS/JS knowledge is probably on par with a first-year uni student) and I'd have to manually edit things and retry, tell it not to attempt a paradigm that had failed before or cycle between models in Cursor just to try and get one that could make a simple widget that worked properly.

Now, I'm using Claude or Codex (GPT-5.5) for frontend and backend and it just gets it right first time more often than not. I've been making use of things like LSPs, Context7 and CLAUDE.md (global and per-repo) and it just stops doing the dumb LLM things that I hate.

viccis1 day ago
How do you justify your salary given that you're just using a tool that any of us could use for $20 an hour in your role?
rafaelmn1 day ago
How do you justify your salary given that you're just using OSS compiler/editor any of us could use for free in your role ?

AI just changed how I edit code - I still see coworkers (senior developers) failing with Claude/Codex and getting stuck when there are trivial solutions if you understand the full problem space. Right now AI is just a productivity tool.

peepee19821 day ago
I don't feel the need to justify my salary, since I'm simply lucky in that regard. But I'm pretty sure you couldn't do my job just because you had access to a coding agent. Most of my time at the office is spent discussing high-level architecture and strategy, ideas, customer requests, backward compatibility, safety, security, quality assurance, etc.

Writing the actual code is a significant part of that, but the codebase is so complex that even Opus 4.7 and GPT-5.5 struggle with it without being fed a *lot* of context and constraints. And even then, they need a *lot* of steering due to making bad decisions that only someone with an intimate knowledge of the theory behind our software is able to catch.

I can only assume that people who think coding agents can completely replace an actual developer mostly deal with trivial software regarding both scope and the type of customers they serve (individuals instead of big companies in industry).

mianos1 day ago
Never feed the trolls ... but how does my carpenter deserve $100 an hour when he is using an electric drill and power saw I can get at Home Depot for 100 bucks?

Most good developers are not employed just because they can code well.

What is over is: fizzbuzz and trivial CS algorithm regurgitation as a gate.

pastel87391 day ago
How do you justify your salary given that you sit in a chair all day, likely making the world worse, and make 5x as much as someone saving lives, building houses, or teaching kids how to read?
musebox351 day ago
Please see Ben Evans’ podcast for a good take on this. Coding is just one of the tasks you do in your job; it is not the job, or at least it probably is not. You do not get paid to code, you get paid to make a set of decisions that create value for the company. If this is automated then yes, sadly, your salary is not justified.
aspenmartin1 day ago
Someone competent using them is a requirement today, and for a while that will make the marginal utility of skilled workers greater than that of unskilled ones. The justification is that they are much more productive than they were before.
skor1 day ago
This is _the_ question we must all be able to answer, so here goes my attempt: we all have access to the same tools. Before Stack Overflow it was forums and books/manuals, so it's always been about “getting there, showing up, figuring it out”. Your hypothetical boss has other things to do than kick an LLM around at that price.
MikeNotThePope1 day ago
You can build things quickly with AI, but you can’t delegate your responsibilities to AI. Once the AI starts struggling, you’ll need to take over and figure it out.
altmanaltman1 day ago
They're using a tool that anyone can use for $20 an hour, sure. But that's not what they're "just" doing. This is what is so insane about non-technical people talking about code - writing the actual syntax is not really the hard part.

What you're saying is like "how do you justify your salary as a NASA engineer when anyone can use Simulink and generate the code?"

It is extremely ignorant.

piva001 day ago
I don't think you understand how programming as a job works, writing code is the final output of the process but it's not the job in itself.
yieldcrv1 day ago
no engineers on staff and stakeholders think the company is incompetent

Coinbase is paying the price for that for every UX glitch, after the CEO was gleeful about HR personnel shipping production code

komali21 day ago
There is no good justification for anyone's salary really, except perhaps doctors and underwater welders.
wilg1 day ago
They don't need to justify it!
bsder1 day ago
Because the tool will happily give you a "solution" that kinda works for a few inputs. It will happily correct itself when you give it more incorrect tests.

It will almost never converge on the general solution that will pass tests you haven't given it yet.

This is why AI is sooo good at Javascript and related slop. A solution that "kinda works" is good enough 9 times out of 10 and if some tests fail well ... YOLO and the web page will probably render anyway.

Contrast that to using Scheme or Lisp where AI will have trouble simply keeping the parentheses balanced.

troupo1 day ago
> Before opus 4.5, I was doing a lot of hand holding and was coding a lot myself, but I have not written code since that day more or less.

I still must hand hold it every day, as it always does things wrong. Especially after it got seriously nerfed in March.

Note: experiences vary a lot depending on the programming language used, and projects. And the experience of the person coding.

bluegatty1 day ago
Paradox - you can get multiple inflection points even as systems start to have diminishing marginal returns in core capability. I think this is due to 'threshold crossing', where something 'becomes good enough for a specific purpose' and simply unlocks capabilities.

'Nail guns' used to be heavy, required heavy power cords, and were extremely expensive. When they got lighter, cheaper, battery-powered ... at some point they blended seamlessly into the roofer's process and dramatically multiplied the work that can be done. Marginal improvements beyond that may not yield the same 'unlocks' because the threshold has been crossed.

asdff1 day ago
Nitpick but commercial roofers prefer pneumatic over battery.
smackeyacky1 day ago
This is a great analogy. Jan/Feb this year was when the models crossed from useful to essential.
magicalhippo1 day ago
I've "vibed" some non-trivial stuff lately using a combination of Codex with 5.5 and Claude Code with Opus 4.7.

Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases. I go back and forth between them on this document until we're all happy.

For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered. This becomes input to next phase.

I do check the documents, and what they're doing. I also check the tests, some more thorough. And some spot checks on the code to see if I like the structure.

I have mainly used Claude for coding and Codex for design and code review after phases. I ask both to check test coverage after phases.

Managed to implement some tools and libraries without writing a single line of code this way, which have been very beneficial to us.

Since it's so async I can work on other stuff while they plod along.

I think it's not universal though. But stuff that can be tested easily and which you have a firm grasp of what you want to achieve, but not necessarily exactly how, that I've been impressed with.

manmal1 day ago
That’s not vibing, but waterfall development.
whatshisface1 day ago
Waterfall was famous for wasting developer time and extending delivery dates in exchange for simplifying management. If Claude time is comparatively inexpensive, but human oversight remains necessary, we will switch back to waterfall because the relative importance of the two resources will invert.
magicalhippo1 day ago
It's vibing in the sense that I'm not really writing code, and I'm leaving a lot of decisions to the models. I let them drive a lot of the design document details, I just made sure it contained the salient points. Implementation plans I just skimmed. Didn't write any code, just did some checks here and there.

But yes, I did think that it sorta felt like being a team lead for some eager programmers.

nopurpose1 day ago
Do you use anything to orchestrate multiple agents pitted against each other (coder, reviewer, tester, etc)?
magicalhippo1 day ago
Currently just manual. I'm not pushing the frontier here, just getting my feet wet.

While both Claude Code and Codex are capable harnesses, I definitely think there's a lot more to be gained from the harnesses. Quite a few of the times I needed to nudge the steering wheel it was things that a separate agent with the right prompt could have picked up on.

WesolyKubeczek1 day ago
> Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases.

> For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered.

> I do check the documents, and what they're doing. I also check the tests, some more thorough.

Sounds like programming, but with extra steps.

magicalhippo1 day ago
It's software development, but with much less actual programming (in my case none).

When I said I check the documents, the initial design document was the only one I really took a hard look at. The intermediary ones I just skimmed, looking for red flags or something I had forgotten to tell them. Those documents served as a basis for their work, and as a record of what was done.

Overall I spent perhaps a few hours on each project, over the course of a few days. I'd check in every half hour or whenever I had time, tell Claude "Great, let's do the next deliverable", or GPT "We're done with phase 4, please do a detailed code review, reference the design document and documentation of previous phases". Then I'd leave them cooking.

dawnerd1 day ago
Also the least fun part of development. Maybe I’m the weird one but I like to just jump right in, planning every last detail before writing code is boring.
nothinkjustai1 day ago
None of it is non-trivial tho. You might think so, but it’s not.
magicalhippo1 day ago
It wasn't trivial in that I used a lot of my programming and domain knowledge, both when iterating on the design document and skimming implementation plans.

I didn't use it often, but when it was needed it was needed.

ryanjshaw1 day ago
I find it gets you past the starting line but when you dig into the code it’s a mess of duplicated code, muddled responsibilities, poor architecture, 10k line files that eat your tokens, etc.

I’m building something using LLMs to scrape websites/socials for unstructured event data from combined text/images and the only way I’ve managed to get 100% consistent results for a reasonable cost is to break the task down into very small pieces that reduce the scope of mistakes significantly.

At present, for reasonably complex tasks, Codex/Claude will happily code you into an expensive corner.

ben_w1 day ago
Indeed. To add to this, the obvious solution (ask the AI to break down the tasks to whatever METR says they'd be capable of 80% of the time) is of limited utility, as the AI are only so-so at estimating task complexity.

(Even when they're getting the planning part right, I do also recommend checking the LLM-generated unit tests, because in my experience some of those are "regex the source code" not "execute functions and check outputs").
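
To make the difference between the two test styles concrete, here is a minimal sketch. Everything in it is hypothetical: the `slugify` function, its assumed spec (words joined with hyphens) and the helper names are invented for illustration. The implementation is deliberately buggy, so the source-matching test passes while the behavioural test fails.

```python
import re

# Hypothetical buggy implementation, held as text so both test styles can see it.
# The bug: it joins words with "_" although the assumed spec asks for "-".
BUGGY_SOURCE = '''
def slugify(title):
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "_".join(words)
'''


def load_slugify(source: str):
    """Execute the source text and return the slugify function it defines."""
    namespace = {"re": re}
    exec(source, namespace)
    return namespace["slugify"]


def regex_the_source_test(source: str) -> bool:
    """Weak test: only checks that certain text appears in the source."""
    return bool(re.search(r"def slugify\(", source)) and ".lower()" in source


def behavioural_test(source: str) -> bool:
    """Stronger test: runs the function and compares outputs to expectations."""
    slugify = load_slugify(source)
    return slugify("Hello, World!") == "hello-world" and slugify("") == ""
```

The first check never runs the function, so it stays green for any implementation that merely contains the expected text. When reviewing generated tests, looking for an actual call to the function under test is a quick way to tell the two styles apart.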

minimaxir1 day ago
Opus 4.5 in November 2025 was legitimately, unironically an inflection point and is the sole reason for the current hysteria.

GPT 5.5 is a significant improvement over GPT 5.4 but I wouldn't call it an inflection.

baq1 day ago
5.2 and the first codex model were step function changes in capability
vikramkrabout 8 hours ago
It's very real but probably very domain specific. It got really good at a lot of traditional web dev stuff, bash, sql, and writing one off scripts to accomplish random tasks (hence all the agent stuff taking off). And they got good at staying on task. That may not translate to game dev because from what I understand a lot of these gains are basically around post training methods driven by synthetic data generation etc (with potential caveats on how synthetic that data actually is lol). I wouldn't be surprised if the areas of code the llms are good at now are straight up just product decisions of where to allocate budget for generating those synthetic data sets, and game dev stuff might not be at the top of the list because the customer base for that might not be as big
halflife1 day ago
I feel the change. It went from an autocomplete tool, to an agent running 5 tasks in parallel while I just supervise. The improvement is enormous.
orrito1 day ago
While some people got it to work better, for me vibe coding games still hasn't reached the level of regular sites/web apps. Physics, creativity, assets and UI/UX still need a lot of hand-holding with the models. Games that are more interface based, like point and click or something like Reigns, are easier though.
adgjlsfhk11 day ago
It's very real. Just in the past 2 months or so IMO there's been a pretty big improvement in Claude for local dev (although I think a lot of that is less model strength and more harness capability). 1m context is a huge difference (~30 min vs 2.5hr between compact significantly increases the scope of what I get the AI to do before it goes stupid). The other biggest difference I've noticed is a better balance of actually doing the work vs pushing back on bad ideas. I want the AI to tell me if it thinks the thing I am telling it is wrong or a bad idea, but if I confirm, I want it to do that anyway. A couple of months ago, Claude was a lot more likely to either say "This is too much work I'm not going to do all of it", tell me the idea was genius (and then pretend to do it) or something equally useless.
DeathArrow1 day ago
>1m context is a huge difference (~30 min vs 2.5hr between compact significantly increases the scope of what I get the AI to do before it goes stupid)

I think the smart zone stays within the first 100k tokens, no matter if the context window is 240k or 1 million.

I divide the work to fit within that 100k and use subagents for the tasks.
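
A rough sketch of what "dividing the work to fit within 100k" could look like in code. This is an assumption-laden illustration, not the commenter's actual setup: the 4-characters-per-token estimate is a crude heuristic, and `estimate_tokens` and `pack_tasks` are hypothetical helper names. Each resulting batch would be handed to its own subagent.

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic (an assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def pack_tasks(tasks: list[str], budget: int = 100_000) -> list[list[str]]:
    """Greedily group tasks so each batch stays within the token budget.

    A task that exceeds the budget on its own gets a batch to itself,
    since this sketch does not attempt to split individual tasks.
    """
    batches = []
    current = []
    used = 0
    for task in tasks:
        cost = estimate_tokens(task)
        # Start a new batch when adding this task would exceed the budget.
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(task)
        used += cost
    if current:
        batches.append(current)
    return batches
```

A real setup would more likely count tokens with the model's own tokenizer and leave headroom for the agent's tool output, which this sketch ignores.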

danielbln1 day ago
In my experience it's more like 400-500k tokens.
xbmcuser1 day ago
It's real for me as a non-coder. Previously, uploading a Python script and asking it to add this function or that function used to break it; now it usually just works, at least with Claude and ChatGPT models. Google Gemini still breaks stuff, but rumors are that their new flash model, which will be announced soon, is very good. I am usually working with data in CSV files and generating spreadsheets, PDFs, etc., and the results for that have improved dramatically.
Scoundreller1 day ago
That’s me. Built a scraper to dump stuff to a CSV listing images for further OCR and OpenCV processing. Now I have a convenient list of hits once I run the batch, which used to be a loooot of manual sifting.

Once I work out the kinks, I’ll be able to further automate it.

Would have taken 10-100x as long for me to build it without AI and the AI version is probably better.

But yeah, I have enough knowledge to know what prompts are needed, to figure out those “oh, I think it’s running slow or failing because of xyz” moments, and to prompt further to improve it based on what I think it should do instead.

And I know where to make slight changes without burning my allotments.

LAC-Tech1 day ago
"flash" or "fast" AI models are worse than useless at coding for me. they make my codebase much worse. It's a maintenance burden.

Gemini Pro on the other hand can be quite a pleasant experience.

ReptileMan1 day ago
Anecdata of 1 but it is real. At the end of last year they passed some invisible threshold and became useful. I don't think it is models themselves, but mostly the much more powerful harnesses and I guess their tool calling abilities.

What changed, I think, was the context harvesting capability of the models. For most programmers, debugging and figuring out how something works were the time-consuming parts; the fix was usually trivial. And now models can do in seconds what took a developer an hour or more.

If right now we create a smart grep that just gathers everything relevant to a piece of code and outlaw LLMs, we will not regress to the previous level. Developers needed this context as much as LLMs do to do their job.

iLoveOncall1 day ago
It is all marketing. The easiest way to tell is that a year ago the same people said the inflection point was X or Y model.

When people claim LLMs just don't work for them, the first question is whether they're using the latest model or not, and if not, dismissing the poster.

The thing is that that same question was being asked a year ago, and even a year before that, but with the models that lead to a dismissal today.

Just make the experiment yourself, wait 6 months, say LLMs just aren't working for the software engineering that you do, and people will dismiss you if you say that you use Opus 4.5 and not the latest model Claude MegaMind 8.8 pro max gigathinking. Despite this model being touted as the inflection point in this article.

vikramkrabout 8 hours ago
My take is there was one big inflection point around opus 4.5 when they got the agentic stuff working and now whether or not it works depends on whether your use case/area of software engineering is profitable enough for the companies to have spent a bunch of money generating synthetic data to RL on, or if it's similar enough to areas that they've done that for. With similar enough being a very loose constraint given how much overlap there is in a lot of coding fundamentals. Tbh if the models aren't working for you now I don't think they're gonna be working for you in 6 months
harshitaneja1 day ago
I think it's because both sides are talking about different things. If you go in expecting it to be good enough to make developers obsolete today (a reasonable impression to get from the way a lot of people hype it), you would be disappointed, and after the first couple of tries every few months you would probably not try it much with the next generations. Reasonable if it's considered a dichotomy.

But a lot of people excited about new generations (including me, now) are not seeing it as a dichotomy but rather as a spectrum where models are getting better, and indeed once a year, or even every 6 months at times, there comes a sudden growth which feels like an inflection point from what came before. Practically, it's a tool like any other: you evaluate it based on whether it's worth the effort and cost for the benefit you get from it, and if it is and it has a good DX, you use it. If the calculation doesn't work for you, it doesn't. For me, it has gone from a novelty, to good for some kind of quick manual search, to "I guess it can debug some kinds of errors at times in very specific conditions", to "hey, I think I am getting a bit addicted to the autocomplete in the IDE, even if I don't use it for anything intelligent; it's becoming indispensable now, but only this part", to "it's good for areas I lack expertise in", to "agentic sucks, I will stick with discussing algorithms and architecture with it on greenfield projects", to "holy shit, it can do agentic decently well now, but I am skeptical about giving it access in more than limited cases", to "I am getting close to letting it run free on my device in the not so distant future, I guess". Some of these were big jumps, and at each point I was skeptical of growth. Every time I thought the growth would slow down now: from the days of 2k context windows to millions now, from basic chat completion to working on complex adaptive systems, game theoretic modelling, heuristics and constraint modelling and other things I throw at it. I am still needed in the loop; it can be so smart at times and then will do something so stupid, but the frequency of stupidity is rapidly decreasing. I am still needed, I don't think it could accomplish alone all that it has done for me. But at times I do lie awake at night reflecting on my self worth for the potential day when I don't add that value. When I have a harder time keeping up.

Also, had someone told me even in 2019 that in 2026 we could have NLP models do what they do today, I would have dismissed it all as sci-fi, and here I am waking up in awe of the world we live in and how quickly we adapt.

iLoveOncall1 day ago
You're completely twisting what I said. I've never talked about people claiming it's not making developers obsolete. We are obviously extremely far from that. I'm talking about people who say it doesn't work to build basic features in their projects correctly.

Just take a look at this comment on a different topic, which lists all the prerequisites for those AI models to work well, from the perspective of someone who has bought into the hype: https://news.ycombinator.com/item?id=48157235

If this is everything needed for an LLM to generate acceptable code, what is even the point of them?

sofixa1 day ago
Counterpoint, I'm also vibecoding a game, and even before doing the "proper" setup (a good AGENTS.md, skills people have published for my chosen game engine, Godot), mechanically, the game was pretty spot on. It looked boring, so I used Claude Design to create a few mockups to choose from, chose the one I liked the most, and told Claude Code to redo the game UI with it.

There have been plenty of small issues like tables not having the columns aligned, or the game menu being a bit offset, or one graph being a placeholder instead of connected to the actual value. And of course I've had to instruct it on all the flavour I want.

But honestly, for a simulation strategy game, especially without doing the "proper" setup from the start, it's been _very_ good.

QuercusMaxabout 18 hours ago
UI fit and finish is really hard for these models, even with text-mode UIs. The super fiddly stuff still needs to be done by hand, at least for now.
DeathArrow1 day ago
Pure vibe coding won't work. You need to define an excellent architecture, have great specs and a solid plan, divide the plan into small phases that fit well in a context window, use TDD and automated code reviews for implementing each phase, and do QA and some code review.

At every point you need to have agents review, verify and test the other agents' output and iterate until the output is perfect.

And also, have good e2e tests.

IMO, if you don't spend at least a few tens of millions of tokens per day, you aren't doing it properly.

fluder_tw1 day ago
Sounds very self-confident to claim such a thing. Something like "If you don't do it the way I do, then you are doing it wrong".
ssdspoimdsjvv1 day ago
At what point is it easier and faster to just code it yourself? I don't trust myself to write better specs than code.
righthand1 day ago
I mean this blog post and many from this author are pure evangelism and marketing. Can you find anything critical or any dissent from this author about LLMs?
hollowturtle1 day ago
> The coding agents got really good

Ever since November 2025, the so-called "inflection point", I've been wondering for whom coding agents became "really good".

All I observe is that they got better at tool calls and at answering questions about big codebases, especially if the question has a vague pattern to search for, and they're super useful for that! For generating production code, even with a lot of steering and babysitting?

Absolutely not. Not quite there, not even close in my experience.

But we should stop talking in 1s and 0s, especially with marketing hype trains. There exists a gradient of capabilities that agents have, which really depends on the intricacies of the codebase you're working on. I think everyone has yet to discover how to better apply these tools in their day-to-day work.

But that totally collides with the current narrative, which flattens our work into something that is always the same and can be automated easily in each case. It's not!

That's why the debate is so polarized imo: there isn't a shared experience.

kstenerud1 day ago
The polarization comes from the very disparate coding experiences and output quality that different people find when using these tools.

For example, I've had the opposite experience of yours, generating very high quality work using Claude (such as https://github.com/kstenerud/yoloai). Just in dealing with all the bugs and idiosyncrasies in the technologies I'm using, the agent has been a godsend in discovering and cataloguing them so that the implementation phase doesn't keep tripping over them: https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...

And the agents keep getting better all the time. Even in the past month I've noticed a considerable jump in their ability to anticipate issues and correctly infer implications as we build out research, design, architecture and planning docs. By the time it comes to coding, it's mostly a mechanical process that can be passed off to Sonnet with a negligible defect rate.

Philip-J-Fry1 day ago
I don't want to offend (it's AI coded anyway :)) but that does not scream "high quality" to me. The headline gif on that repo just paints a terrible picture. It can't draw a box correctly, there's random underscores all over the screen. The UI itself is just incredibly incoherent. I don't even know what I'm looking at.

Like, no it doesn't seem like very high quality work... It just seems like a vibe coded tool.

Edit: yes it's wrapping Claude. It's BREAKING the TUI. Not sure what people aren't getting here...

walthamstow1 day ago
Take it up with Anthropic. It's actually their billion-dollar TUI product you're commenting on.

The problem with being such a naysayer is that you're entirely disconnected from what's going on. You haven't tried an agent like Claude Code and experienced it for yourself, so you don't recognise what it looks like when it's in front of you.

vdelpuerto1 day ago
That is the same fight the 2D animators were having with 3D animation 30 years ago. The resolution is likely to be the same: the tool wins but the fundamentals stay, and the line between competent and incompetent practitioners moves but does not disappear.
godelskiabout 19 hours ago

  > I don't want to offend (it's AI coded anyway :)) but that does not scream "high quality" to me.
Honestly, I think this is where the big divide is. People have massively different opinions on what "quality" is. Which is okay, but it feels like everyone is working under some assumption that quality is this very clear objective measure that we all agree on. Clearly we don't. We didn't before AI and well... if you can't tell that we don't with AI... you need to take a step back.

FWIW, I agree with Philip here. I don't think this screams "high quality" to me. I'm also not trying to take a shit on your project. Nothing screams "terrible" to me, but yeah, it does look a bit sloppy. There's no polish to it. It looks like someone that grades on "it works" and that's fine. But it also isn't everyone's cup of tea. Where the sloppiness comes in is like what Philip said. First thing I saw was the gif and well... I think Claude Code is sloppy. But this is also a great example of how and where LLMs visibly fail. Creating a box in text is pretty simple. There's tons of tools to do it. And the LLM 100% knows about characters like ⌜⌝⌞⌟⎜, it just doesn't use them and doesn't care. The code itself also looks very LLM generated.

It's fine and I don't think you have any reason to be ashamed of it, but I also wouldn't go around boasting that it is an example of high quality work either. And FWIW, I can't think of a single piece of heavily LLM-assisted code where I don't have similar feelings. I've seen stuff with more polish, but yeah, they feel off.

  > TUI
This is a space I feel weird in. I love the terminal. I love that there's a lot of new TUIs. But it also feels very weird because it is extremely clear that a lot of these new TUIs were written by people (or machines) that don't really have a lot of experience in the terminal itself. There's a real shared language by people like me who live in the cli. There's a reason people like me can pick up a new tool and guess certain flags and certain ways to use them. It's because of a shared design language that we know of and we end up writing that way because we know it reduces the cognitive load on our peers. But the LLMs? They don't have that shared experience.

I think this is true for a lot of stuff, not just TUIs or bash tools. Things just smell... off...

kstenerud1 day ago
You do realize that you're complaining about the Claude Code TUI, right?

That's not what this product is; merely a tool it uses.

wanderlust1231 day ago
I think at this point there is no convincing people. Clearly there is value in these tools and it generates code when steered properly. Perhaps your struggles are down to a skill issue.
timr1 day ago
While reading this thread, I literally just caught an agent putting in the following CSS selector in a rule:

> .row > div > div, .alert

This is fairly simple CSS, not multi-threaded systems development. A bar low enough that you could trip over it. I catch this kind of stuff all the time (literally every run), but only because I read every line. Most of it wouldn't be the end of the world for any particular task, but would eventually result in a complete mess.

I think the people doing the heaviest breathing around the elimination of programmers either aren't very good at programming, or they're not paying close attention. Or they're hyping their book.

kstenerud1 day ago
I haven't done any CSS/HTML/JS level work with Claude yet. I've mainly been using it for systems level stuff.

LLMs have traditionally had problems with visual rendering (the good ol' pelican on the bicycle test). I wonder if this is more of the same?

habinero1 day ago
> I think the people doing the heaviest breathing around the elimination of programmers either aren't very good at programming, or they're not paying close attention.

Yeah, absolutely. People think you're picking on, like, code formatting and no, dawg, your code doesn't do what you think it does, or it only handles the happiest of happy paths.

I do find it funny when people get mad about you critiquing their AI project. You didn't even write it, dude.

sjagauanbdvva1 day ago
Or they don’t know CSS.

Amazing how the LLM is godly with things I don’t understand, and falls over completely when it works in my domain… I wonder why that is /s

hollowturtle1 day ago
I don't want to be rough, but I'd like to read about experiences with novel ideas that solve people's real problems in the real world; your project is just about selling new shovels.

As I commented on another thread

> If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!

paulluuk1 day ago
This is a pretty wild take. What percentage of human engineers are creating novel solutions for hard problems, you think? I work in R&D and even my work is 90% doing things that other people already solved. If you are really doing cutting edge SOTA work that has never been done by another human in some form or another, then kudos to you and I want your job.
jiggawatts1 day ago
As a random example of a "hard" problem solved by AI that I couldn't have realistically done myself, despite having decades of wide industry experience:

Reverse engineering a proprietary protocol from a binary executable.

I heard about people finding security vulnerabilities in compiled code with the combination of Claude Mythos wired up to a disassembler like NSA's Ghidra. Someone here mentioned that GPT 5.5 "extra high" is just as capable, I had a problem to solve, spare token quota for the week, so... I gave it a go.

My problem was that I'm working with a product that uses a legacy 1990s style network appliance output log format that is proprietary, undocumented, and has no publicly available decoders other than an app by the same vendor, and that app has fundamental limitations. (I.e.: it's nothing like Splunk or Elastic.)

Codex with a Ghidra MCP bridge figured it all out: the framing, bit and byte packing, endian order, field names, data types, etc. It made me a neat little protocol parser in a modern language that I can use to spit out something sane like NDJSON or OTLP protobufs.

There is no way I could have reverse engineered this myself from compiled C++ code and/or packet captures! The format isn't self-describing and is incredibly dense (similar to NetFlow). In a hex viewer it looks like line noise!
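To make the terms above concrete (framing, bit and byte packing, endian order, NDJSON output), here is a minimal Python sketch. The record layout is entirely hypothetical, since the vendor's real format is undocumented and is not described in this thread; the names `parse_frames` and `to_ndjson` are also made up for illustration.

```python
import json
import struct

# Hypothetical layout (NOT the vendor's real format):
#   2-byte big-endian payload length, followed by a payload of
#   4-byte timestamp, 1 packed byte (severity in the high 4 bits,
#   facility in the low 4 bits) and a 2-byte source port.
HEADER = struct.Struct(">H")
RECORD = struct.Struct(">IBH")


def parse_frames(data: bytes):
    """Yield one dict per length-prefixed frame found in data."""
    offset = 0
    while offset + HEADER.size <= len(data):
        (length,) = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        payload = data[offset:offset + length]
        if len(payload) < length or length < RECORD.size:
            raise ValueError(f"truncated or short frame at offset {offset}")
        timestamp, packed, port = RECORD.unpack_from(payload, 0)
        yield {
            "timestamp": timestamp,
            "severity": packed >> 4,    # high nibble of the packed byte
            "facility": packed & 0x0F,  # low nibble of the packed byte
            "port": port,
        }
        offset += length


def to_ndjson(data: bytes) -> str:
    """Serialize every parsed frame as one JSON object per line."""
    return "\n".join(json.dumps(rec) for rec in parse_frames(data))
```

The hard part described in the comment is discovering the layout in the first place; once it is known, the decoder itself tends to be small.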

kstenerud1 day ago
The comment was directed at:

> For generating production code even with a lot of steering and baby sitting? Absolutely not, not quite there not even close in my experience.

As I said, this is an example of using AI successfully to produce a high quality product (one that I use every day).

But to your point: I am solving hard problems that people really have. You just don't see those because I haven't mentioned them publicly yet. And they won't be released or talked about until they're ready.

ThrowawayTestr1 day ago
Claude wrote me a little Python script to help me sort and rank all the AI videos I've generated. It also extracted the metadata and organized it into a CSV. I sent it some hex dumps of the header and it got it first try. The header structure of webms generated by comfy is pretty novel.
windexh8er1 day ago
A standard Docker container, with the container UID/GID mirrored to the host user, holding the host user's API keys, with the host user's project directory bind-mounted. The tooling doesn't even use gVisor / Kata by default which could implement the claim made, but in reality this entire project appears to be security theater.
timacles1 day ago
I’d like people to notice that those who claim this amazing AI productivity boost are always: pushing out software they don’t know how to judge the quality of and pushing projects that are 70% done. Every. Single. Time.

I use Claude all the time, it is immensely helpful. It is also very nuanced and requires a high level of expertise in a specific domain to produce quality work. Even then, that takes time and effort. Anyone saying otherwise, quite frankly, doesn’t know what they’re doing.

wickedsight1 day ago
> The polarization comes from the very disparate coding experiences and output quality that different people find when using these tools.

Not just when using tools, also when using humans. The frame of reference of what is considered 'production code' differs immensely between organizations, teams and people. The code I get from LLM's is usually much better than what I get from my peers. Maybe not one shot, but after some steering it gets there.

It also isn't lazy. When generating test cases for relatively simple pieces of code, it usually tests pretty much every path and doesn't stop right at the 80% code coverage quality gate.

I can imagine if you're at the level of Linus or something, you might conclude differently, but most people aren't there at all.

ryandrake1 day ago
> The frame of reference of what is considered 'production code' differs immensely between organizations, teams and people.

I think it’s really down to this. Nobody can agree on what counts as production-quality code. I remember joining a company with what I think (hope) most of us would call horrible quality code. It was an absolute mess, barely compiled with hundreds of warnings, and had an uncountable number of bugs. They didn’t even have a bug tracker so nobody even knew how many they had.

But the people working there already were so proud of it! None of them had ever worked for another company so they had no idea how bad their code was in comparison with the rest of the software industry (which itself is a very low bar). I told the founder we had a huge code quality problem and he looked at me like I had horns growing out of my head.

When someone says their LLM is producing “production-quality” code, actually look at it and see. Arguing about it on HN is pointless because everyone’s quality bar is different.

kstenerud1 day ago
Absolutely! I find its test generation, properly steered, to be top notch. In many ways it's like having a second head, because it'll spontaneously come up with test paths that I'd normally only get to after a month or so in one of my "aha! What about XYZ?" shower thoughts.

You'll also notice that Linus doesn't poo-poo AI at all. His only gripe is with people using it wrong, such as flooding security lists with drive-by security reports after pointing their agent to the code and saying "find me some VULNS!!1!1!!"

hollowturtle1 day ago
> The code I get from LLM's is usually much better than what I get from my peers

Then you should seriously question who you're working for imo.

> It also isn't lazy.

It is indeed lazy in my experience, as in being overly zealous when creating useless test cases and ignoring the important ones. I don't want it to test a sum; I want a test that can "guarantee" me that a further change doesn't break existing code. And producing this high quality in tests is HARD, and requires a lot of steering with agents. This culture of test code coverage is just wrong: the best codebase I worked with had code coverage only on the net percentage of code that matters, and the rest was covered by static type checking and integration tests.
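One possible reading of that distinction, as a small Python sketch. `total_price` and both tests are hypothetical and only illustrate the contrast between a test that restates the implementation and one that pins behaviour callers depend on.

```python
def total_price(items, discount_rate=0.0):
    """Hypothetical function under test: sum of price * qty, then a discount."""
    subtotal = sum(price * qty for price, qty in items)
    return round(subtotal * (1 - discount_rate), 2)


def test_trivial_sum():
    # Raises the coverage number, but only restates the implementation.
    assert total_price([(1.0, 1), (2.0, 1)]) == 3.0


def test_contract_callers_rely_on():
    # Pins behaviour that other code depends on, so a later change
    # that breaks one of these properties fails loudly.
    assert total_price([]) == 0.0                # an empty cart is valid
    assert total_price([(9.99, 3)], 1.0) == 0.0  # a full discount gives zero
    assert total_price([(0.1, 3)]) == 0.3        # rounding hides float noise
    cart = [(19.99, 2), (5.0, 1)]
    assert total_price(cart, 0.1) <= total_price(cart)  # a discount never raises the price
```

Both tests execute the same lines, so a coverage gate cannot tell them apart; only the second one says anything about what a refactor must preserve.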

dominotw1 day ago
not going to look at your vibeslop
topherhunt1 day ago
@hollowturtle I'm surprised - do you really find that sota models aren't good enough to generate production code with steering and babysitting? My experience (Claude Code, mostly Opus 4.6) is that it's fantastic at this. At least in JS + TS + Elixir + Ruby. It does indeed need babysitting, my mental model is that it's an exoskeleton not a junior dev, but IME it's a friggin badass exoskeleton, easily 10x-ing my speed on most work. Notably I do NOT --dangerously-skip-permissions nor use claude code's auto mode, I micromanage and lightly review every line it's writing as it writes it, so I rarely have more than 2 sessions generating simultaneously. I suspect that a lot of the disappointment comes in when people try to delegate to it and trust it to not go off the rails. It hasn't earned that trust from me yet (and hasn't needed to yet).

Granted, I'm mostly working in small-to-medium codebases, 20k-30k LOC incl test suite. I wonder if that's a factor in my positive experience. Curious to hear your thoughts.

hollowturtle1 day ago
It really depends on the task, but in my experience, across small-to-medium and bigger codebases, the amount of steering needed to get quality code is not worth it.

I see patterns and solutions emerging from hand coding. I don't work the other way around: I can't start with a prompt, unless again I have the feeling that the task can be one-shot with minimum effort and context.

Starting with a prompt, or in plan mode, is not how I trained as an engineer. I cannot foresee what something should be or look like until I explore it myself with code I can relate to, that I'm connected with and that I fully understand. For example, my muscle memory suggests a specific data structure only after I see some code patterns emerging. It's hard to explain; hopefully it makes sense.

If I ask the agent to do that initial exploring, even with a tremendous amount of instructions, guidelines etc., it usually starts down a path I wouldn't have started with. What I tried in such cases is to stop it, correct it and generate again, only to end up with more prompt words than lines of code. This is true for every visual task I'm working on (I program non-web UIs). Let alone doing it via spec files. If it's something I don't care about, yeah sure, maybe a little tool for entering/editing data, but alas it always defaults to slop web apps, and I get it: most of the training set is web apps.

travisgriggs1 day ago
> quality code

Probably where the mismatch is in this discussion. The measure of what is quality code is all over the place. For some, some form of "good enough" is quality. And for others, metrics like terseness, readability, vacuous amounts of comments, cleverness, various fuzzy measures of "idiomatic", etc, make "quality code" much more of a moving target.

dasil0031 day ago
I think this depends a lot on the task, the existing codebase, and the taste of the operator.

In general I tend to agree with you: if you're talking about a codebase you are deeply familiar with, the value-add from having agents write the code probably ranges from very small to negative in most cases.

On the other hand if you're trying to make changes in systems you are not familiar with, LLMs are a huge speed boost to folks with enough experience to sniff out what would be a bad path essentially via socratic method to the agent.

Obviously there are no silver bullets and no substitute for judgment. I will say though, I'll tradeoff ugly local code for good data models and interfaces any day of the week, and there is definitely an archetype of engineer that is very precious about code without good judgment on where it matters and where it doesn't.

netcan1 day ago
Coding goodness is just "unevenly distributed."

Irl, (a) different people's ways of working with ai are a million little islands and (b) bottlenecks vary enormously by coder and codebase/task.

Also... I think our era has an intrinsic bias that change=progress, productivity, etc.

Take the "networked computing revolution" of 1990-2000. These computers did land on every desk and every pocket. They are administration powerhouses. Excellent for all manner of administration tasks.

But... what this netted out to is "change." We send a lot more emails than we did letters. We communicate a ton. Secretaries went extinct. But "administration" grew.

A university faculty typically has more admins. Companies hire more accountants, HR, project managers, etc.

Maybe administration was never really a bottleneck.

Code has a lot of this. Everyone has a road map, wishlist, etc. It appears as though "code capacity" is the bottleneck. But maybe most of those companies can't really generate much more value from more software.

Anecdotally, it seems that many mid-tier shops are migrating/ modernizing their stack, and suchlike.

I haven't heard of many belting out features, and increasing prices or sales.

Most bottlenecks are upstream of another bottleneck. Few are a "dam."

ncruces1 day ago
I don't know that there was an inflection point. I know that, over the past year, they definitely became useful to me as more than auto complete.

My most recent pet project is a transpiler from Wasm to Go, and I find it incredibly impressive that recent models (I've used Sonnet, Opus and Gemini, far more successfully than GPT) are able to just pick up the project and work at all these levels:

- Go code that implements the transpiler (parsing Wasm, building an AST)

- Go code that gets generated by serializing the AST to a .go file

- Go code that manipulates the AST (to optimize it), and its effect on the generated code

- Go code that's grafted to the generated code (to implement more advanced opcodes) and how to interact with it from the AST

- C code that gets compiled to Wasm, then translated to Go, then called by Go

- Go code that gets called by this C code to implement a C stdlib

- WAT and WAST files that are used to implement the Wasm spec tests

I find this impressive because I have to think hard about all these levels, and I feel many programmers would have a problem with this.

And it's very often way easier for me to just write: "I want to generate this code, build me the AST that does it", than go "count parentheses" in the Go code (I do have some LISP experience; it's still easier).

Feel free to scrutinize/criticize the code. Not vibe coded, but plenty of GenAI help.

https://github.com/ncruces/wasm2go

Glohrischi1 day ago
I had a really fun day yesterday because Anthropic's limits on their normal $20 subscription allowed me to play around for the whole day without hitting a limit.

It's 'production' code because it's a small browser game which has very small to 0 requirements on security and being perfect but high requirements on 'ever even doing this' and 'fun'.

The code it generated hat 0 compiletime errors. I was able to descripe 10 things to do in one task and it just jugged along solving all of them.

This doesn't need to become so much better to be useful. It's already very useful for a lot of use cases, like researchers who have to verify the math anyway but are not good at writing code for filtering their test data, converting it and running it.

Small websites, fun projects, helper tools etc.

But while we speak, in the background stuff is still happening left and right. More compute, better algorithm, more RL etc.

We could already be at 95% at 'ai will take your coding job' without knowing because these 5% are so relevant.

forlorn_mammoth1 day ago
> The code it generated hat 0 compiletime errors

And no spelling errors either!

Also,

> Really? What duplication did you actually find? I count a few small ones in buildMounts and ReadPrompt, maybe 20 lines or so, but hardly anything worthy of such an epithet

>> embedding-shape 1 hour ago | root | parent | next [–]

>>The duplication I'm seeing isn't just "same text repeated" but structural duplication. Doing a quick 5 minute look again just to give you some pointers; runtime.MountSpec construction in buildMounts, Workdir vs aux-dir mount-mode handling, repeated one-off mount append blocks, overlay detection and so on, the list goes on. Just those should account for 200+ lines.

If you don't see any errors or problems, is it because there aren't any problems to see, or because they take a trained eye to spot?

hiw2d1 day ago
"We could already be at 95% at 'ai will take your coding job' without knowing because these 5% are so relevant."

This is nonsense. I'm not a SWE but a CEO; if that were true I'd be firing without a hitch. And yet this is not the activity we see. Why is that? Perhaps merely writing code is not the entire job.

Glohrischi1 day ago
I wrote coding job. And it's true for coding jobs.

Your Product Manager is not a coding job. Your Product Owner is not a coding job.

vibe-kanban exists; you could already do a proper experiment, letting your PO maintain a vibe-kanban board with proper requirements, and see how an agent progresses.

But the 5% is often enough to break it. It doesn't help much when your PM, PO, CEO or CTO have no clue about coding harnesses, coding agents, coding platforms, LLMs etc.

keybored1 day ago
CEO makes fresh account to tell someone that writing code is not the entire job? I don’t buy it.
keybored1 day ago
I don’t see how “fun projects” and “take our jobs” fit together in any voluntary sentence.
Glohrischi1 day ago
Firstly, I wrote examples but also "etc.", so it's more than just that. It is also refactoring, CI/CD pipelines and so on.

2 years ago when I prompted something, it had compile time errors left and right. Took me 3-10 iterations to even get it running.

Now it's one-shotting a lot, including websites, refactorings, etc.

The question is what is missing? How far are we from it handling huge code bases vs. smaller ones? How far are we from it comprehending the whole architecture and not trying to put a service in the wrong place just because the context is too small?

Mythos is 10 Trillion, that might be already pushing it.

95% might not be enough for someone, in the sense of "yeah, I can't do the 95% and I can't do the 5%; either the AI can do 100% or I still need Kevin with his knowledge, even if it's just for the last 5%".

Veelox1 day ago
My explanation for the lack of shared experience is that quality is very language-dependent. I work in Go and it's gotten really really good. I have to pick the right abstraction and it can be overly verbose at times but it can make in 5 minutes what would have taken me an hour.
DennisP1 day ago
Steve Yegge wrote about this in his book Vibe Coding. He says it takes about a year of experience before you're consistently getting good results. He writes about lots of different techniques for doing that, but also says a lot of it comes down to just getting a feel for when the LLM is going to go haywire.
noisy_boy1 day ago
> but also says a lot of it comes down to just getting a feel for when the LLM is going to go haywire

That has been my experience too. The days when I'm very focused, being extra deliberate and constantly questioning/examining/challenging things, the results are much better. Autopilot days just go through in a daze and the outcome is objectively worse. This has made me much more hands-on and pushed me towards models which are actually not that "clever" like codex at effort=low but fast. Given that I'm doing the meat of the thinking, might as well not be slowed down by the model and lose the flow.

voncheese1 day ago
+1 to all of this. The challenge can be staying focused and thinking when the AI assistant is (1) moving very fast and (2) often times doing multiple things at the same time.

I know I have struggled to keep up, and fall into the trap of approving things (either commands or recommendations) without taking the time to really process and think about them.

It's a bit like the age old problem of "it's super easy to ask questions, and can be super hard to answer many of them". So the economy of the conversation gets out of whack fast.

nijave1 day ago
I'd say closer to 6 months for me but probably still some room to improve.

I think getting a decent setup with a fast feedback loop for the agent combined with context (in repo markdown)+memories goes a long way.

After having Claude Code "remember" my preferences and tools, it's more efficient.

It has a tendency to copy existing patterns so a good AGENTS.md with best practices and architectural goals goes a long way to prevent it from duplicating patterns you're trying to get rid of.

hollowturtle1 day ago
It's been 4 years of using them for me, before writing a book I'd wait to have a decade of experience to share with others, otherwise it would have the same value as a book on a react tutorial
datadrivenangel1 day ago
From what I've heard this is a good assessment of Steve's book.
JeremyNT1 day ago
> Since November 2025, the so-called "inflection point", I've been wondering for whom coding agents became "really good".

I think this may depend on the sorts of work you do. For those of us who mostly live in web using established frameworks, that's about when I came to conclude they could do everything and do it well.

I can have opencode discover third party APIs and generate fully working solutions that are well integrated into an existing long-lived codebase. I still review the MRs by hand but I only ever discover spec errors or style issues, not defects in the code itself. This was a big change from ~summer 2025.

This is a really well defined space though with strong conventions. If you're doing something more interesting YMMV.

JodieBenitezabout 16 hours ago
> Absolutely not, not quite there not even close in my experience.

Well... I don't know what you expect but so far I'd like all my colleagues to write code at the level of what I get from codex.

benterix1 day ago
I believe by now we know exactly what it's good at and what it's terrible at.

The problem is our CEOs' fear of the future, which pushes them to peculiar decisions that objectively make no sense (cf. the infamous discussion of the Microsoft employee on GitHub who couldn't force their agent to do the proper thing).

It's not the first time I've witnessed this kind of discrepancy and probably not the last; I just learned to adapt to it.

rconti1 day ago
I'm moderately horrified every time claude runs the same broken, YOLO SWAG git commands from stackoverflow, gets errors, tries a few more things, then finally figures out how to commit and push correctly.
sroussey1 day ago
Long term, it can be better to slowly refactor parts of your code base into the way the model expects it to be. Sometimes fighting the gradient of code’s uniqueness vs expectation is not worth it.
psadauskas1 day ago
I first started noticing they were actually useful around Dec 2025, through about February. I got pretty good at using them, and was amazed at their utility, especially Claude and Codex. Then sometime in March, they got really frustratingly dumb. Things that they used to get right in one shot suddenly took several tries, and I had to watch them like a hawk because they constantly made stupid mistakes, not following instructions that previously worked. I had one try to fix a failing test like this:

    assert_eq x, true if x == true
Both Claude and Codex, both with the latest versions and the original versions that had been working.
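Why that line can never fail, sketched in Python under the assumption that the trailing `if` is a statement modifier (as in Ruby), so the assertion only runs when `x` is already true. The function names are hypothetical.

```python
def guarded_check(x):
    # Mirrors `assert_eq x, true if x == true`: the assertion is only
    # reached when it is already guaranteed to pass.
    if x == True:  # noqa: E712 (kept explicit to mirror the original)
        assert x == True
    return "passed"


def real_check(x):
    # The unguarded form actually fails when x is not true.
    assert x == True
    return "passed"
```

`guarded_check(False)` returns "passed" while `real_check(False)` raises `AssertionError`, so the guarded version turns the failing test green without fixing anything.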

Now I just use deepseek. It isn't any dumber, and it costs way less.

Razengan1 day ago
> Since November 2025, the so-called "inflection point", I've been wondering for whom coding agents became "really good".

You can dig up my past comments semi-arguing with simonw where I said AI just isn't good enough yet, but lately I've been using Codex mostly just to review existing Godot/GDScript code: https://github.com/InvadingOctopus/comedot

and now I'd say that in this day and age one would have to be dumb to not use AI in SOME way :)

It's helped me catch a lot of bugs that would have taken me a long time to even notice on my own. I guess it helps that the project is modular enough where most files can be considered standalone, with just 1-2 dependencies and well-commented already, so the AI can look at each file on its own one at a time. You can see the AGENTS.md I use on that repo.

Most of my productivity in the last 3 or so months has been thanks to AI, though none of the code there is AI generated. I even bought a MacBook Neo just to use as an "AI thin client" while on travel, even though I already had a beefy MacBook Pro M2 Max that I just keep at home/hotel as a desktop now. Codex's recent remote control features have made it more useful for the moments when I get a cool idea while out at a cafe or on a walk.

I don't just copy-paste the AI's output, because it's often inefficient anyway (like creating redundant variables/functions), but I find its findings useful for manually cleaning up my shit. Maybe their training data is not that good with GDScript yet which is a bit of a jank language anyway.

So my core code is wholly made by meat, but I do have fun now and then telling Codex to make experimental games using only the library of modular components I have written so far, to test my framework and also the AI's abilities. This kind of work seems like a surprisingly good match for AI: It just has to put existing blocks together, that already have well-defined interfaces/contracts etc.

I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.

Claude on the other hand, terrible: https://i.imgur.com/jYawPDY.png

Grok is OK for general stuff, never tried it for coding.

Gemini's UI/UX and lack of privacy and the AI itself is so terrible I tried it just maybe 2 times ever...and it refused to work on Google's own Flights website and reverse image search! (it told me to do it myself)

Deepseek refused to talk about Taiwan or Tiananmen Square so I'm not sure if I can trust it for anything else lol

maccard1 day ago
> I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.

I've recently tried codex, and I have it set to plan mode with 5.5 and I'm hitting the limits on a single task on a "medium" sized codebase.

Razengan1 day ago
Like I said most of my prompts cover 1-3 files at most, rarely more
hollowturtle1 day ago
Thanks for sharing your experience! I totally agree that if you "own your code", as in you're invested in it, coding it and documenting it, these tools can be really valuable for review, bug fixing and maintenance. It pushes you to do better, maybe one piece at a time like you said, with a well-modularized codebase. I think more devs should share experiences like that; we should overthrow the marketing and the narratives of people who "don't code anymore since X".
jaccola1 day ago
I set up a hook that reviews every commit and highlights potential bugs (async) and writes a report to a dir.

Then I have a script that summarises that I usually run before pushing or at end of day.

Works quite well for both improving my code and the code ai wrote.
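For anyone wanting to try something similar, here is a minimal sketch of that kind of report-and-summarise setup. It is not the actual hook described above: `review_diff`, `write_report` and `summarise` are hypothetical names, and the reviewer is a trivial placeholder that you would swap for a call to whatever model or linter you use.

```python
import json
from pathlib import Path


def review_diff(diff_text):
    """Placeholder reviewer (hypothetical): replace with a model or linter call.

    It only flags added lines that contain 'TODO' or consist of a bare 'except:'.
    """
    findings = []
    for line_no, line in enumerate(diff_text.splitlines(), start=1):
        # Only look at added lines, skipping the '+++' file header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        added = line[1:].strip()
        if "TODO" in added:
            findings.append({"diff_line": line_no, "note": "unfinished TODO"})
        if added == "except:":
            findings.append({"diff_line": line_no, "note": "bare except"})
    return findings


def write_report(report_dir, commit_sha, diff_text):
    """Review one commit's diff and write the findings as JSON into report_dir."""
    report_dir = Path(report_dir)
    report_dir.mkdir(parents=True, exist_ok=True)
    path = report_dir / f"{commit_sha}.json"
    report = {"commit": commit_sha, "findings": review_diff(diff_text)}
    path.write_text(json.dumps(report, indent=2))
    return path


def summarise(report_dir):
    """Collect every report in report_dir into a short plain-text summary."""
    lines = []
    for path in sorted(Path(report_dir).glob("*.json")):
        report = json.loads(path.read_text())
        lines.append(f"{report['commit']}: {len(report['findings'])} finding(s)")
        for finding in report["findings"]:
            lines.append(f"  diff line {finding['diff_line']}: {finding['note']}")
    return "\n".join(lines)
```

In a real setup, a git `post-commit` hook could pass the output of `git show HEAD` to `write_report` in the background, and `summarise` would be the script run before pushing.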

prettyblocks1 day ago
I'm curious. What have you actually tried? Are you just prompting the LLM with one off tasks? For good results, you need to take the time to read the documentation for the harness you are using and configure your environment. This tuning can take dozens of hours to nail down. Then there's the actual approach for working on your projects. Many people that have good results with agentic coding actually spend the bulk of their time in plan mode where they go back and forth with the LLM designing a granular playbook for the task at hand before they ever have it write any code.
hollowturtle1 day ago
I'm curious. What makes you think that me sharing an example (which one of the many?) of what I actually tried would somehow add something to the conversation? What's the usefulness of just an anecdotal example?

As I said, we have plenty of different envs, codebases, requirements. Things are complex.

You're posing it like I tried just one time. It's been hundreds of hours of tries and I just found out what works best for me, like everyone should do. My original post above isn't that hard to understand.

Let me stress this out again:

> That's why the debate is so polarized imo, there isn't a shared experience

prettyblocks1 day ago
In my experience most people with the type of critique I'm seeing from you have only tried it one time or have not taken the time to invest in an environment/process that will work for agentic coding.

My question is not so much about sharing a cherry picked example, but the question was more like "have you tried in earnest to make it work". That's the part that wasn't clear from your original post. But you say you have, and you weren't impressed. Fair enough. I'm not trying to convince you otherwise, but I encourage people to give the tools a fair chance before throwing up their hands and deciding it's meh.

Having said all that, you're right there isn't a shared experience.

newaccount6701 day ago
Good is relative. If somebody struggles with getting their hand-written code to compile, the LLM coding agents will look like geniuses to them.

An idiosyncrasy of humanity is that the dumbest individuals tend to also be the loudest.

Falimonda1 day ago
Which languages and subject matter do you work with?
hollowturtle1 day ago
c/c++, java, kotlin, go, some perl scripting, some javascript. Gaming industry
randusername1 day ago
Well, there's your problem.

F1 mechanic pops the hood of a mass-market Toyota Corolla and doesn't understand why everyone says it's really good.

A lot of us are out here building websites or phone apps.

Not to say that these things can't also be taken very seriously from first-principles, but I think that's rare.

nijave1 day ago
Have had fairly good luck with Claude Code Opus 4.7 on xhigh effort.

I think it more reliably does IaC with established patterns especially when it can do a dry run.

Python is pretty decent but usually you need good prompting and a little bit of steering to prevent slop. The slop usually works tho

Codex w/ gpt-5.5 seems faster but maybe just a bit below Opus 4.7 quality.

I gave Opus access to a repl (pyrasite-ng) in a running Python process and it managed to find an 8 year old "memory leak"--a module level cache with no eviction. It did that using GC module and exploring the heap. I was pretty happy with that outcome. It would have been quite challenging for me to find myself without at least a few weeks of deep diving into memory leak hunting docs/resources.
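As a rough illustration of what "using the GC module and exploring the heap" can look like, here is a minimal sketch using only the standard library. It is not the session described above and does not use pyrasite-ng; the function names are hypothetical. The idea is to ask the garbage collector for every object it tracks and look for unusually large containers, which is how an unbounded module-level cache tends to show up.

```python
import gc
from collections import Counter


def count_objects_by_type(limit=10):
    """Return the most common type names among objects the GC currently tracks."""
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(limit)


def largest_containers(limit=5, kinds=(dict, list, set)):
    """Return (type name, length) for the largest tracked dicts, lists and sets.

    Caveat: the GC only reports objects it tracks. On some CPython versions a
    dict holding only simple values (ints, strings) is untracked and will not
    appear here.
    """
    # Exact type match avoids calling an overridden __len__ on subclasses.
    containers = [obj for obj in gc.get_objects() if type(obj) in kinds]
    containers.sort(key=len, reverse=True)
    return [(type(obj).__name__, len(obj)) for obj in containers[:limit]]
```

Run inside the live process (for example from an attached REPL), a container whose length keeps growing between calls is a good candidate for the leak; `gc.get_referrers` can then help find which module holds it.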

eudamoniac1 day ago
I'm convinced that the polarization is that one's impression of AI has a direct 1:1 mapping with one's previous level of skill and sensitivity to quality. Most people are by definition average and they are impressed.

Is there anyone in the industry noted for their skill, quality, and taste, e.g. Jonathan Blow, who is impressed and thinks the AI is really good? I haven't seen any. In my personal circle, the best devs I know are either micromanaging or shunning AI; none of them think the agents are capable or really good. The mediocre devs I know are largely on board. This applies both online and off.

Couple this with the fact that no AI focused project has come out, not a single one, that meets a high quality bar with nontrivial complexity.

I am an AI quality sceptic. They can be useful if you don't care for quality, but I never don't care for quality. I live for quality.

ajam1507about 21 hours ago
This is contradicted by the amount of AI use at top tech firms.
eudamoniacabout 4 hours ago
No, it would be contradicted by the sentiment among devs at top tech firms, but I don't know what the sentiment is. I do know they are being forced to use AI at peril of termination such that their use of AI is a non signal.
epolanski1 day ago
> But we should stop talking about 1s and 0s

I agree, but you contradicted yourself just one line above.

> For generating production code even with a lot of steering and baby sitting? Absolutely not

Moreover this is further in contradiction with several facts:

1. the majority of this industry has always been composed of mediocre/bad developers, often unable to write a fizz buzz

2. the majority of work in this industry is implementing mundane CRUDs to move and transform trivial data across the organization's stakeholders and/or customer or third parties

3. there's lots of stellar and respected engineers leveraging the tools on a regular basis even on problems that are far from trivial and outputting quality code much faster than they would've done otherwise. Mitchell Hashimoto has blogged about it in his work on Ghostty, Sanfilippo has blogged about it in his work on Redis and so did plenty of others. I know several open source stellar developers who benefitted greatly from these tools, yet you think it cannot improve the quality and output of the most mundane tasks out there?

hollowturtle1 day ago
>> But we should stop talking about 1s and 0s

> I agree, but you contradicted yourself just one line above.

>> > For generating production code even with a lot of steering and baby sitting? Absolutely not

With this last sentence I obviously meant in my experience; it's not that hard to understand. I don't buy your facts: they are highly biased towards web development. It's a common mistake here on HN to think that's the totality of the industry; luckily it's not.

epolanski1 day ago
I've quoted you two tools (Ghostty and Redis) whose development now regularly uses AI assistance to deliver production code. I quoted those because their authors shared their experiences, the strengths and the limits of the tooling.

There's many more, from Flask to Docker, from Ruby to FastAPI or Tanstack. LLVM has integrated AI-generated PRs, so did Swift and Mojo. Sasha Levin has pushed into Linux Nvidia-related kernel changes that were authored by LLMs in 6.15. You can be certain there's an order of magnitude more where people don't admit or tag their PRs as AI generated or co-generated.

In fact I am quite confident that projects and developers that are not leveraging the tools are increasingly rare. There's really no reason in 2026 to write a non-trivial PR and not ask an AI tool for a cheap review.

The industry is changing, I don't really like the trends I'm seeing, but to state that LLMs cannot and are not writing production code, very often quality ones, (especially when used, setup and overviewed properly) is plain denial.

Your anecdotal experience isn't relevant, especially when applied to the largest parts of the industry, composed of mediocre developers working on terrible codebases.

treme1 day ago
you are experiencing reverse Dunning–Kruger effect.

For someone that just dabbled in coding prior, it went from AI building 80%, and struggling through to finish the 20% when trying to build an app/website.

now it's like 97% and struggling with last 3%. Yes it'll look rough around the edges when evaluated by a senior dev, but being able to build MVP level things to completion with ease helps you stay engaged and motivated to continue and learn.

hollowturtle1 day ago
Please do not cite Dunning–Kruger effect at random.

Who needs to generate a dumb demo of a 97% done crud app? We had code generators for those. Every time I read claims like that and ask for further explanation, I discover it's people who were not productive before generating the so-called "MVP level things to completion with ease".

If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!

LLMs can effectively validate your business idea

AussieWog931 day ago
I don't really see your point. Most problems that people have aren't really super-novel, but just extremely bespoke.

To give a specific example, 12 months ago I had a client pay me to make a Chrome plugin that changed the rows in his Shopify Products page to display Quantity and SKU.

These days you'd just one-shot it in Claude.

y0eswddl1 day ago
I'm beginning to get the sense that Sturgeon's Law is at play here and the non-crap 10% of us are arguing with the 90% for whom LLM's shitty output is actually better than what they could do on their own.

I've been lucky enough to work at places with majority intelligent engineers with similar tastes on quality to my own... but it seems that's not the norm or the case everywhere.

and it's the 90% that's most vocal. Sturgeon and D-K seem to go hand-in-hand.

jaccola1 day ago
The obvious pushback to all of the slop is: coding was never hard. Learning resources were abundant and free.

If these people had a burning desire to build things prior to LLMs and couldn’t put in the effort to learn to build them (which is also fun!) then why would they ever put the effort into anything to understand it and make it good??

jmcodes1 day ago
What would you consider a "hard" problem?
simonw1 day ago
Give me a "hard problem" and I'll give you a Codex or Claude Code transcript showing how I'd use them to tackle it.
nayroclade1 day ago
> It's since november 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".

The answer is "for lots of people, but not you".

You're doing a vague impression of being fair and even-handed, arguing for non-polarization, but underlying everything you're saying is an obvious attitude of polarizing superiority: That _your_ personal experience with AI is the real truth. That _your_ codebase is more intricate and more challenging than what other people are doing. That everyone else is being led by a "marketing hype train".

hollowturtle1 day ago
When I say

> Absolutely not, not quite there not even close in my experience.

I obviously mean in my experience, not the real truth.

> That everyone else is being led by a "marketing hype

That is obvious instead, and I later say there are no 0s or 1s; every job has its intricacies

zarzavat1 day ago
Somewhere right now some human artist is being tasked with drawing illustrations of pelicans riding bicycles to be used as training data at a big AI lab.
minimaxir1 day ago
Every modern image-generation model can generate a pelican on a bicycle trivially. The point of the test is to generate SVG text that represents an image, which is more complicated.

Yes, there are ways to convert raster images to SVG for use in training data but it's not a good use of anyone's time.

Antibabelic1 day ago
I don't understand this response. Human artists can and do make SVGs.
captn3m01 day ago
They typically use a visual editor like Inkscape with visual feedback. Nobody is hand-coding a complex SVG.
jofzar1 day ago
I wouldn't wish creating a svg pelican on a bicycle on my worst enemy
Mashimo1 day ago
> Every modern image-generation model can generate a pelican on a bicycle trivially.

Mistral seems to be the exception. Their new model from a few weeks ago is worse than self-hosted gemma.

energy1231 day ago
The quality of the Gemini pelican was such a step change in one iteration, while the other benchmarks remained quite flat, that I think you are right. Although whether they targeted Pelicans in particular or just svg, I can't say.
GistNoesis1 day ago
Last 6 months is humanity losing control of LLMs.

- Memory market cornering which mitigated the adoption of local AI despite great open model being released.

- Fast penetration of IP exfiltrating tools in companies world-wide.

- Developers producing more code than they can read.

- Autonomous agents killing Open Source by siphoning the attention economy

- Autonomous agents destroyed online communities (including HN)

- Autonomous agents being used in warfare (targeting, propaganda...)

- Widespread vulnerabilities discovered, Widespread supply chain attacks.

- Increasing inequality, fracture in perception, Green indicators, Grim realities.

sigmoid101 day ago
If you only read bad news (i.e. mass news these days since that sells better) this will be the picture. But I have personally seen some insane stuff happen in biotech. Like, I can't believe we're lucky enough to possibly live our life in this kind of future. We already have actual therapeutics developed using Alphafold being tested right now in real clinical trials, but the next generation of stuff that will go into trials in the next 3-5 years will be insane. We will look back at current medicine like we look back at medieval times today.
biophysboy1 day ago
Protein structure is not a rate-limiting step in drug discovery.
okamiueru1 day ago
AlphaFold is not an LLM. As such, it isn't a fitting example for "good news" related to LLMs.
nektro1 day ago
Alphafold isn't generative and using this as a rebuttal to OP is bad faith
rm_-rf_slash1 day ago
My mother is going on 5 years with multiple myeloma, a cancer that would have offed her in 5 months if it weren’t for advances in maintenance chemotherapy.

Medicine has done amazing things in my lifetime.

viking1231 day ago
Nothing ever happens.

See you in 10-30 years when people are still dying of the same shit as today like oesophageal cancer and glioblastoma.

Maybe in the next century but by that time you and me both will be under the ground, and no, Amodei's doubling of human lifespan simply won't happen.

nijave1 day ago
I think it's just further exposed cracks in software engineering that were always there.

Ideally we'll come out of the AI hype cycle having learned better practices.

Asraelite1 day ago
> Widespread vulnerabilities discovered

This is a good thing

evdubs1 day ago
> Widespread supply chain attacks.

This is a bad thing.

felooboolooomba1 day ago
That is a half-truth.
willis9361 day ago
Metal Gear Solid 2 was quaint and funny until 2025.
TeMPOraL1 day ago
> - Memory market cornering (...)

Wait, what? What is that?

> - Fast penetration of IP exfiltrating tools in companies world-wide.

That goes on the benefit side, I believe.

> - Autonomous agents killing Open Source by siphoning the attention economy

Anything attention economy disappearing is a "good riddance" to me.

john_strinlai1 day ago
>Wait, what? What is that?

i believe they are just saying that RAM prices went crazy

LZ_Khan1 day ago
I'm curious how the 6 months have looked from a non-programmer's perspective. What kind of co-working tools and similar optimizations have people from other fields experienced?
opto1 day ago
I am an instructor who helps deliver an apprenticeship. My new boss has been in our industry for about 20 years and is one of the most respected people in our company. They've just joined us to teach and are off doing a two week course. On the first day she was told to let AI write all of her lesson plans, and then feed the lesson plans to AI to make her slides...

Hopefully she rejects all this out of hand, but if she doesn't it'll mean that none of our trainees get the benefit of her experience, who she is as a person, and what she has to pass onto them.

We have 6 monthly reviews as instructors where we are told the same thing. "How could you use AI for your teaching?"

They don't even feel the need to justify why this would be desirable, or is needed at all. It's just pure bandwagonning. Unbelievably, most of my coworkers are extremely positive about AI, although none of them have told me they use it for anything besides preparing their lessons for them — they just use it instead of having to think, or spend time preparing...the only important thing they do at work.

It makes no sense to me.

tkgally1 day ago
I’m teaching a class at a university in Japan (on AI-related issues, as it happens). I’ve been teaching for more than 40 years, but at 106 registered students this is by far the largest class I have ever taught. AI tools are very helpful for class management, such as keeping track of attendance and homework submissions.

I have to consciously avoid using AI for more cognitive tasks, though. It would be very tempting to have Claude, ChatGPT, or Gemini summarize, classify, and grade the students’ assignments, write individual feedback, prepare my lesson plans, etc. However, I know that my engagement with the material and with the students would suffer. I also want to show the students that they are learning together with me and with each other, not with bots.

I am semiretired and have a light teaching load that gives me plenty of time to prepare for class. I can see that full-time teachers might find it hard to resist the lure of offloading their thinking to AI.

bradley131 day ago
I've been a teacher (most of the time a college professor) for...a long time. Nowadays, when preparing a new course, I definitely work with AI: "Here's what I want, and who my audience is - give me a course outline".

That gives me a starting point. Of course, I modify it. Maybe I bounce back and forth to the AI for further refinements and suggestions, but ultimately I have to be happy with the result.

When prepping the individual lessons, the biggest time saver is coming up with examples to illustrate particular points. I could do this alone, but sometimes that involves staring at a blank screen for a while. It is faster to ask the AI for suggestions, pick the one I like, and refine it further myself.

AI is a tool. Use it appropriately.

opto1 day ago
> AI is a tool. Use it appropriately

Yes, but no room is made for people who see no use for it. There is a forced-consensus that this technology is useful, which I have to combat against at work.

We teach in a very different environment, but your use sounds typical of my colleagues. "I ask it for suggestions and pick one", but nobody seems to wonder about what is lost when we shrink the horizon of what we will teach to the most likely outputs from a chatbot, one of which we will use.

Maybe this makes more sense in other fields. I have to prepare people to work in the shipping industry, in extremely dangerous roles where they will be operating heavy machinery, steering ships, driving cranes etc. The fact is that AI knows next to nothing about this field because an AI cannot experience handling a ship in rough weather, has never secured a boat to a ship's side with the rain and wind in its face.

Yet, when people are brought in to instruct our trainees, they are told to "tell AI what you want and pick one of the suggestions", in the best case, or just give over everything to the AI in the worst case. And nobody seems to be able to explain why this is a better way of working than sitting with a pen and paper, brainstorming some ideas for a lesson based on your real experiences, and then delivering it. The only justification I'm ever given is your one, "I pick from a list so I am really still in control", "it's quicker and I don't have to think as hard or as long", "it's better at making slides or writing good-sounding (to management and auditors) lesson plans". No-one ever seems to justify it by saying it is genuinely a better experience for the trainees.

generationP1 day ago
In pure maths:

- pre GPT-5.4: very limited use; some smart people got some mileage out of the models, but it always required serious work and a very suitable problem. Of course the models could solve homework problems, but that felt more like a downside to us who teach.

- since GPT-5.4 (Mar 2026): the "wow" release; suddenly answering MathOverflow-level problems that have previously been stumping experts. Still prone to hallucinations, but smart enough to use the built-in Python skill to verify its claims on small examples when possible. Probably a lot better at formula-heavy math than at the abstract "philosophical" kind.

- GPT-5.5: gave me a fascinating, significantly nontrivial and highly instructive "proof from the book" on an MO-hard problem that I'm in the process of writing up. Might have been luck and good prompting, though. Didn't really feel like a qualitative leap from 5.4, but I take quantitative any time. Still requires suitable problems, but it's much harder to rule out suitability from the get-go.

Claude and Gemini have been also-rans the whole time and still are. I use Claude for secretary-like tasks; occasionally it finds an easy proof too, but usually because I've missed something obvious.

Oh, and GPT, and to a lesser extent Claude, are great at hunting errors in maths. Probably 90% of my prompts so far have been for proofreading my writings.

vanuatu1 day ago
I work at a company that deploys AI to enterprises

The average office worker is amazed at Copilot (not in the IDE - but the app bundled with Windows), and they mostly copy paste material into their enterprise provided ChatGPT / Gemini, and get tips from Facebook / Instagram on their top 5 best prompts for work productivity

Showing them agents that automate work at scale is a very magical experience

dawnerd1 day ago
And then everyone that has to deal with their copy pasted output is too nice to say how bad it is and how much work it just offloads to the next person that’ll probably get frustrated and have an agent handle it.
conception1 day ago
Claude in Office was a tipping point for nontechnical folks around me. Everyone’s slide decks are immaculate now. Finance isn’t needing nearly as much BI help. It’s pretty impressive.
grey-area1 day ago
I find it really troubling finance are relying on LLMs (word generators!) for financial analysis - I mean I guess it means there will never be any annoying gaps in the data.
aidos1 day ago
Depends on how it’s done.

I use it a lot now for knocking up grafana charts etc. It’s not so much that the LLM is feeding the numbers through. You can still use real tools to analyse and summarise the numbers, it’s just much quicker at driving them.

As ever with data analysis, two things will continue to be true. Real insights come from spotting something that looks off and digging into it deeper. Secondly, it’s really easy to connect data in a misleading way.

I’ve had a Claude analysis handed to me this morning including a summary list of actions we’re going to take next which falls into this very trap.

The insights you’ll get from your data will only be as deep as the curiosity of the person at the helm.

Gigachad1 day ago
Can I get Claude to view the slide decks for me so I don't waste my time?
RobinL1 day ago
Interesting. I don't have to use PowerPoint much, but I hate it when I do. I don't want the llm to write the words but I do want it to make things look nice. So does this work well now?
angled1 day ago
My pipeline for this is vscode + prompts + markdown templates + GitHub copilot -> markdown docs -> pandoc to produce .docx -> copilot in word for “nice” formatting -> copilot in ppt for nice decks. LLMs all the way down.

I find it’s easier to version control and diff the .md artefacts, those remain my authoritative source.

jillesvangurp1 day ago
With a little bit of work, it works very well. You can generate powerpoint directly with Codex or Claude Cowork. There is also Canva support for these tools and it has its own AI integration. Another useful tool in this space is the Gemini integration in Google slides.

If you are a bit technical, reveal.js is actually really nice for this. I one-shotted a PDF export for it that uses a headless browser. I've used that a few times now.

What works well for me is to take an existing presentation and then some raw input and generate a new presentation in the same style as the old one from the raw input. After that, I can go in and tweak individual slides.

Another thing I did recently was take somebody's existing pitch deck and fix it with a one line prompt: "this deck is a bit meh, pimp it!" that worked unreasonably well. I like using shitty prompts like that. Codex often manages to do the right thing if you don't overthink your prompts.

It was a classic deck by somebody who used way too much text and only bullets. It did a great job on that, presenting the content in a simpler and better-structured way: pulling out key facts and highlighting those, simplifying text, etc. Doing that manually would have taken hours.

ta89031 day ago
If you don't want an LLM to write the words, surely you also want to decide on the data and graphs to show by yourself? Isn't that 90% of a presentation? The "looking nice" part doesn't matter as much, it could be black text on a white background and it would be fine.

The important part is the presentation matching your presenting cadence, which is something LLM generated presentations never get right. I don't have a problem with people generating presentations, but most of the time they just end up reading whatever is on the screen when presenting.

angled1 day ago
In business: using coworking tools to review and propose filing of emails; manage my files and folders; on a daily basis scour the intranet for interesting and relevant content.

Personal: my wife tutors in her native language to non-native primary and high school kids. They are all using these tools now to generate fresh content for practice based on school lesson plans. The kids are improving much more quickly now than they were just a few months ago.

beng-nl1 day ago
As someone who works somewhere where the intranet is a bit of a jungle: what tool do you use to scour the intranet?

Thanks!

angled1 day ago
Copilot Cowork in the M365 ecosystem. It inherits all the permissions from my account, has access to exchange to send me emails, and OneDrive to save each day’s summary for posterity and future refinement.
Antibabelic1 day ago
My day job is not in the tech industry. I am an editor. Literally nothing has changed for me in the last four years.
alexwwang1 day ago
As a former data scientist, I started using a code agent 3 months ago. Before that, I used chat completion on the web. Now I do nearly everything that outputs documents with a code agent.
BOOSTERHIDROGEN1 day ago
Can you give a sanitized example or a hypothetical scenario of what you mean by “output documents with code agents”? Thanks.
schnitzelstoat1 day ago
I’m not him, but I’ve started using them to do the analysis (SQL, Python etc.) and then output the report as Quarto HTML which can be hosted on GitHub Pages. It works well for this analysis style work.

Once I was going to send some figures to leadership so I checked the queries myself and not only had it done it correctly, but it had also included a lot of sanity checks with other places in the database which as a human I doubt I’d have had the time or inclination to do.

Even for modelling work it can be good to check your ETL queries, or write one itself and then check it etc.

Quothling1 day ago
I think Claude Cowork through the Microsoft thing which was copilot but is now named M365 (or something?) is likely creating every powerpoint presentation within our organisation at this point.

We have whatever AI is in Teams transcribe every meeting, and it's scarily good at it. It's also extremely good at summarizing or finding things from previous meetings when tasked. One disadvantage of this is that I can see how stupid I sound in writing. I'll go "yeah, hmm, yeah, that's, yeah", but it really is pretty good.

I assume we're going to see a massive increase in AI with this Cowork inside the Microsoft client. We actually have a better tool available through a librechat where you can create and configure your own agents with the same filesystem access to your one drive, and a lot more tools and models than just Claude. Almost nobody has been capable of figuring out how to use it though, so they've been using the regular office365 copilot and it sucks so bad that a lot of people stopped believing in AI.

It's ironic that Microsoft fumbling the ball on AI, but being very good at enterprise customers (especially non-IT) means that they'll likely be the company which is going to sell us AI tools that people will actually use. I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund. It's quite literally a copy of ChatGPT where you can point-and-click configure an agent, but we're seeing that even employees who use a lot of ChatGPT privately don't use this tool professionally. Meanwhile everyone has been capable of using the Microsoft thing (that I personally think is less user friendly since you will need to add your configuration files to every prompt).

piokoch1 day ago
"I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund"

That's because M365 is integrated with the whole Office/Exchange environment, especially in terms of security policies, etc. MS also guarantee that the data are private, this is very important for many companies both from the IP protection perspective and the liability to expose some users/customers data (think of GDPR regulations in Europe).

I don't know who is behind Librechat, probably some good and friendly folks, but when it comes to privacy/security Microsoft has much more to lose and if shit happens it is easier to sue them than some random VC-financed company from the USA.

Havoc1 day ago
At work the tools handed to most are still essentially chatbots. Getting access to coding tools is an uphill battle because there isn’t really a good way to manage risk yet. Hard enough to keep a coding agent in check locally and ensure it doesn’t rm -rf anything. Scale that to thousands of people with limited skill and it doesn’t really work. So currently they just don’t.

That’s in a finance shop. I’d imagine it’s different in programming shops where handing people Claude code is a bit more plausible

TrackerFF1 day ago
Purely anecdotal, but in my team of 20 data analysts, we've seen a bunch of them become quite productive in producing tools and apps. These are analysts with mostly domain knowledge, and not so much programming knowledge - meaning that they knew the basics to write scripts, and wrangle data programmatically, but not enough to actually engage in software engineering.

Some of these are now contributors.

I also have a friend (beware, N=1 study) with zero prior programming knowledge that has released his first app.

cold_harbor1 day ago
for non-coders: local AI. a couple years ago you needed a dedicated GPU rig. now a 30B model fits on a laptop and runs offline.
lopsotronicabout 24 hours ago
I've always been a "power user", making little python programs and figuring out new ways to do things with seemingly unrelated systems. My knowledge is shallow, but very broad.

A year and a few jobs ago I was genuinely up against a wall I could not see breaking through, not if I wanted to ever sleep again. Hundreds of completely bespoke customers. Hideous archaic tooling. Two of us. It was bad times. So I started paying for Claude - desperation move, to try and vibe my way out. Honestly, it's been a little bit like having superpowers.

Not just code generation, which has been great, but gaining knowledge and understanding with incredible velocity - sort of like how RSS felt back in the day, or when Google stopped being worthless in the very end of the 20th C. When Wikipedia started.

So where am I now? Well, I ditched the hell job (I didn't really drink the koolaid of their "Enterprise Solution" anyway), and got a regular day job in my core competency. I guess I do a lot of what is called "vibe coding", all kinds of utilities, what I call my "extracurriculars". A graph view for Asciidoc in VSC to show includes, xrefs, partial includes. Graph view for everything actually - it's surprisingly insightful for PDM and config management. Analysis tools for sensor faults based on Python open source astronomy tools. All sorts of converters and aggregators and cleaners for a devil's piss bucket of enterprise systems. A bazillion new MapTools macros for gaming, making complex RPG systems nearly pushbutton. A little harvest of local LLM systems doing all sorts of things, like my "Reviewinator" for copy edit. I could type the rest of the day and wouldn't come close to the end of the list.

So, pretty amazing. Very interesting systems with what must be some N-dimensional geometry underlying, maybe a signal to an underlying principle of emergence. Who knows?

In the long term, it's going to be Enterprise Software that eats the big losses from these systems. For all sorts of reasons, but mostly because Enterprise is where software goes to die. It's all bespoke to hell, it's all ancient, no one is working there because they want to. So a domain expert, with AI assist and a little know how, is probably going to whip up a superior set of tools in a short enough time to make it really worthwhile. Watch that space: SAP, Siemens, Teamcenter, SalesForce. Watch their consulting revenue.

verdverm1 day ago
They lag behind because we build for ourselves first. We are rolling out Claude to the biz team this week and they will get access to Cowork, which is still preview aiui.

Sales will be another big user of agent automations, for better or worse. Poor usage by Google to craft emails and slides for us is why the suits are getting an Anthropic sub. Stay human in the loop my friends!

shepherdjerred1 day ago
> and there’s zero chance any AI lab would train a model for such a ridiculous task.

I'm not sure that's true anymore considering how popular Simon's blog is

_puk1 day ago
> So maybe the AI labs have been paying attention after all!

> I think this mainly demonstrates that the pelican on the bicycle has firmly exceeded its limits as a useful benchmark.

As acknowledged in the article.

kzrdude1 day ago
Gemini 3.1 basically takes it home on that benchmark, anyway, it's done.
sunaookami1 day ago
Gemini is heavily benchmaxxed and sucks in agentic coding so no surprise.
nickvec1 day ago
Simon mentions further along in his article that, given Jeff Dean’s post referencing the pelican-riding-a-bike task (and how good current models are at doing it), it’s no longer a great benchmark to use. Enter the opossum riding an e-scooter!
aaronbrethorst1 day ago
Banana man on the Segway
simonw1 day ago
That bit probably works better in the talk, it was a setup for a joke later on.
muzani1 day ago
It's practically a benchmark now. Some friends have been specifically training models to count the R's in "strawberry"
jimbobthemighty1 day ago
I asked Gemini for a video of 'pelican riding a unicycle in hyde park' - I was blown away by the output:

https://gemini.google.com/share/55e250c99693

swed4201 day ago
According to OP:

> Why this test? Because pelicans are hard to draw, bicycles are hard to draw, pelicans can’t ride bicycles... and there’s zero chance any AI lab would train a model for such a ridiculous task.

At this juncture I'm left wondering why competing AI labs wouldn't train for this now well known "test".

nijave1 day ago
Given their proclivity to scrape the entire contents of the internet, it's only a matter of time, intentional or otherwise.

I've heard the same has happened with common benchmarks (they've ingested solutions into training data)

IdiotSavage1 day ago
Graphically perfect, but content-wise nonsense. The pelican's center of gravity is clearly behind the wheel. It needs to be above or very slightly ahead of the wheel.
horsawlarway1 day ago
I don't think it's graphically perfect either.

The length of the pedals keeps changing, and you'll notice that neither of the pedals actually rotates around the hub: consistent with your point about the center of gravity being too far back, the circle the pedals are making is also shifted back too far.

navane1 day ago
Oh those pedals go all over the place indeed
ciberado1 day ago
Still impressed. And, to be honest, I don't think this problem matters much. Physical accuracy is very nice, but it's not the most important aspect when I watch a fantasy movie, for example. Or even a scifi one.
djeastm1 day ago
Maybe the pelican has something heavy in its mouth.
mycall1 day ago
I do hope that JEPA can help resolve the nonsense from AI models.
sfdlkj3jk342a1 day ago
I'm surprised by Grok as well:

https://grok.com/imagine/post/8d1eab88-737f-4d46-ba92-9b6502...

Interesting that it does better at making the pelican pedal in the video generation than in image generation.

nijave1 day ago
Google/Gemini has pretty impressive audio visual capabilities. I tried to have Claude add mulch to a landscape picture and it looked like someone hit it with the orange spray paint tool in MS Paint. Nano Banana actually produced something fairly realistic
grey-area1 day ago
That’s really impressive, and slightly worrying for creatives involved in film, animation or modelling.
notachatbot1231 day ago
Even more worrying are the implications for fakenews, propaganda, fraud, deception and mental health.
sevenzero1 day ago
This is really my biggest worry when it gets to consumer AI. People already have a hard time informing themselves properly. Now we have technology that just boosts the already existing confirmation bias people have. It's sickening.
dzhiurgis1 day ago
Maybe short term yes. But longer term people will finally put their guard up against deception that’s been around for decades.
drdaeman1 day ago
It’s the opposite, non-creatives (if such roles even exist in those industries) should be worried. All those models offset technical skills, allowing to get from idea to implementation through a different route (which can be easier or harder depending on idea and model - good luck tweaking that pelican’s exact pose and movements to match your imagination precisely). Nothing touches creativity, not even in the slightest.

But there’s a lot of panicking, fear-mongering and all sorts of nonsense around this whole subject.

Retric1 day ago
My mother has started watching 100% AI generated stories on YouTube. They are good enough to be entertaining even if they include random errors like messing up the main character’s name.

The thing is the creative economy is all about people’s attention and pocketbooks, it doesn’t need to be great just good enough.

colinb1 day ago
The truly excellent weavers will be fine?
grey-area1 day ago
That’s really not how this is going to play out.

When advertising agencies for example see that their copywriter can go from idea to concept with a video generator instead of engaging an animator, they’ll simply cut the middleman who used to create that animation for them and use the tool instead, even if the content isn’t as good (though the quality of this one is really pretty good, there are obvious problems). They’ll happily accept mediocrity to save money.

People will still create adverts but quality and creativity will go down and a lot of jobs are going to be suddenly displaced.

flakeoil1 day ago
Does "creative" mean that you are creative at coming up with ideas or does it mean that you are artistic and can create stuff?

I suppose it is more the latter, and it's the artistic people who create stuff who will suffer. The ones coming up with ideas, but who previously couldn't create because they lacked skill, might win thanks to AI.

Coming up with ideas is easy, creating and putting in the effort is hard (until we had AI).

Probably the value of created stuff will go down rapidly because there will be so much of it.

AussieWog931 day ago
I wouldn't be that concerned that animation is going anywhere. Both outputs look really off, especially around the feet.
wongarsu1 day ago
In a serious creative tool you would also want a lot more creative input. At a minimum the ability to steer the animation with skeletons that feed into a control net, or something like that. And the ability to control the look and feel and create much more consistent characters. Both things that exist in good tooling, but both things that create work that will keep animators employed. But it will dramatically reduce the number of animators needed to reach a given level of "good enough".

And looking at the trajectory of the animation industry, I don't think increases in productivity will be used to raise the quality of the animation if the alternative is to just pay fewer animators

grey-area1 day ago
Yes sure if you look closely it’s slop, but a huge number of companies and advertisers just don’t care (and they feel the same about their social media content, blogs and yes code) - they will attempt to cut corners where they can to the detriment of true artists.

But yes, for anyone who does this for a living there will be obvious deficiencies, esp when you try to do something truly novel, intentional and interesting and don’t quite want what it produces.

But in this area they have made quite a lot of progress.

hackable_sand1 day ago
It's really not
ionwake1 day ago
only SVG counts tho, don't know why
falcor841 day ago
Willison chose this task because (unlike actual images of pelicans) it was clearly not in training data, but could be reasoned about and composed from what's there. But just like those "how many golf balls can you fit in a 747?" interview questions, it should now be retired.
ionwake1 day ago
Thank you for the reply. Would something like a Squirrel flying a hangglider as an SVG be a good new test? Or would that be indirectly in the training data too?
simonw1 day ago
It's a test of text-based LLMs to see how good they are at SVG geometry. Video models are a different category of software entirely.
tptacek1 day ago
If you're a vulnerability researcher or a security person generally, there's a big inflection point from Spring of this year.
gnyman1 day ago
If it turns out to be a good change or not is to be seen.

The half-full view is that the models are so good at finding vulns that if you plug them into your build-pipeline then the amount of new vulns introduced will go down towards zero.

The half-empty view is that we're now producing more junior-level code with less review, so everything will have more vulns. It's also cheaper and easier to find them, so prepare for chaos.

Short term there is sure to be chaos either way as the models are clearly good enough to find all the old bugs, and not everyone has the resources or will to try to stay ahead of the curve like Mozilla is trying to do with their Mythos access https://blog.mozilla.org/en/firefox/ai-security-zero-day-vul...

muvlon1 day ago
There's a major caveat to the half-full view: You'll only stop adding new vulns that your model can find.

A threat actor with access to a better model or more money to burn on tokens may yet find more. Some of them have deep pockets, and not nearly every project will get the Glasswing treatment of free Mythos tokens.

spacebanana71 day ago
There's an interesting economic contest here as well - is it more sustainable for a malware group to spend $500 in tokens looking for an issue in my app? or for me to spend $500 scanning for issues on every deployment?

Systemically this usually favours the offence, as they could scan my app once every 6 months whereas I'd need to do it on weekly releases.
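The asymmetry can be put into rough numbers. This is only a back-of-the-envelope sketch using the figures from the comment above ($500 per scan, weekly releases, one attacker scan every 6 months); the 26-week half-year and the helper name `scan_spend` are assumptions made for illustration.

```python
# Figures from the comment above; everything else is an illustrative assumption.
COST_PER_SCAN_USD = 500
WEEKS_PER_HALF_YEAR = 26  # assumed length of "6 months" in weekly releases


def scan_spend(scans: int, cost_per_scan: int = COST_PER_SCAN_USD) -> int:
    """Total token spend for a given number of scans."""
    return scans * cost_per_scan


defender_spend = scan_spend(WEEKS_PER_HALF_YEAR)  # one scan per weekly release
attacker_spend = scan_spend(1)                    # one scan per six months
ratio = defender_spend / attacker_spend

print(f"defender: ${defender_spend}, attacker: ${attacker_spend}, ratio: {ratio:.0f}x")
```

Under these assumptions the defender pays 26 times what a single attacker pays over the same period, though the defender's scans also cover every attacker at once, which the sketch does not model.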

jxmesth1 day ago
I'm a security person and would love to hear other people's input here as I don't have that much experience with this
thierrydamiba1 day ago
Can you be more specific?
tetha1 day ago
Three deterministic Linux LPEs in a week, an LPE in BSD in execve (of all things...), nginx vulnerabilities, one or two new gnarly supply chain attacks. Linus noting that the linux-security mailing list is getting flooded with duplicated, AI-driven reports of varying quality. There are pretty crazy keycloak vulnerabilities getting discovered.

We're most likely entering a year or two of rapid vulnerability discovery and patching, as well as reducing and minimizing system footprints just to survive the onslaught of strange vulnerabilities from e.g. ancient and widely unused kernel modules.

simonw1 day ago
The Claude Mythos / Project Glasswing thing is real: https://www.anthropic.com/glasswing

I met a few people at PyCon this week who have been part of Glasswing (they're just starting to be allowed to talk about it) and it really does drive down the cost of finding vulnerabilities.

I've been collecting notes on that here: https://simonwillison.net/tags/ai-security-research/

krzyk1 day ago
People in my company sounded underwhelmed by it. It was usually finding issues by not understanding deployment (or not being fed that info).
Gigachad1 day ago
Wouldn't it drive up the cost of finding vulnerabilities when all the low hanging fruit has already been scanned and patched? Like the new baseline for finding a vulnerability will be something an LLM couldn't find.
tptacek1 day ago
Broadly, I'm talking about the shift from building elaborate vulnerability research harnesses towards using the frontier models and their RL-optimized harnesses to build simpler vulnerability discovery pipelines. And then: the ensuing carnage.
baq1 day ago
Not op but just look at HN posts in the last couple weeks: supply chain worms, zero-day LPEs for all OSes seemingly every other day, researchers on X and here openly saying they’ve got more valid findings than they know what to do with
nickvec1 day ago
Are you referring to Claude Mythos?
pineapple_opus1 day ago
All I see is mention of how various models generate image of "pelican riding bicycle(s)"
emil-lp1 day ago
Yes, the "pelican riding a bicycle" is the ultimate test of not understanding how LLMs work.

Well, a combination of that and believing that replication of test data is a good measure of progress.

vessenes1 day ago
Spicy — why does it show ultimate non-understanding?
JohnKemeny1 day ago
because success comes from reproducing a memorized pattern rather than transferable reasoning?

At the same time failure proves little because most humans also could not manually create a correct SVG of a pelican riding a bicycle.

What is it exactly that such a test is testing?

In which situation would you measure the "competence" of a human being by asking them to write an SVG of a pelican riding a bicycle?

ClikeX1 day ago
We all know the true test of AI is Will Smith eating spaghetti.
pr337h4m1 day ago
Something that’s largely been ignored: DeepSeek has made context caching virtually free with V4-Flash.
hootz1 day ago
I swear to god that DeepSeek V4-Flash is the most useful model available right now. It's SO FAST and is good enough for so many tasks that I run it most of the time for almost everything. Even when it messes up, it's so cheap to iterate that I can fix most problems without changing the model to a more "capable" one.
chrisss3951 day ago
How much of what is being generated by LLMs is actually value add? My perception is there are lots of great experiments, but little real value.

+ Developers are more productive, but are you all leaving work at 3p and enjoying a new found sense of work-life balance?

+ Companies are investing heavily in AI, yet I'm paying more for the same thing. Jamie Dimon still pays me 0% on my checking despite spending billions on AI.

It may be that simply adopting AI isn't enough. Could new startups that are born-in-AI buck this trend? I wonder what Clayton Christensen would say if he were still around.

ben8bit1 day ago
The tooling has become so good though - the eco-system around the LLM. The models have become really good, yes - but progress has definitely slowed in my opinion. The tooling is what really has become great - "harness" is probably the best word. When folk like Elon/Schmidt/Thiel/etc. talk about singularities and industrial revolutions - it sounds extremely out of touch - or actually protective of the massive capex they've potentially sunk.

EDIT: Schmidt's booed commencement speech was probably one of the most out-of-touch speeches (outside of a tech interview) I've heard.

sharperguy1 day ago
Much of the recent improvement in models is in being trained specifically to make use of the tools the harnesses give them.
ivandotcodesabout 23 hours ago
Reading through the thread, a lot of the inflection point debate seems to come down to people talking past each other about what got better. My read is that the models themselves didn't really jump in capability around November, but the harnesses around them got considerably more reliable, and the RLVR work earlier in 2025 had been training the models specifically to behave well inside those harnesses, so when the two met you got a compounding effect that felt like a step change even though neither piece was that dramatic on its own.

I think that's probably why everyone in this thread has such different experiences - someone whose workflow is mostly asking a model for code and pasting it in would have seen modest improvement and would reasonably wonder what the fuss is about whereas someone who was already running agents on 20-step loops would have felt a much bigger shift, because the thing that used to kill those runs was the failure at step 12 cascading into garbage by step 20, and that got a lot better.
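The cascading-failure point can be illustrated with a toy model. This is a sketch under loud assumptions: each step succeeds independently with the same probability, and a single failed step ruins the rest of the run. The 0.95 and 0.99 reliabilities are made-up illustrative values, not measurements from any model or harness.

```python
def chain_success(per_step_reliability: float, steps: int) -> float:
    """Probability that every step of a run succeeds.

    Assumes independent steps and that any single failure ruins the run.
    """
    return per_step_reliability ** steps


# Illustrative per-step reliabilities (assumed, not measured).
for p in (0.95, 0.99):
    print(f"per-step {p:.2f} -> 20-step run {chain_success(p, 20):.3f}")
```

In this toy model a small per-step improvement moves a 20-step run from succeeding roughly a third of the time to roughly four times in five, while a one-shot "ask for code and paste it" workflow barely notices the same change.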

The local model story Simon kind of glosses over is interesting for the same reason - a 20GB model drawing a decent pelican on a laptop is a cute data point in isolation. The thing worth noticing is that a competent local model inside a good harness now gets you closer to frontier performance than running the frontier model without a harness does.

throwaway20271 day ago
December 2025 was the breakthrough for me. January Claude was euphoric, ChatGPT was up there. February Gemini cooked for a second there. March amazing. April the big bad nerf. May GPT 5.5 is just pure bliss, although with 2x limits temporarily. Not sure about Claude: it's sort of okay, still not as good as it felt before, slowly increasing limits with more compute and rebuilding good will.
dmpk2k1 day ago
I find your emotional language truly quite fascinating. I've heard people talk like that about drugs.
OvervCW1 day ago
I actually thought it was a joke comment, but I'm worried now that it's not the case.
wilg1 day ago
Similarly, I've heard people talk like that about things that are not drugs.
sph1 day ago
You can get a dopamine rush from anything, from drugs to using LLMs.
_puk1 day ago
I think Opus 4.6 at its peak was the "how can anyone not get that this is good" for me.

Then the nerf, and the massive uplift in tokens for 4.7, a model which I find lazy and prone to hallucinate.

It's probably time to try GPT5.5. Like many I'm pretty heavily invested in the anthropic ecosystem at this point, which I suppose gives another strong reason to make the switch.

vessenes1 day ago
The openclaw ban pushed me over to 5.5 for some daily usage. I feel like Opus and 5.5 are good at very different things. 5.5 can be too literal, and it does not have as much of a ‘creative’ bent whether that’s toward design, UI/UX, interpreting vague instructions, etc. So, in that way, Opus had sort of spoiled me.

On the other hand, this year I’ve been in the habit of using codex as a bug finder / audit layer, where it shines, and I can tell you, Opus makes a lot of mistakes, and as we all know struggles with laziness — and has gotten good at encoding that laziness into the codebase (// Per instructions, pass this test by default) where it can live for a long time. So, Opus had spoiled me, but more with its ability to sketch holistically than its ability to put out perfect codebases.

Upshot - it was good to switch horses for a while, as you mention. Slightly different skill sets there. And I still reach for claude especially for initial design. But right now the daily driver is 5.5 / xhigh fast mode, and it’s very capable.

Arn_Thor1 day ago
I was a dedicated Claude user but in March/April I started using GPT5.5 on a new project that Claude had tried and failed to execute successfully. GPT knocked it out of the park, and was able to do it within my subscription allocation of tokens. I'd recommend giving it a go at least. Something like OpenClaude can let you use the Claude tools you're used to
ant6n1 day ago
I only used Claude first time in April, previously only ChatGPT and Gemini. And I struggle to see what the hype is all about - yes it seems a tiny bit smarter than the pack, but on the 20$ subscription it runs out of tokens in 5-20 minutes, and then you need to wait 3-4h.

ChatGPT 5.5 seems capable, although a bit stingy with “thinking” compared to earlier models, and I never run into session limits.

_puk1 day ago
I couldn't imagine using CC on the basic tier!

Even operations and GTM are all at "professional" level (which I think is vaguely equivalent to 5x).

eloisant1 day ago
About Pelicans on bicycles:

> there’s zero chance any AI lab would train a model for such a ridiculous task

Well, I think this guy's tests have got enough visibility that I wouldn't be surprised if some AI models are trained on it specifically...

shantnutiwari1 day ago
yeah, simon's blogs have been on the front page multiple times now, I wouldn't be surprised if all of them added a special case for it
Shocka11 day ago
> One of my projects was a vibe-coded implementation of JavaScript in Python—a loose port of MicroQuickJS—which I called micro-javascript. You can try it out in your browser in this playground.

I'd like to remind everyone here that people on this forum used to actually code truly remarkable and pointless stuff like this, with zero LLMs, using nothing but their brains and motivation from who the heck knows where.

rTX5CMRXIfFG1 day ago
Am I crazy, or are these differences between the best models so marginal that you’d get roughly the same performance if you use the same high-quality harness (ie preloaded instructions from md files, including custom skills)?
bluegatty1 day ago
You will immediately notice the difference if you use it at the threshold.

It's like most people just watching a 'starting nba player' (not superstar, but just starting player) vs one that sits on the bench.

If you were to just watch them play, work out, shoot - you'd never notice the difference.

Put them head to head and it's 98-54 and you start to see the patterns.

It's pretty interesting actually, someone tell me what the 'science' for this is, I'm sure there is some kind of information theory at work here.

Software has innumerable kinds of problems at varying level of complexity and so it provides the perfect testbed for seeing how far models can go in practice.

Should add: you're very right to hint that harness, tooling, and models tuned to both the harness and the kinds of things people do on the harness, as well as some other things, do make an enormous difference.

By and large, SOTA Codex/Claude Code are substantially better - at least for now. That may change.

dnnddidiej1 day ago
Head to head is interesting. I had not tried 2 agents on the same task simultaneously with 2 models.
Sparkyte1 day ago
No you're not wrong. Many people will see what you see. Enthusiasts will see it as monumental squeezing out that last drop of performance. In my opinion I think it is okay for enthusiasts to feel that way. I'm just satisfied with getting a tool as an aid.

Personal opinion: we need to focus more on efficiency instead of how large or complex a model can get, as that model creeps into more resource requirements. If the goal is to cost a billion dollars to operate, then we've really lost the idea of what models are supposed to be achieving.

raincole1 day ago
By definition the differences between "best models" are small. It's tautology. If a model is significantly dumber than the others then it's not one of the best models.
minimaxir1 day ago
To an extent. I've had GPT 5.5 solve problems that Opus 4.7 struggled with, using an identical AGENTS.md/CLAUDE.md and no skills.
nl1 day ago
The difference is very noticeable as your codebase gets bigger and you give higher and higher level tasks.

I've certainly had things that Opus fixed using some kind of work around that GPT-5.5 actually solved.

And the difference between the Sonnet/Gemini/DeepSeek tier to the Opus/GPT-5.5 tier is immediately obvious.

mrothroc1 day ago
I have the same experience. I've been running sequential agents in my own harness that is a standard SDLC pipeline (plan, design, code, build, test). It has gates between each stage to control quality.

The big benefit of automating this for so long is that I have lots of data. I analyzed it and found that I can change the models out without much of a change in the output quality.

For one-off tasks, where there is no harness and you're just YOLOing with the TUI, yes, big difference. You need a harness.

The pipeline controls the quality far more than the model, empirically.
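For readers who haven't built one, a gated sequential pipeline of this shape might look roughly like the sketch below. It is not the commenter's harness: the stage names come from the comment, but `run_pipeline`, `make_stage`, `make_gate`, the retry policy, and the placeholder stage bodies (standing in for agent calls) are all hypothetical.

```python
from typing import Callable

# A stage takes the artifact so far and returns an updated artifact.
Stage = Callable[[dict], dict]
# A gate inspects the artifact and returns True if it may proceed.
Gate = Callable[[dict], bool]


def run_pipeline(stages: list[tuple[str, Stage, Gate]],
                 artifact: dict,
                 max_retries: int = 2) -> dict:
    """Run stages in order; re-run a stage when its gate rejects the output."""
    for name, stage, gate in stages:
        for _attempt in range(max_retries + 1):
            candidate = stage(dict(artifact))  # work on a copy
            if gate(candidate):
                artifact = candidate
                break
        else:
            # No attempt passed the gate: stop instead of passing garbage on.
            raise RuntimeError(f"stage {name!r} failed its gate")
    return artifact


def make_stage(name: str) -> Stage:
    def stage(artifact: dict) -> dict:
        artifact[name] = f"{name} output"  # placeholder for an agent call
        return artifact
    return stage


def make_gate(name: str) -> Gate:
    # Placeholder check; a real gate might run a linter, a build, or tests.
    return lambda artifact: bool(artifact.get(name))


STAGE_NAMES = ["plan", "design", "code", "build", "test"]
stages = [(n, make_stage(n), make_gate(n)) for n in STAGE_NAMES]
result = run_pipeline(stages, {})
print(list(result))
```

The design choice worth noticing is that the gates, not the stage implementations, decide what moves forward, which is one way a pipeline could keep output quality steady when the model behind a stage is swapped.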

Hfuffzehn1 day ago
You have correctly identified that getting a "high-quality harness (ie preloaded instructions from md files, including custom skills)" is the (or at least a) hard part.

Because you have to adjust the harness to your problem space and provide that so you can say it is high-quality.

Many people will stop that discussion at the claude code vs. codex vs. opencode level and then merge that with discussing model performance.

And that is also why "Generate an SVG of a pelican riding a bicycle" is still a benchmark worth discussing. Because at least it is a defined problem space.

grey-area1 day ago
Haven’t noticed much significant progress in LLMs myself in 6 months (significant as in new or vastly improved capabilities or understanding, not new releases, there are plenty of those).

I feel like if anything people started to realise the significant limitations of LLMs when you try to use them as ‘agents’ which was the big direction LLM companies tried to push recently.

Best use of LLMs so far IMO is finding vulnerabilities (with human help) and pattern matching in other domains. For generating code and prose they are still mediocre and somewhat unreliable and for use as personal assistant agents I wouldn’t trust them.

So what’s happening with openclaw, the biggest experiment in agentic, vibe coded by the agents themselves? The thing that was so hot a few months ago.

https://github.com/openclaw/openclaw/pulse?period=daily

279 commits to main from 77 authors in the last 24 hours.

Why is there so much churn and how could you trust it with your data? This is changes in ONE day!

If these are useful changes, surely it’d be superhuman by now given months of this pace.

What are people using this for?

alain940401 day ago
We all have had the client from hell: they don't know what they want, they change their requirements all the time. Whenever they have a new half-baked idea, I need to scramble and re-design the architecture. They have no clue that a small change request has a big impact on the code.

Well... Now I can be that client. And let AI deal with my incomplete, always changing requirements. And get it done anyway.

max_unbearable1 day ago
The honest summary that doesn't show up in the six-month roundup: the unevenness. Boilerplate, tests, scaffolding, glue code: dramatically faster, sometimes 5-10x. Architecture, data modeling, careful security work, judgment calls about what to build: same as before, sometimes slower because tab-completion sneaks in plausible-but-wrong defaults you then have to undo.

The thing headline numbers ("AI made me 3x faster") hide is which 30% of the work the AI sped up and which 70% didn't move. For a solo dev the survivable bet got smaller, and that's the real change, not raw productivity. AI made certain projects worth attempting at all that wouldn't have been viable six months earlier.
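The 30%/70% point is essentially Amdahl's law. As a rough worked example, taking the 30% split and the 5-10x range from the comment above as illustrative inputs (they are rhetorical figures, not measurements, and the function name is an assumption):

```python
def overall_speedup(accelerated_fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-job speedup when only a fraction of it gets faster."""
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / local_speedup)


# 30% of the work sped up 5x or 10x, the remaining 70% unchanged.
for s in (5, 10):
    print(f"{s}x on 30% of the work -> {overall_speedup(0.30, s):.2f}x overall")
```

Even with an unlimited speedup on the 30%, the overall gain is capped at 1/0.7, or about 1.43x, which is a long way from a headline 3x.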

exabrialabout 23 hours ago
There's also an inflection point in Feb-April: Claude got considerably worse, and arguably has not really recovered since then. They claim it's fixed, but in my experience it is not as great as it once was. 4.7 is still useless.

Waiting for the next event at this point. Hoping that "inference becomes cheap" when Groq hardware gets delivered.

jonnyasmarabout 16 hours ago
This is most impressive because the last 6 months in LLMs has actually been more like a hyper-compression of decades of tech progress.
vishal_new1 day ago
what are your thoughts on Software engineer replacement. My team has already seen big reductions. Q/A team is gone. Software Engineer reduced by a third. Scared for the future
simonw1 day ago
Ditching the QA team when the single highest challenge is verifying that vibe-coded systems do what they're meant to is extraordinarily short-sighted.

Personally, the more time I spend working with coding agents the less worried I am for my career. Getting the best results out of them is really hard. They amplify existing skills and experience, so the more experience you have the better.

kenloef1 day ago
I believe that many of those saying that they "never write code anymore" or are experiencing "10x productivity," are heavily underestimating (or outright misrepresenting) how much they are guiding the model, and ignoring everything else that goes into shipping fit for purpose software. I frequently see zero measurements or factual arguments supplied to support such claims. I also see many people say that they are "vibe coding," when they are almost certainly reviewing, editing, or otherwise steering the output.

I wonder why there is such a mad dash to trump up the capabilities of coding agents. And why such loose terminology and lack of rigor? I thought programmers were supposed to be rational people (har har!)

koonsolo1 day ago
Have you seen the automated tests that QA members deliver? My experience is that they are horrible, and it's not so hard to beat that low quality bar with an LLM.

I have a theory: if they were good at writing automated tests, they would have been developers instead of QA engineers.

Not saying that there aren't any high quality QA engineers, I worked with some. But LLM's raised the bar in a way that most QA engineers can't reach.

simonw1 day ago
Yeah, I don't think the role of QA is to write automated tests - developers should be doing most of that work.

The best QA people I've worked with didn't write much code at all. You'd give them a new system and they'd find all of the bugs, testing obscure edge-cases that you'd never thought of.

Mashimo1 day ago
Huh, never thought about QA writing unit tests.

In my limited experience they write test cases, test each story, do regression test, verify bugs from customers. All by hand.

At my current job I don't want to miss them.

ShinyLeftPad1 day ago
If you're famous, you'll be fine. If you're in retiring age, you don't care. Otherwise, good luck! We put ourselves on the street by not protesting what is happening.
munksbeer1 day ago
I think the general population earning median wages will have very little sympathy for first world software engineers earning vast amounts of money.

What are you going to tell them? Suddenly you're earning what they're earning for sitting at a desk every day?

ShinyLeftPad1 day ago
General population, you mean non SWEs? Because there are many SWEs around the world who earn median wage and who stand to lose it all as the avalanche of firings is ramping up.

Non SWEs (salespeople, clerks, secretaries, assistants, taxi drivers, writers, 3D modelers, artists, designers) are of course going the same way. Unless they are protected (unionized or such), why would they have sympathy for SWEs? People of our ilk are the ones causing this (to them and to ourselves). What I will tell them is to not repeat our mistake, organize and protest.

vanuatu1 day ago
I think there will be larger markets, more companies, more jobs than before due to AI, but also a very painful transition period

AI reduces the cost of producing software (and other intellectual tasks), which greatly improves the viability for more and more ambitious projects. As far as we know the amount of problems software (and humanity) can solve is unbounded

It feels like the market has shifted in SWE yet again to heavily prioritize a new set of skills, of which those in the top quartile are desired more than ever

asdff1 day ago
The problems in any domain are infinite. But, alas, money is not.
trojans12901 day ago
What are these skills?
stuxnet791 day ago
This is the magic question that I'm very eager to hear the answer to.

Fundamentally, steering LLMs requires the same structured, logical thought process that is required to write code, regardless of abstraction level. Contrary to what HN would have you believe, this is not a skill that is equally distributed across the population.

But given the rapid pace at which this technology is evolving, "steering" may very well be ceded to the clankers. LLM agents are fantastic at logical reasoning & any inefficiencies relative to human experts can be circumvented by sheer compute.

koonsolo1 day ago
Being able to work with an infinite amount of dumb interns that work super fast and have a vast amount of knowledge.
empath751 day ago
There is an entire category of software engineers who exist entirely to knock out features on microservices or do easily automatible QA work whose jobs will disappear.
ramon1561 day ago
> Google released the Gemma 4 series of models, which are the most capable open weight models I’ve seen from a US company.

Implying another country has a better model? I'm being pokey here because I'm very curious! I know Gemma is efficient, but I also remember Qwen and Kiwi being referred to as optimized. The difference being that Gemma uses fewer tokens, but maybe Qwen/Kiwi's quality is higher? I don't know.

pferdone1 day ago
The consensus right now is that Qwen3.6 in its 27B and 35B-A3B versions is better for coding whereas Gemma4 is stronger when it comes to OCR, audio transcription and the likes. Margins are slim though and the harness at these model sizes is the most important factor.
0xCMP1 day ago
In my experience the qwen models are best locally, but gemma ones have always been good. gemma4 is a notable improvement.
kramit12881 day ago
The top model changes every other month between Claude, GPT and Gemini, but overall it's dominated by GPT. Claude took the lead in coding tasks, but GPT 5.5 has come back stronger; Gemini was good in between. Overall it's dominated by GPT 5.5 and Claude. Coding is the area where disruption is hardest. OpenClaw early this year was a major breakthrough in agentic AI; it is still making noise, maturing, and moving toward the enterprise. Agentic coding is still in the adoption phase: teams are trying it, trying to make sense of it, running it and not believing the results, and eventually it becomes a discussion point over tea. But the needle has moved from it being alien to being something real that teams have started discussing and using like a champ.
dnnddidiej1 day ago
Also LinkedIn wars of people trying to claim throne as most AI-pilled, throwing down strawmen stories of luddites yelling at data centres who'll lose their job to a single person doing 100x work.
pamcakeabout 21 hours ago
> I put together these annotated slides from my five minute lightning talk at PyCon US 2026

Is there a video or audio of this talk?

bschwindHNabout 22 hours ago
ionwake1 day ago
Why is there no talk about how the world is already run by AI by proxy, i.e. bureaucrats using ChatGPT to make their speeches, decisions, shopfront designs, etc.? I just don't seem to read about this; instead it's more about some nebulous date in the future.
epolanski1 day ago
I'm tired of the pelican bench. It made sense in the beginning, but by now it is too popular and too old for the assumptions from a year ago (that it is absent from the sample/training/reinforcement data) to still hold.
LarsDu881 day ago
My goal post for "AI will definitely replace most SWEs" was to reproduce a particular 90s programming game one shot and then add multiplayer support with minimal prompting.

Opus 4.5 hit that point in November.

grey-area1 day ago
I tried this a while ago, haven’t tried again recently. The models were producing code that was clearly lifted from stuff in their training data, and what I ended up with was a fairly decent game in html and js after a bit of tidy up, though it felt like several code paradigms smooshed together rather than a coherent whole, but it mostly worked. Not something I’d want to maintain but it was impressive at the time.

They were able to one-shot famous games (like asteroid or pong), I suspect because they had been trained on multiple versions of that game. So like producing Harry Potter, with the right prompt it was able to produce a license stripped version of code it had seen. I tried another arcade game like frogger and it failed really badly and took a lot longer, never got it working.

The whole exercise left me feeling they have a long way to go, I don’t see how anyone could think they would replace SWE unless they didn’t look at the code produced, even now.

vessenes1 day ago
Out of curiosity - what harness did you use, and what model? And how are you prompting? In my mind prompting like:

“You’re going to make frogger in javascript. I want a complete clone of functionality for level 1, with amazing 80s era pixel art sprites. I’m super lazy, so you’re going to have to test everything, right from the start. Pick a test harness, write the tests, including tests for having amazing graphics, gameplay, input, UI, sounds, etc, and write a full workplan, then work through that workplan, in parallel where you can. The workplan should emphasize getting a stripped down version up immediately and have workstreams for all the major requirements after that. Add a final test that assesses how fun the game is by reviewing a real video of a test run. Loop on that final test until you can’t improve things any more.”

Should produce something playable with no further input. As you say, I’m not sure it would produce a codebase we’d want to look at or work on. But, I’d be surprised if this weren’t successful.

grey-area1 day ago
Sure give it a go, perhaps it will work better now with frontier models, I haven't tried it in a while (this was a year ago, things have improved since then). I'm not sure what tests for having amazing graphics, gameplay, input, UI, sounds, etc would look like, but it would be interesting to see the results!
abstractbill1 day ago
"... though when it tried to animate it the bicycle bounced off into the top and the bicycle got warped."

Should be the pelican bounced off.

bluegatty1 day ago
'Producing Images' or even 'Some Code that is Valid and Compiles' is in some ways one of the most misleading ways we assess quality of the AI.

It is getting very good at producing code that compiles - at the algorithmic level.

This is definitely noteworthy - and the AI is crossing a critical 'productivity threshold'.

But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.

Everyone has tried to get AI to 'Draw The Thing They Want' and you notice immediately how it's almost impossible to 'adjust the image' along the vector you want - because ... and this is key:

-> the AI doesn't really understand what a Duck is, its components, or fully how it made the duck <-

It just knows how to 'incant' the duck.

This becomes very clear when you try to get the AI to write proper documentation - it fails so miserably, even with direct guidance.

This is really strong evidence of how poorly the AI is generalizing, and that it is not 'understanding' rather it's 'synthesizing' from patterns.

We already kind of knew that - but we have not yet built an intuition for that until now.

Only now can we see 'how amazing the pattern synthesis' is - it's almost magic, and yet how it falls off a cliff otherwise

This has deep implications for the 'road ahead' and the kinds of things we're going to be able to do with AI.

In short: the AI is 'Wizard Level Code Helper, Researcher, and Worker' - but it very clearly lacks capabilities even one level of abstraction above the code itself.

LLMs were first trained by 'text' and now ... they are 'trained by our compilers'. Basically g++, javac, tsc are the 'Verifiable Human Rewards' in the post-training and reinforcement learning - and the AI is getting extremely good at producing 'code that compiles', but that's definitely an indirection from 'code that does what we want'.
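
A rough sketch of that gap, using Python's built-in `compile()` as a stand-in for g++/javac/tsc. The function names (`compiles_reward`, `behaviour_reward`) and the sample snippet are made up for illustration; this is not how any lab is known to score outputs, just a toy showing that "it compiles" and "it does what we want" are separate checks.

```python
def compiles_reward(source: str) -> float:
    """Return 1.0 if the source compiles, else 0.0 (toy 'the compiler accepted it' signal)."""
    try:
        compile(source, "<candidate>", "exec")
        return 1.0
    except (SyntaxError, ValueError):
        return 0.0


def behaviour_reward(source: str, func_name: str, cases) -> float:
    """Return the fraction of (args, expected) cases the named function gets right."""
    namespace = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing case simply earns nothing
    return passed / len(cases) if cases else 0.0


# Hypothetical model output: valid Python, wrong behaviour (subtracts instead of adds).
candidate = "def add(a, b):\n    return a - b\n"
cases = [((1, 2), 3), ((0, 0), 0), ((5, 5), 10)]

print(compiles_reward(candidate))                 # the snippet parses, so full reward
print(behaviour_reward(candidate, "add", cases))  # only the (0, 0) case passes
```

The first signal is cheap and unambiguous, which is presumably why it is attractive as a reward; the second needs someone to have written the right test cases, which is the indirection being described.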

It's astonishing that it took us all this time to internalize and start to discover what I think will be in hindsight a very obvious 'threshold' of its capabilities.

We are constantly 'amazed' at the work that it can do, and therefore over-project its capabilities.

I have no doubt that even with these limitations - the AI will unlock a lot more as it gets better - and - that it will 'creep up' the layers of abstraction of its understanding.

But I strongly believe that the AI is going to get much 'wider' (pattern matching dominance) before it gets 'higher' (intrinsic understanding) - and - that this may be a fundamental limitation.

This may be 'the LeCun' insight - when he talks about the limitations of LLMs in detail - I believe this is that insight writ large.

Even the term AI - or certainly 'AGI' may be a misleading metaphor - were we to have always called it 'Stochastic Algorithms' or something along those lines, it's possible that our intuition would be framed a bit better.

The most interesting thing is how it is definitely amazing, world changing, novel and powerful in some ways - and obviously useless in others at the same time. That's the 'threshold' we need to better understand.

nl1 day ago
> But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.

That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.

The model actually has to understand what parts of a pelican and bicycle come together in something like an anatomically plausible way. That's a higher level of abstraction than something like passing the same prompt to Stable Diffusion etc

(The new Nano Banana/GPT Image 2.0 models are different though - they have significant world knowledge baked in)

bluegatty1 day ago
"That's a higher level of abstraction"

No, it's not, because it's seen 'anatomy' for Pelicans, Animals - even how it's represented in Animals.

If you try to get the AI to actually decompose it and start to 'draw pelicans' in very obscure ways, it will immediately fail.

Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.

Precisely because it does not understand those things.

FYI it's a slightly unfair case because it does not have a 'world model' yet, which will actually solve that problem, but even then not through very much abstracting.

We're a long way away - but in the meantime, there's lots to unpack.

nl1 day ago
> Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.

Proof by existence?

https://gist.github.com/nlothian/50241d34a654fcf0caa280d4475...

Looks pretty good to me. ChatGPT in "Thinking" model.

Edit: I've added the Opus version on the same link.

koonsolo1 day ago
> That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.

When it was new, sure. Right now, models can be trained on that because everybody uses it as a benchmark.

bluegatty1 day ago
The model is trained on lines, and the word 'pelican' and not much more.

The model does not 'understand' comprehensively the relationship between anatomy, dimensionality, etc..

iammjm1 day ago
you can replace the pelican and the bicycle with your preferred animal and a means of locomotion. I bet you can come up with a pair that definitely wasn't in the training data
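
For what it's worth, a minimal sketch of that idea: sample animal/vehicle pairs and drop them into the same prompt wording quoted upthread. The word lists and the `svg_prompts` helper are invented for illustration, and nothing here can tell you whether a given pair actually appears in anyone's training data.

```python
import itertools
import random

# Illustrative lists only; swap in whatever obscure pairings you like.
ANIMALS = ["pelican", "wombat", "narwhal", "capybara", "heron"]
VEHICLES = ["bicycle", "skateboard", "pogo stick", "hang glider", "rowing boat"]


def svg_prompts(animals, vehicles, k, seed=0):
    """Return k distinct prompts, sampled reproducibly (k must not exceed the number of pairs)."""
    pairs = list(itertools.product(animals, vehicles))
    rng = random.Random(seed)  # fixed seed so every model sees the same prompts
    chosen = rng.sample(pairs, k)
    return [f"Generate an SVG of a {animal} riding a {vehicle}" for animal, vehicle in chosen]


for prompt in svg_prompts(ANIMALS, VEHICLES, k=3):
    print(prompt)
```

Fixing the seed keeps the comparison fair across models, and changing it gives a fresh set nobody has published before.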
viking1231 day ago
Wow! Actually a sensible comment under all the astroturfing that even this place is so full of now.
qiine1 day ago
>pelicans can’t ride bicycles... and there’s zero chance any AI lab would train a model for such a ridiculous task.

humm

bunzee1 day ago
Spot on. Building our tool, we found AI is magic at scraping competitor data, but terrible at market validation. The 'why' is strictly human.
2001zhaozhao1 day ago
So, the best way to use LLMs is to wait for your competitors to do market validation and then scrape their data.

Hmmm......

tardedmeme1 day ago
It's always been much easier to copy an existing product than to make a new one nobody's thought of before.
kkarpkkarp1 day ago
Sorry, but how does this comment relate to the post being discussed?
ex-aws-dude1 day ago
Is the RLVR the key breakthrough for the uplift or is there more to it?

Does that suggest the uplift was only for things that are easily verifiable like code?

vanuatu1 day ago
Yes, with good RLVR at scale you can greatly improve performance especially on benchmarks

The hope was that good RLVR on relatively contrived datasets (like benchmarks) would be generalized to good software taste, which has somewhat succeeded but also the models fail in horrible ways still

And the hope beyond that is that good skills in fundamental problem solving tasks (coding, math) would generalize to tasks beyond math and code, which did happen but less so

rdedev1 day ago
I would say that most improvements are in easily verifiable things like code or math. At least that's where all the amazing results seem to be coming from.

For other domains I'm not sure, but I've heard from people like Cal Newport that the rate of improvement outside of code and math is not as impressive.

4b11b41 day ago
RL we're gonna find out will get abandoned cuz we don't even know what is getting "aligned", just my naive gut feeling don't take it seriously
DeathArrow1 day ago
Apart from GLM 5.1 and Qwen 3.6, there are other Chinese models that are noteworthy: Kimi K2.6, Xiaomi MiMo V2.5 Pro, Deepseek v4 and MiniMax M2.7.
simonw1 day ago
100% true - I only had five minutes so I had to edit it down to just a couple, but all of those models are excellent and keep leap-frogging each other.
rahimnathwani1 day ago
Looking forward to next time, hoping you mention speculative decoding and MTP :)

It would support your point about the performance of 20GB local models.

subarctic1 day ago
Is there a video of this talk?
MagicMoonlight1 day ago
They’re definitely RL training the models on the pelican test. They patch any kind of test that shows them performing poorly by hardcoding some answers into the model.
inglor_cz1 day ago
"there’s zero chance any AI lab would train a model for such a ridiculous task"

Hmm, given how small the nerd community is and how often I met that task either on Hacker News, or on various Substacks, I am not so sure that the AI labs would ignore it completely.

bob10291 day ago
It definitely seems like the point of no return has been passed.

The size of the codebase doesn't matter anymore. In fact, I am finding that the larger the codebase the better the performance. Starting from scratch with vague ambition is not the same as solving a specific stack trace over a mountain of decade-old code. The latter performs better and is also more exciting for the business. It would seem more callers = more constraints to verify against.

For the last 3 months I've felt like I've been dropping gps guided bombs from orbit. No one can tell the difference between AI authored and my hand written code, other than via the implication of the radically increased daily work volume. There's definitely AI in there, but it's like a homogeneous cybernetic blend of my work and the computer's. I own all of it, can explain all of it immediately, but I only wrote maybe 10% of it by hand.

The development team should be mostly "solved" by now with regard to the AI transformation. If you are still at Home Depot picking out your proverbial hammer, it's time to start heading for the self checkout. The rest of the business is where the real money and headlines will be made at this point. AI writing code is ancient news now. Custom harnesses that business people can use to automate workflows will print a lot more money. Bringing some bacon to the rest of the business may also help to preserve your career path in these uncertain times.

Remember what Jobs said about the customer. A lot of times, people don’t know what they want until you show it to them. Most people wouldn't have believed the iPhone was even remotely possible until the moment it was publicly revealed and made available for purchase. I am finding the same effect in the business with AI. What it can actually do when well engineered and applied to the domain will usually outperform the expectations of its users by a wide margin. All these fears about alignment, hallucinations, cost, ethics, the environment, my ego/career, etc., seem to melt away like some kind of luxurious chocolate once the performance becomes clear to the executive staff. I was able to convince the board with an unsolicited, 5 minute demo I didn't even personally deliver. I've never seen these people sign contracts so quickly.

3l3ktr41 day ago
I'm always surprised to see HN people saying models aren't good. What are these guys building? The best engineers I know, from startup to big tech admit these models are incredible. Including people I don't know personally, foundational engineers from every area. The average HN person though, is doing some quantum-alien computation that not even the best developers in the world can grasp.
bradley131 day ago
As someone who uses AI daily (not in agent mode, just user-interactive), I have definitely noticed major quality improvements over the past few months. And that's surprising, because when you use something daily, you tend to overlook the big jumps.

I haven't looked into any sort of "agent" mode, just because I don't yet quite trust the AI not to do something dumb. Also, I don't use M365, where Copilot is integrated, so I suppose I would have to set it up myself.

zkmon1 day ago
What real world problem is closely linked to the skill of drawing a pelican riding a bicycle?
gib4441 day ago
Starting from zero today, how would someone quickly get up to speed with the latest and greatest AI tooling on an extremely limited budget?

Is the only choice to pay for the "max" plans?

Or just read so much about it that you bs your way through an interview and then use the company's resources?

Simon, I'm curious too how much you invest each month researching all the latest and greatest AI tech?

x86cherry1 day ago
Opencode has free access to Qwen 3.6 and Deepseek v4 Flash right now.

They're on par with Claude and Codex imo - when you still design architecture and know what the output should be. Claude and GPT 5.5 need less guidance with vibe coding, but we're not yet at a point where that's sustainable anyway even with those models.

NortySpock1 day ago
I made an account on OpenRouter.ai , created an API key, plugged the API key into the Zed editor, and started asking free models questions about my codebase.

Once I felt I had some confidence on what the spend rate would be, I bought $20 USD worth of credits and would occasionally point my editor at a cheap paid model for some real-time questions.

I've still only spent less than $2 in credits so far, as often a free model can answer my question fast enough.

I have not yet tried agentic coding, but at least with OpenRouter API keys it's trivial to cost-cap keys so you can pay for lower latency and still cap your spending.

RobinL1 day ago
$20 chatgpt pro plan gives pretty generous usage both of codex, general chat
gib4441 day ago
Ah I'd read so much about the downgrading of that plan I didn't think that was still true?
azuanrb1 day ago
It depends on what you’re comparing it against. For $20, OpenAI is still probably the best value for SOTA models. In terms of limits, you can use GPT-5.4 instead of 5.5. The intelligence feels similar, but it’s cheaper. You can also experiment with other harnesses like pi. It’s lightweight but capable enough, and its token usage is definitely much more efficient.
myaccountonhnabout 23 hours ago
Opencode go + pi.dev is 10$ a month.
DeathArrow1 day ago
>Starting from zero today, how would someone quickly get up to speed with the latest and greatest AI tooling on an extremely limited budget?

Z.AI, Moonshot.AI, Xiaomi, Minimax, and Alibaba all have coding plans that allow heavy usage of GLM 5.1, Kimi k2.6, Minimax M2.7, Qwen 3.6 Plus, and Xiaomi MiMo v2.5 Pro for cheap.

Pair those coding plans with the harness of your choice, including Claude Code, and you are good to go.

Also, Nvidia is offering free access to top models through NIM, but with a 40 RPM limit. https://blog.kilo.ai/p/nvidia-nim-kilo-code-free-kimi-k25

aizk1 day ago
I'm so glad Simon is documenting this. The field is evolving so fast, so rapidly, so hungry for data and money, that few are willing to zoom out and document everything big picture so we can see the changes over time. I mean do you guys remember "Do anything now"? Just a distant memory, a funny party trick.
hansmayer1 day ago
TL;DR:

"Coding agents got really good - here, a bunch of irrelevant slop-pictures of pelicans riding bikes as a key benchmark AND a couple of hardly relevant edge-case demo-projects of mine to prove it right! "

Come on man, where is the AI writing all the code in 6 months? We're close to June and Amodei's latest statement from January does not look like it will be fulfilled in the next few weeks, does it now?

tayo421 day ago
The claw thing really came and went fast lol
yieldcrv1 day ago
I just started a new job and the person I report to was just excited to tell me about it, here in Mid May

"and then you have to get a mac mini, and then, and then"

smile and nod, it pays weekly

viking1231 day ago
I mean yeah? It was a marketing campaign to boost the model providers and give Steinberger a cozy job at OpenAI. Hook, line and sinker.

Wake me up when we have an agent with constant learning and changing weights that I can have personally, not some LLM that can always fall prone to jailbreak and context injection attacks.

You think most of this stuff here is organic? Oh boy..

DeathArrow1 day ago
I think that there's a lot to be improved in harnesses and the way the models are interacting with harnesses. For example, the harness should be able to steer the model when thinking.
_josh_meyer_1 day ago
a 5-minute video version (with local TTS model) https://tldr-api.manatee.work/v/dmYg0U
Razengan1 day ago
AI is like Sauron's Ring: it only amplifies the user's innate abilities.

It can either help you conquer the world if you were already doing that anyway or it can make you spend your life in a cave before throwing you into a fucking volcano.

wewewedxfgdf1 day ago
Does this guy have a "publish to front page of HN" button on his blog editor?
xnorswap1 day ago
HN has a mechanism that causes popular blogs to stay popular.

It's a winner-takes-all karma prize for being first to post the article.

This causes a rush of people to post.

HN has a mechanism by which duplicate submissions count as upvotes toward the first submission.

This is a positive feedback for the desire to be first, which increases duplicate submissions and in turn the karma reward.

This effect means that good blogs stay well upvoted. This isn't altogether a bad thing, but it does mean some blogs require a string of poorly received posts before that effect wears off and people no longer rush to be first.

One way to fix this would be to attribute all karma to user simonw himself (and do similar where attribution to an HN user is known).

nickvec1 day ago
He’s pretty well known in the HN community. https://en.wikipedia.org/wiki/Simon_Willison
koolala1 day ago
thats a cool wiki picture
simonw1 day ago
I didn't even submit this one. I didn't actually think this was a good fit for hacker news, the pelican bicycle thing is pretty much played out here already!
schnitzelstoat1 day ago
I liked the article, so if he has such a button I hope he keeps clicking it.
specproc1 day ago
He's one of the main developers behind Django.
dcminter1 day ago
Years ago I used to read his blog on Django and found it quite interesting despite being neither a Django nor even a python user - this must have been at least 10 years ago and perhaps more.

When he resurfaced in my feeds as an AI commentator it took me quite a long while to join the dots that he was the same person!

koolala1 day ago
its better than ex-google CEO spam i see astroturfed everywhere else
victorbjorklund1 day ago
he usually has good posts so people usually upvote
troupo1 day ago
He has the most measured (and often quite detailed) posts on LLM and LLM progress, and is the opposite of hype.
nothinkjustai1 day ago
[flagged]
tomhow1 day ago
We've banned this account.

We detached this comment from https://news.ycombinator.com/item?id=48189072 and marked it off topic.

bb881 day ago
I met Simon for the first time this year at pycon. Wow, what a great guy.
iekekke1 day ago
It’s good to see dates being hard coded re. Improvements in the models that should deliver material gains.

As time progresses one now has a yard stick to measure against progress. No more excuses - show me the money baby.

jrowen1 day ago
There's something fitting about the mystical nature of LLMs and scrolling through a bunch of goofy pelicans on bicycles representing report cards for the bleeding edge of technology.

How are these even graded? Qwen3.6-35B-A3B gets high marks for a pelican with a gaping hole in its bill?

edit: Just noticed its feet are disconnected from its legs as well (but right on the pedals!). Pardon my French but that's Chinese af.