When I reject AI code even if it works

vvnbrs•about 3 hours ago•42 comments•Read Article on vinibrasil.com

Discussion (42 Comments)•Read Original on HackerNews

Aurornis•about 2 hours ago

Even using Fable (while it was briefly available), having it refine a plan, and directing it to make only small incremental changes, I still found reasons to reject its first pass at a lot of work. There were a lot of “You’re right to push back” responses, and a lot of incidents where it would create some giant complex set of abstractions to accomplish something that I could find ways to do much more elegantly and in a more maintainable manner.

It’s really eye opening to work with these tools on a codebase you know deeply because these problems are everywhere.

However if I opened an unfamiliar project in another language and I wanted to add a little feature with no intention of maintaining it, I’d happily accept the changes and loop until it worked well enough for my temporary needs.

The scary middle is when you’re dealing with coworkers who don’t care about anything other than closing tickets and collecting credit. With enough of a token budget you can now wrap loops around an LLM and have it try things until the program appears to work. Ask it to do a code review and then submit the PR without having understood what it was doing. There are a lot of workplaces where there isn’t a good mechanism to push back on this and the tech debt just keeps growing.

abhgh•about 1 hour ago

These "You're right to push back" scenarios are scary for me. I mostly code ML implementations, and some of the errors Claude Code (CC - have only used Opus 4.7) makes are very sneaky, and if you don't have sufficient experience in the area (I see this with people entering ML and writing their implementations with CC), you wouldn't know when to question CC and will let errors or future pitfalls silently slip into your code. A recent example was when there was data leakage in a model calibration step, which it refused to see as an error, till I wrote a detailed reason, and then it agreed that there was a "subtle leakage".

resonious•about 1 hour ago

All Claude models are huge suck ups. The "you're absolutely right" meme is real even if that exact phrase doesn't show up as much anymore.

I don't want to start a fight or anything but IME Codex has a bit more of a spine. If you point out something weird, it sometimes gives a good reason for it. Whereas Claude will always say "whoopsie you're right as always sir" even when it's me who missed something.

herdymerzbow•40 minutes ago

I only use free AI chats to help me with my learning, but I often direct it to keep its responses neutral and to refrain from encouraging language or value judgements. That tends to get rid of these 'you're absolutely right' comments when I point out a mistake.

But your comment just made me wonder whether this tendency for LLMs to resort to flattery when found out is a built-in strategy to distract the user from the error-prone fragility of much of the output. It's perhaps a stretch to think these canned responses were put in strategically, but the result is that the user's attention may be deflected to contemplating their own superior knowledge and insight, basking in the glory of all that, and then forgetting to appreciate that 'Hey, chatLLM is just making all this stuff up/doesn't know which way is up/or down!'

teaearlgraycold•about 1 hour ago

Right now the thing I get from Opus 4.8 is a ton of “That’s a good instinct”. Also >50% of its closing statements begin with “Clean.”

darkerside•about 1 hour ago

In fairness, you could throw the most senior engineer into a brand new codebase, and they would probably make a dozen mistakes if you immediately had them pick up invasive and risky work.

embedding-shape•about 2 hours ago

> There are a lot of workplaces where there isn’t a good mechanism to push back on this and the tech debt just keeps growing.

If the "big ball of spaghetti" theory holds, where software companies that can't manage the debt stumble over themselves as they continue to add to the big ball of spaghetti code, I guess we'll see a row of companies declaring "software bankruptcy" or something in some/many months, depending on how well these workplaces learn to care slightly more and get better at pushing back against slop.

codemog•about 1 hour ago

Coding agents have been better than the average "enterprise" programmer for a while now and nobody wants to admit it or talk about it. I have never seen an agent output an implementation called FooImpl that's tens of thousands of LOC in a single file, but I have seen plenty of human code like this.

People call coding agents bad because they don't know the asinine meaningless conventions at their particular company while they themselves write awful abstractions and brittle tightly coupled systems, but hey, at least they know how to write a for loop how their particular company likes.

fzeroracer•about 1 hour ago

> I have never seen an agent output an implementation called FooImpl that's tens of thousands of LOC in a single file, but I have seen plenty of human code like this.

And how long does it take a coding agent to output a thousand lines of code versus a human? The worst human at any company was rate limited by themselves. Those 'average enterprise' programmers aren't going away, they're the ones now spending tens of thousands on coding agents and filling your codebase with even more garbage without bothering to review an iota of it.

busterarm•about 2 hours ago

> With enough of a token budget you can now wrap loops around an LLM and have it try things until the program appears to work. Ask it to do a code review and then submit the PR without having understood what it was doing. There are a lot of workplaces where there isn’t a good mechanism to push back on this and the tech debt just keeps growing.

I'm not making an argument in favor of people using LLMs for this, but people were doing this before we had LLMs; it was just usually a bit slower. I can't even say it usually doesn't work out long term because I worked with a lot of guys who did this and took a ton of Adderall while working practically around the clock. Every incentive structure in the organizations rewarded it, along with social credibility from more junior engineers. (The last cowboy I worked with who pulled this shit ended up becoming the most senior engineer in the company, a multi-millionaire and worshipped like a god by 90% of the mostly fresh grads we were hiring).

The problem is that when these people invariably burn out and leave, they leave a massive vacuum in their stead. Not from the load they were carrying, but from the load they were creating.

I think the larger the organization I've been at, the more they reward the people making huge commits on nights and weekends. Worse, they could get away with TBRing their shit and merging it without review.

LLMs are often all of the bad habits and organizational problems that we already carried, just being speedrun. There are some places doing it right, but they already were.

timacles•about 1 hour ago

> There are some places doing it right, but they already were.

Could you be more specific what "right" is?

> I can't even say it usually doesn't work out long term because I worked with a lot of guys who did this and took a ton of Adderall while working practically around the clock. Every incentive structure in the organizations rewarded it along with social credibility from more junior engineers. (The last cowboy I worked with who pulled this shit ended up becoming the most senior engineer in the company, a multi-millionaire and worshipped like a god by 90% of the mostly fresh grads we were hiring).

I'm having a tough time believing this. It sounds like you're trying to rationalize in hindsight that the more productive engineers were "on drugs" and that they delivered but "did it wrong".

jdw64•10 minutes ago

Coding with AI eventually comes down to two paths, I've realized. One is using AI exclusively for everything. The other is not using it at all. There is almost no middle ground. The reason is that as the complexity and depth of the problem increase, the code AI generates increasingly follows enterprise level patterns. The deeper the meaning of what I input, the more AI tends to produce code that goes beyond my own area of expertise. For example, a human expert's code is very powerful and deep within their own domain, but when you look at the entire codebase, it's often shallow and uneven outside that domain. But the moment you write code with AI, once you go deep in one part, AI tries to standardize the rest accordingly. This means the entire codebase converges toward enterprise level standard code, which essentially reflects the average patterns of senior programmers who built large scale systems.

The problem is this. Human cognitive resources are finite, so we inevitably become shallow outside our own expertise. There is no programmer who can do everything well. And as systems grow in scale, they become more modularized and fragmented, making it impossible to understand the whole system. So what should we do about this? That's always the question.

In the end, do I choose not to use AI, finish the project with uneven code outside my domain, and deliver it? Or do I use AI and deliver a program that is uniform and consistent, but not in my own style? I still don't know. I haven't found the answer yet.

ecshafer•about 2 hours ago

If we rephrased this to "When I reject my coworker's code even if it works" and gave the same reasons, there would be zero dissent. There is this weird idea that seems to come up with AI that any solution must be good and adequate. Software Engineering is all about rejecting code that works for the right code that works.

api•about 1 hour ago

Which means it doesn’t matter if the code is from AI or not.

If it’s not good it’s not good.

wwind123•28 minutes ago

I use 3 AI's (Claude, GPT and Gemini) to review each other's design plans and implementation on the same code base. Each often catches problems the others miss.

I try to make sure the architecture docs of the code base are refreshed regularly based on recent changes, so it's easier for humans and AI agents to make sense of the code.

I also regularly stop all other developments and just focus on auditing the code base with these AI's to make sure they are secure, robust, clean, and well structured and well tested -- some refactoring would be needed most of the time, and it's well worth it.

With this approach, nowadays I often merge code from AI without completely understanding what it's doing, but it seems the code has been working so far. :)

BobbyTables2•5 minutes ago

You’ve transitioned from “individual contributor” to “manager”! (;->

cadamsdotcom•32 minutes ago

If you reject AI code that works then your mindset is still too hands on. Put another way - you still have some loops to work on taking yourself out of. The agent should’ve delivered code that was acceptable as a first pass.

Agents respond really well to feedback! They have no ego and they’ll happily improve code if told where and how. But you need to provide the tools that provide that feedback without your involvement - otherwise you can’t scale.

All the linting and autoformatting you can put in is a good start. Next, create custom scripts that check for every single dumb AI-ism you can think of, tell the agent about them, tell it to use them to check its work, and put them in hooks so the harness refuses to let the agent stop until all your linters show no errors.

Then, keep iterating basically forever. Any dumb AI-ism you see, make a linter for it, give it to the agent, and enforce it using the harness.

I’ve spent months doing this. When I review a PR - which was built by the agent with TDD so it definitely works - I’m no longer asking if it did dumb stuff or confirming it conformed to the architecture or duplicated code or missed opportunities for reuse. That’s all linted for. I don’t worry about duplication or outdated docstrings/comments because the self review caught all that. I mostly read it to look for opportunities to make the feature even better & more useful.

If this makes no sense or you disagree it’s possible, my contact details are on my profile and I’ll be happy to give a demo.
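
For readers wondering what a custom check of this kind could look like, here is a minimal POSIX shell sketch of a grep-based lint. Everything specific in it is an assumption made for illustration: the function name `check_ai_isms`, the three patterns, and the exit-code convention are hypothetical and are not the commenter's actual rules.

```sh
#!/bin/sh
# Hypothetical lint for a few "AI-isms". The patterns are illustrative
# assumptions, not rules taken from the thread.

# check_ai_isms FILE...
# Prints file:line:text for each match.
# Returns 1 if any rule fired, 0 if the files are clean, 2 on a grep error.
check_ai_isms() {
    _rc=0
    # The trailing /dev/null forces grep to print file names even for a
    # single file and keeps it from reading stdin when no files are given.
    grep -nE \
        -e 'TODO: implement' \
        -e 'In a real (app|application|implementation)' \
        -e 'except Exception: *pass' \
        "$@" /dev/null || _rc=$?
    case "$_rc" in
        0) return 1 ;;  # grep matched something: fail the check
        1) return 0 ;;  # grep matched nothing: clean
        *) return 2 ;;  # grep itself failed (e.g. unreadable file)
    esac
}
```

A harness or Git pre-commit hook could call something like `check_ai_isms $(git diff --cached --name-only)` and refuse to proceed on a nonzero status. Checks like this only cover patterns that can be expressed as a regex; stylistic or taste issues would need a different mechanism.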

equinumerous•28 minutes ago

I am very curious what some of your lint rules look like in practice. In my mind a lot of the AI-isms in my code that I hate are stylistic or a matter of taste, not necessarily something I could write a deterministic rule to check. But I want to hear more. Like, what kind of linters did you create and which were highest impact?

unknownfuture•19 minutes ago

Frankly, if that's truly your flow, then you cannot possibly know if the code really does what you expect it to do.

"TDD" isn't some magic trick. The tests codify the expected behavior. But if you don't review them for correctness, if you let the LLM build them blindly, then you have no idea what those tests assert and can make no claims about whether the code then does what you expect.

That's fine. That's your choice.

But you have to acknowledge you've chosen to accept that you personally cannot vouch for the quality or correctness of that code.

I fully expect this to be the direction the industry goes, where increasingly complex systems exist that no human actually understands or can reason about.

I think it's bad for the industry. Very bad.

But I'm not making those decisions, so... it is what it is, I guess.
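
The point about unreviewed tests can be made concrete with a small shell sketch. It is a contrived example, not code from the thread: the `add` function and both test names are hypothetical, and `add` is deliberately buggy. The first test is the kind that goes green without asserting the behavior anyone cares about; the second pins down the expected value and catches the bug.

```sh
#!/bin/sh
# Hypothetical function under test. Deliberately wrong: it multiplies.
add() {
    echo $(( $1 * $2 ))
}

# Vacuous test: passes as long as add prints anything at all.
test_add_vacuous() {
    [ -n "$(add 2 2)" ]
}

# Meaningful test: asserts the actual expected sum.
test_add_checks_value() {
    [ "$(add 2 3)" -eq 5 ]
}

# Tiny runner: report each test without aborting on failure.
for t in test_add_vacuous test_add_checks_value; do
    if "$t"; then
        echo "PASS $t"
    else
        echo "FAIL $t"
    fi
done
# Prints:
#   PASS test_add_vacuous
#   FAIL test_add_checks_value
```

A suite made up only of tests like the first one would be fully green while the function is wrong, which is why reading what the tests assert matters more than the pass count.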

julianlam•22 minutes ago

I think a particular failing with developers embracing AI is fighting the sunk cost fallacy. While you might not have spent as much time putting together a non-working solution, you still did spend time working with the agent to slap together a non-working solution.

Being able to step back and say "this was a failure and we need to discard the day's work and start over" is still hard with LLMs.

krupan•about 1 hour ago

And again this makes me wonder, is AI really helping if this much review and rework is needed for all the code it writes?

unknownfuture•9 minutes ago

I mean, the reality is a ton of folks in the industry, myself included, are writing glorified CRUD apps in their day jobs. We're building into existing codebases with established infrastructure and ways of working. What we're building isn't inherently complex or very interesting.

Meanwhile, those codebases often require a ton of boilerplate and drudgery.

In these spaces it's very easy to read and comprehend AI generated output and review it fairly quickly. So the time savings from dealing with all that boilerplate and conforming with all that existing infrastructure are potentially substantial.

teaearlgraycold•about 1 hour ago

Depends on what it’s writing. There are times an LLM saves me a lot of time researching library functionality. Especially with testing frameworks. So many strange and arcane features out there beyond the basics, but not hard to understand what they do once you see the code. On that topic I should say I am careful when reviewing the actual test cases.

However if you’re highly familiar with a domain then LLMs are much less useful.

summerlight•about 2 hours ago

My personal rule of thumb: I am usually okay with agents driving e2e implementations if it won't make life noticeably worse when they don't work. Some analytical code? Perfectly fine. Hobby projects? Fine, though I prefer doing the fun part myself. Refactoring production code generating 10x more revenue than my salary? You'd better at least understand what it does.

resonious•about 1 hour ago

Yes this is the thing with these new tools. You have to know when to use them and when not to.

Good ol' software architecture tricks can also help you slot "vibe coded" components into a larger system safely.

datadrivenangel•about 2 hours ago

"The reality is that code that runs and makes the CI green can still be a bad solution, and engineering has always been about implementing adequate, scalable, and extensible solutions."

Adequate often means done and cheap

josephg•about 2 hours ago

> Adequate often means done and cheap

It really, REALLY depends what you're working on. If you're throwing together an internal tool or simple dashboard, it doesn't really matter what the code looks like. But if you're writing software that other programs will depend on, bad design choices ripple out and affect another generation of software. Imagine slop in the Linux kernel, in Google Chrome, or in your compiler or runtime. It's not acceptable.

I know a lot of people spend their careers writing end user software and web UIs. AI is increasingly a good choice for this sort of code. But that's not all of us. And it's not all of the software being written.

DrewADesign•about 2 hours ago

As long as safe and stable are assumed to be base-level requirements… maybe?

skydhash•about 1 hour ago

I was just watching a video about system engineering and the following stuck:

Stakeholder needs: What people want to get done with the product

Management needs: How to manage the spending of resources (time, money,…) to create the product

Engineering needs: What is the product

You have to balance the three. Sometimes it’s simple and easy to get right. Sometimes it’s complex enough, you’re never truly sure until the product is out in the wild.

Software is malleable and we can easily do iterations, which is not possible with hardware. But today, we have a skew towards engineering, where the whole focus is to create a solution, whatever that is. No understanding of the problem, no proper allocation of resources, just do something. Even if it is plastering over the crack for the eleventh time.

solid_fuel•about 2 hours ago

Disagree, adequate means adequate. Done and cheap is what you call it when a solution is adequate. If the solution isn't adequate, it doesn't matter if it's cheap, because it isn't done.

AmareshHebbar•about 1 hour ago

If I can't explain the code without rereading the diff, I probably shouldn't merge it.

rvz•about 1 hour ago

> Before coding agents, when given a task, I would explore the codebase, think of different solutions, experiment, and only then implement. That could take days of consolidating all that context. When I finally submitted that PR, confidence was higher, and explaining each of my changes to my coworkers was easier.

Now we are getting to the point where we are speed-running the deskilling of engineers into comprehension debt, and they themselves are rapidly losing confidence in reviewing code they did not write.

I think this blog post [0] is the best example of what could go entirely wrong and even worse when you do not know the technology.

If you cannot explain a change even when "the CI is green" or "all tests passing", I will immediately reject it.

Maybe great for vibe coding prototypes, but it all changes when that code is deployed onto mission critical systems. Just ask Amazon with Kiro. [1]

[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...

[1] https://www.reuters.com/business/retail-consumer/amazons-clo...

eranation•about 1 hour ago

LLMs diverge, not converge. They slightly increase entropy if not controlled. You can have DRY skills and use AI to organize AI (in loops(tm) like Boris does), but eventually, if you don’t understand the code, you are taking yourself out of the loop. And it’s not just job security that’s on the line, it’s the increasing cost for AI to babysit AI. If you or your “loops” (or paperclip, Hermes, gastown, or next in class agents of agents that run your entire company) let slop-debt gradually sneak in, the cost to fix it later will become prohibitive. (You can always just rewrite it, but as the race for “feature complete” and “zero backlog” continues, rewriting an ever growing set of new daily table stakes will become an economic moat.)

TLDR: Keeping your codebase human readable and reason-about-able is not just helping humans to stay relevant. It will save costs for LLMs to maintain it.

_wire_•about 3 hours ago

"Even if it works?"

How do you verify that it works?

serious_angel•about 2 hours ago

For example, the following "works":

    json='{ "left":2, "right":2 }';    
    result="$(
        perl -e '($_)=<>; / "left":(\d+), "right":(\d+)/; print $1 + $2, "\n";' <<< "$json";
    )";
    printf '%s\n' "$result";

Yet, it is literally the same as:

    printf '%s\n' "$(( 2 + 2 ))";

p1024k•about 2 hours ago

As I read the author's intent, the issue is code that he cannot understand or control: even if the solution provided by the AI works, he will not adopt it unless he can understand and control it. That should be the baseline assumption.

However, if AI provides a solution, as the person using AI, one should conduct research before making a decision. This is not in conflict with or hindered by the use of the ideas provided by AI.

andyfilms1•about 2 hours ago

I will say--as someone who has fielded late night troubleshooting calls--I totally understand OP's point of view. It's reasonable to expect that you will be able to answer questions about something that you ship, or brainstorm ways to solve a problem a customer is encountering while using something you provided them.

The obvious counterargument is "well, just ask the AI for those answers," but the AI lacks the context and experience that you have. Sometimes, genuinely, the user really is just "holding it wrong," but none of the current AI models would ever admit that, and you'd spend hours trying to solve an unsolvable problem.

Grombobulous•about 2 hours ago

I think this policy is probably more prescriptive than I would go with myself. I like to think of my risk tolerance first to help make that determination.

For example, I use a vibecoded internal tool written in Go. I don’t even know how to write Go. Haven’t read a single line of the code. I just wanted to move from bash scripts to using cloud SDKs for performance reasons.

But the internal tool is a convenience tool, and you can do everything it does using alternative methods. So if it breaks, there is no real negative impact besides the personal convenience of anyone using it. There’s some documentation on how to do everything manually if needed.

Here’s another example: you’re making a static website. No JavaScript, no interactivity. Truly, what could go wrong? And while I do understand HTML a lot better than Go, it wouldn’t really matter if I didn’t.

skydhash•about 1 hour ago

> Here’s another example: you’re making a static website. No JavaScript, no interactivity.

Linking a huge file that consumes clients’ bandwidth for no reason? Embedding PII in the HTML source? And if setting up your own server, misconfiguring it?…

fzeroracer•about 1 hour ago

If I'm on call solving a problem another engineer caused and I reach out to them for clarification and they say 'I don't know, the AI wrote it' I am going to advocate for them being fired tomorrow.