Discussion (185 comments), originally on Hacker News
This back and forth will take quite a while, but the resulting implementation plan will be 10x better than the original.
You can automate this by giving Codex a goal, and a skill to call Claude to review the implementation spec until they both agree it's done.
Then, for critical code, have them both implement the spec in a worktree, then BOTH critique each other's implementation.
More often than not, Claude will say to take 2 or 3 pieces from its design over to Codex, but ship the Codex implementation.
Jokes aside, I agree about having LLMs iterate. Bouncing between GPT and Opus is good in my experience, but even having the same LLM review its own output in a new session started fresh without context will surface a lot of problems.
This process takes a lot of tokens and a lot of time, which is fine because I’m reviewing and editing everything myself during that time.
and in both cases i both “know” that i can tell the difference and “know that i cannot tell the difference”. what anyone takes from that in terms of what it says about me, personally, is a bit of a Rorschach test, but Astrology is about as apt a description as there is… xD
But it's arguably less accurate to the original recording.
(I don't think that's the full picture but, there's definitely something fishy going on there.)
Say what you will with proper reasoning or arguments if you feel compelled, tired reddit-commentary like that helps no one.
We're year 4 into this discussion and camps have only gotten more bifurcated. There's no 1-1 discussion to have about this as of now, at least not before the crash.
Your only hope in such discourse is not trying to convince the other party how wrong they are, but appealing to an as of yet undecided party. Be it with reason, or simply pointing out how absurd some comments sound to the average person.
By the end you have piecemeal "tickets" for your coding agent. If you have multiple developers you can sync them all up into GitHub, and someone could take some locally, or you can just have Claude work on all of them with subagents. The key feature there is that because it's all piecemeal, the context stays per task.
Then I run a /loop 15m If you're currently working ignore this. Start on the next task in gur if you have not. If you finished all work and cannot pass one gate, work on the next available task.
(Note: gur is my shorthand for GuardRails)
I also added a concept called "gates" so a task cannot complete without an attached gate, gates are arbitrary, they can be reused but when assigned to a task those specific assignments are unique per task. A task is basically anything you want it to be: unit test, try building it, or even seek human confirmation. At least when I was using Beads it did not have "gates" but I'm not sure if it has added anything like it since I stopped using Beads.
Claude will ignore the loop if it's currently working, and when it's "out of work" it will review all available tasks.
If anyone's curious, it's MIT licensed and on GitHub:
https://github.com/Giancarlos/guardrails
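The "a task cannot complete without an attached gate" rule can be sketched in a few lines of Rust. This is a hypothetical model written for this thread, not GuardRails' actual schema or API: the `Task`, `Gate`, `GateStatus`, and `try_complete` names are all invented for illustration.

```rust
/// State of a single gate. Hypothetical; the real tool may track more.
#[derive(Debug, Clone, PartialEq)]
pub enum GateStatus {
    Pending,
    Passed,
    Failed,
}

/// A check that must pass before a task may be closed,
/// e.g. "unit tests", "cargo build", or "human sign-off".
#[derive(Debug, Clone)]
pub struct Gate {
    pub name: String,
    pub status: GateStatus,
}

#[derive(Debug, Default)]
pub struct Task {
    pub title: String,
    /// Gate assignments are owned by the task, so each one is unique per task
    /// even if the gate definition itself is reused elsewhere.
    pub gates: Vec<Gate>,
    pub done: bool,
}

impl Task {
    /// Marks the task done only if it has at least one gate and every gate passed.
    pub fn try_complete(&mut self) -> Result<(), String> {
        if self.gates.is_empty() {
            return Err(format!("task '{}' has no gate attached", self.title));
        }
        if let Some(g) = self.gates.iter().find(|g| g.status != GateStatus::Passed) {
            return Err(format!("gate '{}' has not passed", g.name));
        }
        self.done = true;
        Ok(())
    }
}
```

The point of the sketch is only that "no gate" and "gate not passed" are both hard errors, so an agent cannot mark work finished just by saying so.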
I mean that if you ask codex on gpt 5.5 to submit to a plan reviewer subagent that also uses gpt 5.5, this is enough to get a very good review and reassessment of the plan.
My hypothesis is that it’s even better than opus.
The reason why submitting the product of one LLM to another for review works is that you need a fresh trajectory. The previous context might have “guided” the planner into some bias. Removing the context is enough to break free from that trajectory and start fresh.
Have Claude produce that spec 10 times, using the same prompt and same context. Identical requests, but you'll get 10 unique answers that will contradict each other, with each response seeming extremely confident.
It's scary how confident you people are in these outputs.
There are real decisions to be made when going from a vague prompt to a spec. It's not surprising that an LLM would produce different specs for the same work on different runs. If the prompt already contained answers to all the decision points that come up when writing the spec then the prompt would already be the spec itself.
An LLM isn't deterministic, but it also isn't iterative without a human in the loop. You give it the same spec 10 times and it produces 10 results that aren't far off from each other but are vastly different when you go into the weeds. And not different in a way that counts as improvement.
What has always mattered is how you decide the specs, not the specs in themselves.
But they didn't ask humans, they asked a machine. We expect our machines to behave in predictable ways.
> If the prompt already contained answers to all the decision points that come up when writing the spec then the prompt would already be the spec itself.
This is one of the best arguments against using LLMs I've seen.
It reduces to the classic argument- at the point where you've described a problem and solution in sufficient detail to be confident in the results, you've invented a programming language.
That being said, I agree people trust AI too much, especially people with less experience. It’s easy to forget the models are mirrors of who we are as the drivers of the input context, not mentors that will reliably guide us to best practices.
I do this with other languages, too, not just Rust. Thing is, you have to put a hard stop at some point because the models will always find something to nitpick.
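A rough sketch of that kind of review loop with a hard stop, in Rust. Everything here is hypothetical: `review_until_stable` and the two closures stand in for whatever calls your harness makes to each model, and no real Codex or Claude API is used.

```rust
/// Outcome of the loop: the final draft plus whether the reviewer signed off.
#[derive(Debug, PartialEq)]
pub struct ReviewResult {
    pub draft: String,
    pub rounds: usize,
    pub approved: bool,
}

/// Alternates between a reviewer and a reviser until the reviewer has no
/// objections or `max_rounds` is reached.
///
/// `review` returns a list of objections (an empty list means "approved").
/// `revise` produces a new draft from the old draft and the objections.
pub fn review_until_stable<R, V>(
    mut draft: String,
    max_rounds: usize,
    mut review: R,
    mut revise: V,
) -> ReviewResult
where
    R: FnMut(&str) -> Vec<String>,
    V: FnMut(&str, &[String]) -> String,
{
    for round in 1..=max_rounds {
        let objections = review(&draft);
        if objections.is_empty() {
            return ReviewResult { draft, rounds: round, approved: true };
        }
        draft = revise(&draft, &objections);
    }
    // Hard stop: the reviewer may still have nitpicks, but we stop anyway.
    ReviewResult { draft, rounds: max_rounds, approved: false }
}
```

The `approved: false` case is the interesting one: it tells the human that the loop ended because of the round budget, not because the models agreed.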
I know LOC is a silly metric, but ~1300 tests for 130k lines averages out to a test per 100 lines - isn't this awfully low for a highly complex piece of code, even discounting the fact that it's vibecoded? 100 LOC can carry a lot of logic for a single test, even for just happy paths.
If you're building a distributed system and you don't have more tests and testing code than actual code, by an order of magnitude most likely, then you're missing test coverage.
No joke: it works for me. I have a 45kLOC prod code (just code, no comments, no blanks), tested by a 30kLOC test code containing 1600 tests (that run in 30secs).
I helped with the test infrastructure/architecture. Sometimes I had to write the first few tests of a particular kind, but now Claude TDDs for me.
A fair share of my CLAUDE.md instructs in how I like my tests, when to write them (first), different types of tests (unit, faked-services, db, e2e, etc.)
Asking Claude to find weak tests has helped a lot in getting here. I also do review AI-gen'd code, pretty much line-by-line, before accepting it.
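The test-density figures quoted in these comments are easy to check. A minimal sketch, using only the numbers as given in the thread (the helper name `loc_per_test` is mine):

```rust
/// Average number of lines of code per test.
pub fn loc_per_test(loc: u32, tests: u32) -> f64 {
    f64::from(loc) / f64::from(tests)
}
```

`loc_per_test(130_000, 1_300)` gives 100.0 and `loc_per_test(45_000, 1_600)` gives 28.125, so the second project has roughly 3.5 times the test density of the first. Its test-to-production ratio (30k to 45k lines) is still about 0.67, well short of the "more test code than actual code" bar mentioned earlier.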
Honestly, despite all the hype around Rust in the community, the fact that AI can't handle lifetimes reliably makes me reluctant to use it. The AI constantly defaults to spamming .clone() or wrapping things in Rc, completely butchering idiomatic Rust and making the output a pain to work with.
On the other hand, it writes higher-level languages better than I do. For those succeeding with it, how exactly are you configuring or prompting the AI to actually write good, idiomatic Rust?
What harness and model have you been using? For the last few months, essentially since I did the whole "One Human + One Agent = One Browser From Scratch" experiment, I've almost exclusively been doing cross-platform native desktop development with Rust, currently with my own homegrown toolkit basically written from scratch, all with LLMs, mostly with codex.
But I can't remember a single time the agent got stuck on lifetime errors; that's probably the least common issue I come across with agents + Rust. A much bigger issue is the ever-expanding design and LLMs being unable to build proper abstractions that are actually used in practice and reduce the amount of code instead of just adding to the hairball.
The issue I'm trying to overcome now is that each change takes longer and longer to make, unless you're really hardcore about pulling back the design/architecture when the LLM goes overboard. I've only succeeded in having ~10 minute edits in +100K LOC codebases in two of the projects I've done so far, probably because I spent most of the time actually defining and thinking of the design myself instead of outsourcing it to the LLM. But this is the biggest issue I'm hitting over and over with agents right now.
The complexities LLMs end up putting themselves in are more about the bigger architecture/design of the program than about concrete lines: things end up so tangled that every change requires tens of changes across the repository. You know, typical "avoid the hairball" stuff you come across in larger applications...
format:
  glob: "*.rs"
  run: cargo fmt -- --check
lint:
  glob: "*.rs"
  run: cargo clippy -- -D warnings
tests:
  run: cargo test
audit:
  run: cargo audit
+ hooks that shove the lefthook automatically in the ai's face
---
rustfmt.toml:
edition = "2021"
newline_style = "Unix"
use_small_heuristics = "Max"
max_width = 100
I've not done any particular or special prompting.
But Python or TypeScript are full of errors all the time. I'd rather fall back to Perl than Python. Perl has been excellent all along.
Yes, you need to help it with the types of tests: you still need to know what you want from it. But once you have all types of tests (unit, db, fake-services, e2e, etc.) in place and documented, it can basically write tests until your coverage tool says it's 85%. Then you can ask it to find the weakest tests: you review those and make sure they are not weak, or Claude understands why they are not weak. Then let it find the next batch of weakest tests. Etc.
TDD finally makes sense economically for me on the types of projects I usually work on.
What model are you using, and what frameworks are you using?
This is not a hard problem for LLMs to solve.
Rust is nearly the perfect language for LLMs.
It's exceptionally expressive, and it forbids entirely the most common globally complex bugs that LLMs simply do not (and won't for some time) have the context window size to properly reason about.
Dynamically typed languages are a disaster for LLMs because they allow global complexity WRT implicit type contracts (which LLMs do not uphold and cannot be relied on to uphold).
If you're going to add types, as someone pointed out earlier, why are you even telling an LLM to write Python anyways?
Rust is barely harder to read than Python with types. It's highly expressive.
You have the `&mut` which seems alien, verbose (safe) concurrency, and lifetimes - which - if you're vibe coding... you don't really need to understand that thoroughly.
You want an LLM to write code in a language where "if it compiles, it works" - because... let me tell you, if you vibe code in a language where errors are caught at runtime instead of compile time... It will definitely NOT work.
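A small illustration of the kind of contract the Rust compiler enforces, as opposed to an implicit one that only fails at runtime. The `Payment` type and function names are made up for this example, and it shows only which class of mistakes moves to compile time; it does not make logic errors impossible.

```rust
/// Payment state modeled as an enum, so every state is known to the compiler.
#[derive(Debug, Clone, PartialEq)]
pub enum Payment {
    Pending,
    Settled { amount_cents: u64 },
    Refunded { amount_cents: u64, reason: String },
}

/// Returns the amount currently held for this payment, in cents.
///
/// The `match` must be exhaustive: adding a new `Payment` variant later
/// makes this function fail to compile until the new case is handled.
pub fn held_cents(p: &Payment) -> u64 {
    match p {
        Payment::Pending => 0,
        Payment::Settled { amount_cents } => *amount_cents,
        Payment::Refunded { .. } => 0,
    }
}

/// `Option` makes "might be missing" part of the signature, so callers
/// cannot silently forget the empty case.
pub fn first_settled(payments: &[Payment]) -> Option<&Payment> {
    payments.iter().find(|p| matches!(p, Payment::Settled { .. }))
}
```

In a dynamically typed language the equivalent of a forgotten `Refunded` branch would typically surface only when that state is actually hit at runtime, which is the "global complexity" an agent has to keep in its context.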
- Garbage collected, so no reasoning tokens or dev cycles are wasted on manual memory management. You say that if you're vibe coding you can ignore lifetimes, but that is in response to a post saying AI can't do a good job and constantly uses escape hatches that lose the benefits of Rust (and can easily make things worse: copying data all over the place is terrible for performance).
- Very fast iteration speed due to JIT, a fast compiler and ability to use precompiled libraries. Rust is slow to compile.
- High level code that reads nearly like English.
- Semantically compatible with Java and Java libs, so lots of code in the training set.
- Unit tests are in separate files from sources. Rust intermixes them, bloating the context window with tests that may not be relevant to the current task.
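For readers who have not seen the Rust convention the last bullet refers to: unit tests usually sit in a `#[cfg(test)]` module in the same file as the code they test. A minimal sketch (the `parse_pair` function is invented for the example, and the test module is conventionally just named `tests`):

```rust
/// Parses a "key=value" pair, trimming whitespace around both parts.
pub fn parse_pair(line: &str) -> Option<(&str, &str)> {
    let (key, value) = line.split_once('=')?;
    let key = key.trim();
    if key.is_empty() {
        return None;
    }
    Some((key, value.trim()))
}

// Compiled only under `cargo test`, so it is not part of release builds,
// but it does sit in the same file an agent reads when editing `parse_pair`.
#[cfg(test)]
mod parse_pair_tests {
    use super::*;

    #[test]
    fn parses_simple_pair() {
        assert_eq!(parse_pair("a = 1"), Some(("a", "1")));
    }

    #[test]
    fn rejects_missing_key() {
        assert_eq!(parse_pair("=1"), None);
    }
}
```

This is a convention rather than a requirement: Cargo also picks up integration tests from a separate `tests/` directory at the crate root, where they are compiled as their own crates and see only the public API.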
Sounds like your work doesn’t need Rust and that’s ok.
But don’t generalize.
Sure if you want to vibe code a TODO app where it's literally just copying and pasting one it's already seen 10,000 times before, it can do it in Python.
This hasn't been true since around gpt-4.5 on the OpenAI side of things. The 5.x models have been pretty much solid on Rust for a while now.
It sets up your repo to ensure agents use a workflow which breaks your user requests down into separate beads, works on them serially, runs a judge agent after every bead is complete to apply code quality rules, and also strict static checks of your code. It's really helpful in extracting long, high-quality turns from the agent. It's what we used to build Offload[1].
0: https://github.com/imbue-ai/rust-bucket : A rusty bucket to carry your slop ;)
1: https://github.com/imbue-ai/offload
Fixed.
This is a problem when language designers are mathematicians and don’t understand typographical nuance and visual weights.
The whole "with AI" kind of reduces my hate for Rust though, and increases the appreciation for how strict the language is, especially when the agents themselves do the whole "do change > see error/warning > adjust code > re-check > repeat" loop, which seems to work better the more strict the language is, as far as I can tell.
The "helpful" error messages from Rust can be a bit deceiving though, as the agent's first instinct seems to be to always try what the error message recommends, but sometimes the error is just a symptom of a deeper issue, not the actual root issue.
I mean God help us should a crustacean try to understand the merits of my claim.
“Oh he’s saying something negative about rust…” Downvote!
I think with AI the language should still be readable. Humans need to be able to understand what’s going on!
(Yes, I know the 'a lifetimes are a bit weird, and that's not something that exists in TypeScript, but that's also not something you use every day in Rust either.)
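For reference, this is roughly what an explicit `'a` annotation looks like when it does come up, next to the more common case where the compiler infers it. Both functions are illustrative only.

```rust
/// Returns the longer of two string slices.
///
/// `'a` says: the returned reference is valid only as long as both inputs
/// are. No data is copied; the annotation is purely a compile-time contract.
pub fn longest<'a>(a: &'a str, b: &'a str) -> &'a str {
    if a.len() >= b.len() { a } else { b }
}

/// With a single reference parameter, lifetime elision fills in the
/// annotation, which is why most everyday signatures never mention `'a`.
pub fn first_word(s: &str) -> &str {
    s.split_whitespace().next().unwrap_or("")
}
```

The annotation is only needed in `longest` because there are two input references and the compiler cannot tell which one the result borrows from.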
If you want to give it a fair shot, it does take some time to get used to, coming from something like Python or Ruby. I won't deny that. I've found that using LSP-assisted semantic syntax highlighting helps, for me, on the typographic front.
I don't think typographic design is a key consideration in most languages' designs, though, and I don't think it should be. The main thing I look for is consistent, relatively predictable rules around the syntax, as far as that layer of language choice goes.
In tsz I have hard gates that disallow doing work in the wrong crate etc.
https://github.com/mohsen1/tsz
Maybe I'm using agents wrong, but I'm not sure how you'd end up in that situation in the first place? When I start codex, codex literally only has access to the directory I'm launching it, with no way to navigate, read or edit stuff elsewhere on my disk, as it's wrapped in isolation with copied files into it, with no sync between the host.
Hearing that others seemingly let agents have access to their full computer, I feel like I'm vastly out of date about how development happens nowadays, especially when malware and viruses lurk around all the package registries.
My issue is specifically with how the AI uses it. In AI code, .clone() is almost always used as a brute-force escape hatch
Maybe it's harder to reason about the lifetime semantics while also writing code, and works better as a second phase (the de-cloning).
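A sketch of what such a "de-cloning" pass typically changes. The `Config` type and function names are hypothetical; the before/after pair is meant to show the pattern, not any particular codebase.

```rust
#[derive(Debug, Clone)]
pub struct Config {
    pub name: String,
    pub tags: Vec<String>,
}

// Before: the function takes ownership it does not need, so every caller
// that wants to keep its `Config` has to write `summary_owned(cfg.clone())`.
pub fn summary_owned(cfg: Config) -> String {
    format!("{} ({} tags)", cfg.name, cfg.tags.len())
}

// After: the function only reads the data, so it borrows it. Callers pass
// `&cfg` and nothing is copied.
pub fn summary(cfg: &Config) -> String {
    format!("{} ({} tags)", cfg.name, cfg.tags.len())
}

// Borrowing also works for returning parts of the input: this hands back
// views into `cfg.tags` instead of cloning each matching `String`.
pub fn tags_with_prefix<'a>(cfg: &'a Config, prefix: &str) -> Vec<&'a str> {
    cfg.tags
        .iter()
        .filter(|t| t.starts_with(prefix))
        .map(|t| t.as_str())
        .collect()
}
```

The first version compiles and works, which is why it is an attractive escape hatch; the cost shows up as extra allocations and as signatures that no longer say what the function actually needs.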
This is from 2025 - I would like to see an update now on how that system turned out after the vibe hype.
If it is, and it works well, then to me this is far more meaningful than the fact that AI wrote 130K lines of code.
I also had it implement a wasm geodesic calculator in Rust and it's amazing and in my use case is better than geodesiclib using the same updated algorithm.
I'm a "C-nile" Rust folks love to hate and did my first hacking in C (Deep Blue C) on Atari 8-bits. But I'm very impressed with these products and with the ability to leverage some features of Rust with them. (e.g. audit every unsafe instance and define its invariants, etc.)
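As an illustration of "audit every unsafe instance and define its invariants": the usual Rust convention is a `// SAFETY:` comment directly above each `unsafe` block stating why it is sound, and Clippy has an `undocumented_unsafe_blocks` lint that can flag blocks lacking one. The function below is a contrived example of my own; in real code a safe iterator would do the same job.

```rust
/// Sums the first `n` elements of `data` without per-element bounds checks.
///
/// Returns `None` if `n` exceeds the slice length, which is what makes the
/// unchecked access below sound.
pub fn sum_prefix(data: &[u64], n: usize) -> Option<u64> {
    if n > data.len() {
        return None;
    }
    let mut total: u64 = 0;
    for i in 0..n {
        // SAFETY: `i < n` by the loop range and `n <= data.len()` by the
        // check above, so `i` is always in bounds for `data`.
        total = total.wrapping_add(unsafe { *data.get_unchecked(i) });
    }
    Some(total)
}
```

An audit pass then amounts to checking, for each such comment, that the stated invariant is actually established by the surrounding code.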
I also agree with the commenter who said these LLMs are today, at the present moment, good at Go. The only language I notice it seems to be really good above and beyond others at is javascript, I assume because there's so much of it.
The interesting thing is that it was manageable solo (in many ways it's _more_ manageable solo+AIs than with coworkers+(their)AIs), and in such a short amount of time.
In the end it is just a lot of unmaintainable code quickly generated by AI.
Rust makes no promise of being terser than C++, and RSL does less than this considering the optimization.
Also it's only 45/50k LOC, so not so very far from the 36k LOC.
That's great, non-test code is only ~47k lines of code.
Depending on your backend you either ignore them, check them all of the time, some of the time, or have SMT-solvers prove that if you uphold the first one all else must follow.
If you're interested in the last one, have a look at Dafny[0]
[0] https://dafny.org/
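Assuming "them" above refers to contract-style pre- and postconditions, the "all of the time" and "some of the time" options have rough analogues in plain Rust: `assert!` is checked in every build, while `debug_assert!` is compiled out when debug assertions are disabled (the default for release builds). This is a runtime approximation only, not the static proof a tool like Dafny aims for, and the `isqrt` function is just a small stand-in.

```rust
/// Integer square root: the largest `r` with `r * r <= n`.
pub fn isqrt(n: u64) -> u64 {
    // Precondition, checked in every build: keeps `(r + 1)^2` below from
    // overflowing `u64`.
    assert!(n < (1u64 << 62), "n too large for this sketch");

    let mut r = (n as f64).sqrt() as u64;
    // Correct any floating-point rounding in either direction.
    while r * r > n {
        r -= 1;
    }
    while (r + 1) * (r + 1) <= n {
        r += 1;
    }

    // Postcondition, checked only when debug assertions are enabled.
    debug_assert!(r * r <= n && (r + 1) * (r + 1) > n);
    r
}
```

Unlike a comment, neither assertion can drift out of sync with the code silently: deleting or changing one is a visible code change, and violating one fails loudly in the builds where it is enabled.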
The original RSL library has 36 KLoC across C++ source and header files. Rust is supposed to be more expressive and concise. Yet, AI generated 130k LoC. I guess nobody understands how this code works and nobody can tell if it actually works.
Change the skills, ask the agent to do exactly the same in something else.
I am slowly focusing on agent orchestration tools, which make the actual programming language as relevant as doing SOA with BPEL.
Also it is kind of interesting that there is so much enthusiasm to use Claude and Claw all over the place, yet lack of vision on how much the whole infrastructure will improve.
Even when it finally bursts and we get into another AI Winter, what was already achieved isn't going away.
It works for humans because when we get a borrow-check failure, we take a step back and think about the global shape of our code and ownership. LLMs path straight to the goal. Problem: code doesn't compile. Solution: more clone()
https://en.wiktionary.org/wiki/learnings
If you're fine with the generalized form "learned a lesson", then surely "learnings" is fine too. There's no point in trying to police a completely normal and sensible use of language.
Anyway, I accept this usage of the word "lesson", so I also accept "learnings". My point was one of hypocrisy, not policing people in how they can use the word "lesson".
Go is a much better target; I've observed Rails/Ruby code is also much easier for AI to spit out.
And Haskell flies with AI
Rust doesn't add anything over Go for LLM coding.
You can’t have contracts defined in comments in code because there’s no guarantee they won’t be deleted or changed.
Even better, we need the ability to embed directives to LLMs which are NOT comments, but a type of programming construct specifically for this purpose.