RU version is available. Content is displayed in original English for accuracy.
Advertisement
Advertisement
⚡ Community Insights
Discussion Sentiment
66% Positive
Analyzed from 3598 words in the discussion.
Trending Topics
#code#deterministic#agents#llms#llm#agent#should#more#software#system

Discussion (68 Comments)Read Original on HackerNews
- Please consult me when you encounter any ambiguous edge cases
Attaching the AI to production to directly do things with API calls is bad. For me the only use case where the app should do any AI stuff is with reading/categorizing/etc. Basically replacing the "R" in old CRUD apps. If you want to use that same new AI based "R" endpoint to auto fill forms for the "C", "U", and "D" based on a prompt that's cool, but it should never mutate anything for a customer before a human reviews it. Basically CRUD apps are still CRUD apps (and this will always be true), they just have the benefit of having a very intelligent "R" endpoint that can auto complete forms for customers (or your internal tooling/Jenkins pipelines/etc), or suggest (but never invoke) an action.
llm -> prompt -> result
llm -> prompt + prompt encoded as skill -> result
llm -> prompt + deterministic code encoded as skill -> result
I do think prompting to generate code early can shortcut that path to deterministic code, but we're still essentially embedding deterministic code in a non-deterministic wrapper. There is a missing layer of determinism in many cases that actually make long-horizon tasks successful. We need deterministic code outside the non-deterministic boundary via an agentic loop or framework. This puts us in a place where the non-deterministic decision making is sandwiched in between layers of determinism:
deterministic agentic flows -> non-deterministic decision making -> deterministic tools
This has been a very powerful pattern in my experiments and it gets even stronger when the agents are building their own determinism via tools like auto-researcher.
https://lethain.com/agents-as-scaffolding/
Your comment EXACTLY mirrors my experience. Week 1 was ever expanding prompts, and degrading performance. Week 2 has been all about actually defining the objects precisely (notes, tasks, projects, people etc) and defining methods for performing well defined operations against these objects. The agent surface has, as you rightly point out, shrunk to a translation layer that converts natural language to commands and args that pass the input validator.
Such an LLM might have fared better with the strawberry test.
Are google search results modifying your software at runtime?
Take or agent chat for example, the output text is a ui, agents can generate charts and even constrained ui elements.
Isn’t that created and adapted at run time?
If you mean like agents live modifying your code. I think that’s pretty much here as well. Can read the logs and send prs.
The only thing is how fast that loop will execute from days or hours to mins or seconds, and what validation gates it needs to pass.
My git repo is pretty much self modifying personal software at this point, that I interface through the ide chat window.
But I don’t think we will ever lose the intermediary deterministic language (code) between the llm and the execution engine.
It would be prohibitively expensive to run everything through models all the time.
But I am starting to think we need a more precise language than English when talking with LLMs. That can do both precision and ambiguity when you need either.
Good luck with that. Users will flood you with complaints if a button moves 5px to the left after a design update. A program that is generated at runtime, with not just a variable UI but also UX and workflows, would get you death threats.
The problem is that outside of that most people want boring and regular interfaces so they can get in and solve the problem and get out - they don't want to "love" it or care if its "sexy" they want it to work and get out of the way.
LLMs transmogrifying your software at ever request assumes people are software architects and creators who love the computer interface, and that just doesn't describe the bulk of the population.
Most people using computers use the to consume things or utilize access to things, not for their own sake, and they certainly don't think "what if I just had code to do x..." unless x is make them a lot of money.
I just had Claude write itself a couple shell scripts to handle a bunch of common cases (like running tests) in my workflow where it just couldn't figure it out efficiently. Now it just runs those tools and sets things up instead of spinning in circles for half an hour.
Every time it tries to ask me if it can run some one-off crazy shell or python one-liner to do something, I've started asking myself if I should have it write a tool I can auto-approve instead.
However, there are some things that I think need a foundational next-generation improvement of some sort. The way that LLMs sort of smudge away "NEVER DO X" and can even after a lot of work end up seeing that as a bit of a "PLEASE DO X" seems fundamental to how they work. It can be easy to lose track of as we are still in the initial flush of figuring out what they can do (despite all we've already found), but LLMs are not everything we're looking for out of AI.
There should be some sort of architecture that can take a "NEVER DO X" and treat it as a human would. There should be some sort of architecture that instead of having a "context window" has memory hierarchies something like we do, where if two people have sufficiently extended conversations with what was initially the same AI, the resulting two AIs are different not just in their context windows but have actually become two individuals.
I of course have no more idea what this looks like than anyone else. But I don't see any reason to think LLMs are the last word in AI.
This is essentially declarative programming. Most traditional programming is imperative, what most developers are used to - I give the exact set of instructions and expect them to be obeyed as I write them. Agents are way more declarative than imperative - you give them a result, they work on getting that result. Now the problem of course, is in something declarative like say, SQL, this result is going to be pretty consistent and well-defined, but you're still trusting the underlying engine on how to go about it.
Thinking about agents declaratively has helped me a lot rather than to try to design these rube-goldberg "control" systems around them. Didn't get it right? Ok, I validated it's not correct, let's try again or approach it differently.
If you really need something imperative, then write something imperative! Or have the agent do so. This stuff reads like trying to use the wrong tool for the job.
And then you run into similar issues as the llm does, like silent failures, loops, contradictions unless you're very careful.
The essence might be the same closed world assumption problem. In llm case this manifests as hallucination rather that admitting it does not know.
This is the only way to guarantee AI usage doesn't burn you. Any automation beyond this is just theater, no matter how much that hurts to hear/undermines your business model.
A bird sings, a duck quacks. You don't expect the duck to start singing now, do you?
Slash commands, for instance, are a misfeature. I should never have to wait for the chatbot finish a turn so that I can check on the status of my context window or how much money I've spent this session. Control should be orthogonal to the chat loop.
Even things that have nothing to do with controlling the text generator's input and output are entangled with chat actions for no good reason except "it's a chat thing, let's pretend we're operating an IRC bot".
There are a zillion LLM agents out there nowadays, but none of them really separate control from the agent loop from presentation well. (A few do at least have headless modes, which is cool.)
I get what you're trying to say but in practice architecting what you propose is considerably more difficult. Why not build it and try to get hired by one of the bigcos?
They just want features. They don't really care about duplicated work, so half of them reinvent the TUI rendering wheel. Pluggability is something that might be actually hostile to their interests in lock-in. And the AI labs probably think "after a couple more scaling cycles, our models will be so good that our agents can just rewrite themselves from scratch"; until they hit a compute or power wall, it always looks rational to them to defer rearchitecting.
Another real possibility is that if you work on an agent with a really clean architecture and publish it in hopes of getting hired by some AI company, all of them think "that looks great, but we don't want to rearchitect right now". Your code winds up in the training set, and a year and a half from now, existing agents can "one-shot" rewrites along the lines of your design because they're "smarter".
As for me, I'm not that interested, personally. There are other things I want to build and I'm working on those.
In the GUI I can see the context indicator and usage stats.
It also makes it easier to jump between conversations and see the updates.
Sometimes I use Claude Code or opencode in the terminal, and my experience is much poorer compared to the Codex desktop app.
My personal opinion is that AI and agents are being misrepresented… The amount of setup, guidance and testing that’s required to create smarter version of a form is insane.
At the moment my small test is: Compressed instructions (to fit within the 8k limit) 9 different types of policies to guide the agent (json) 3 actual documents outlining domain knowledge (json) 8 Topics (hint harvesting, guide rails, and the pieces of information prepared as adaptive cards for the user) 3 Tools (to allow for connectors)
The whole thing is as robust as I can make it but it still feels like a house of cards and I expect some random hiccup will cause a failure.
0 - https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-...
Making an unreliable, nondeterministic system give reliable results for a bounded task with well-understood parameters is... like half of engineering, no?
There's a huge difference between "generate this code here's a vague feature description" and "here's a list of criteria, assign this input to one of these buckets" -- the latter is obviously subject to prompt engineering, hallucination, etc -- but so can a human pipeline!
...which is why we write deterministic code to take the human out of the pipeline. One of the early uses of computers was calculating firing tables for artillery, to replace teams of humans that were doing the calculations by hand (and usually with multiple humans performing each calculation to catch errors). If early computers had a 99% chance of hallucinating the wrong answer to an artillery firing table, the response from the governments and militaries that used them would not be to keep using computers to calculate them. It would be to go back to having humans do it with lots of manual verification steps and duplicated work to be sure of the results.
If you're trying to make LLMs (a vague simulacrum of humans) with their inherent and unsolvable[1] hallucination problems replace deterministic systems, people are going to eventually decide to return to the tried and true deterministic systems.
1: https://arxiv.org/abs/2401.11817
But if you're trying to tell me that every time you list criteria you get them all perfectly matched, you're clearly gifted.
Somewhere in between that I guess is the varying levels of intelligence more likely able to make the “right” decision for anything you throw at it.
In the real world almost nothing runs like that - only software and even that has a lot of failures.
So perhaps rather than trying to make agents run deterministicly the goal is to assume some failure rate and find compensation control around it.
I feel hooks are integral part of your code harness, that’s only deterministic way to control coding agents.
I am finding that the better the quality gates are the lower quality llm you can use for the same result (at a cost of time).
I've tried doing something similar with AI by running a prompt several times and then have an agent pick the best response. It works fairly well but it burns a lot of tokens.
Agents aren't reliable; use workflows instead.
Phase 1: only test files may be altered, exactly one new test failure must appear.
Phase 2: only code files may be altered. The phase is cleared when the test now succeeds and no other tests fail.
If you get stuck, bail and ask for guidance
https://github.com/yehudacohen/open-artisan/
Hopefully, I'll merge in my large structural changes in the next couple of weeks. These structural changes will enhance the state machine meaningfully, as well as adding support for hermes agenet.
Still have yet to see a universal treatment that tackles this well.
I see this as the most robust way to build a predictable system that runs in a controlled way while taking advantage of probabilistic AIs while reducing the impact of their alucinations.
LLMs simply can't be trusted to follow instructions in the general case, no matter how much you constraint them. The power of very large probabilistic models is that they basically solved the _frame problem_ of classic AI: logical reasoning didn't work for general tasks because you can't encode all common sense knowledge as axioms, and inference engines lost their way trying to solve large problems.
LLMs fix those handicaps, as they contain huge amounts of real world knowledge and they're capable of finding facts relevant to the problem at hand in an efficient way. Any autonomous system using them should exploit this benefit.
Both designs (Lightroom, game engines) have worked successfully.
There's probably nothing that prevents mixing both approaches in the same "app".
https://arxiv.org/abs/2312.00990
https://github.com/salesforce/agentscript
https://yogthos.net/posts/2026-02-25-ai-at-scale.html
1. an adversarial agent harness that uses one agent to create a plan and implement it, and another to review the plan and code-review each step.
2. an agentic validation suite -- a more flexible take on e2e testing.
3. some custom skills that explain how to use both of those flows.
With this in place you can formulate ideas in a chat session, produce planning artifacts, then use the adversarial system to implement the plans and the validation layer to get everything working e2e for human review.
There are a lot of tools you can use for these things but I chose to just build the tooling in the repo as I go.
https://x.com/i/status/2051706304859881495