FR version is available. Content is displayed in original English for accuracy.
Advertisement
Advertisement
⚡ Community Insights
Discussion Sentiment
63% Positive
Analyzed from 5027 words in the discussion.
Trending Topics
#code#context#claude#don#memory#more#need#write#better#models

Discussion (141 Comments)Read Original on HackerNews
”compare these three cars. Oh btw I am a data engineer, and my moms maiden name is Joana, and I am allergic to bad poetry. And code should be DRY, I prefer SQL over Python and what’s the most poisonous flower in Scandinavia?”.
I’ve had so much wierd output because context is ”””memorized””” and bleeding into completely unrelated projects and conversations. It’s the first feature I turn off.
My guess is that has something to do with the training process leaving models unable to differentiate between “what’s happening now” and “what happened before”. Perhaps if making inferences from memories was actually part of the training process things would be different but my sense is that as an inference-time-only feature this just gets the models confused.
And LLMs are NOT intelligent enough to survive even mild context poisoning.
It'll assume I own a datacenter and have lots of gpus just because I asked to research things.
Session logs can absolutely be useful, but not when building further. It's just that that the place they slot in is during validation. You know, that place between the markdown plan and CI passing, where there's 800 new lines of code and it all seems sort of fine when you click around?
Session logs can show you what sort of manual validation happened. CI will run the tests you had, and the code will show you what new unit tests were added, but session logs can show you that the agent drove the app with Playwright, or that the agent read and considered the prod config as well as the dev config.
Nothing bulletproof, but not every piece of validation work merits a test in the repo that lives forever. We've gotten a lot of mileage out of re-analyzing the sessions, figuring out where the agent made decisions without asking, and forcing the agent to consider validation for those decisions. That's the sort of thing that's hard to dictate up front but easy to highlight with the session logs.
I've been using a custom harness based on https://minimal-agent.com/ (itself based on swe-mini-agent), which is like 50 lines for the core logic. Bash is all you need.
For small tasks, I find it's about 8x faster (and uses 8x fewer tokens) than the standard harness for each model.
For bigger tasks I haven't tested it much. It seems to work too but I think they're a bit less focused and productive in that case. It could be that those big harnesses' 20k token system prompts are doing something important with regard to steering software development workflows. (e.g. I heard Fable has a custom system prompt in Claude Code which might explain its markedly more proactive behavior.)
So I want to say there's still a lot of value in context engineering though it seems to diminish with each model release (since they're fine tuned on mostly non stupid behavior and need less hand holding).
I can't see how it would diminish unless you are literally working on public domain stuff. Unless stuffing context becomes cost effective and will not affect AI reasoning (this will be much harder), I don't see why context engineering is here to stay until we have close to AGI.
First, I think that models still need a context layer. One way to think about 'context' is as a form of compression. You provide the model context because it makes it easier for the model to figure out what to do. Even in a world with infinite model capacity and infinite model context, this is still useful because it allows the model to avoid rederiving everything from first principles every time. As long as models perform better using fewer tokens and as long as we care about token spend, context is a useful (necessary?) shortcut.
Once you bite that you need some form of context layer, the question is which. Here I do agree that it is better to work with what the models will find familiar (markdown files colocated with code, for eg). But this speaks to over-engineered solutions not understanding their main user (the agent) more than it does the need or lack there of.
B) The other use of context is that it introduces entirely new information via RAG
B will never go away (as others pointed out). A, well that’s just something we’re all going to keep getting surprised at. We’ll barely give it any direction or context and the newer models will simply find the happy path.
The author is kind of suggesting that their context wasn’t really necessary to get the happy output, I think.
Chain of reasoning is a lot of context to guide token generation, but we simply see that newer models don’t need that context to get to the answer. I’m mostly reiterating this because there’s a hot take here, and that is this agentic stuff may be waived away by magic frontier-llm wand , all of a sudden.
I thought each new generation typically used more reasoning tokens?
Bear in mind that brain architecture is learnt too - just over a much longer timescale than an individual lifetime.
I do think we need another layer, but it should be a routing layer. I am finalizing my pi-brains extension for Pi (https://github.com/earendil-works/pi) which does this:
https://github.com/gitsense/pi-brains
Right now "humans" need to define the routing rules for how to access information, but I will support what I call "knowledge agents" that can monitor conversations to inject context when needed.
What do you think is the potential value that you might get out of this, which is not already available with the existing options?
If this works, it means we can probably get by with smaller models (since it doesn't need to know everything). LLMs are pattern matchers, and if you can provide them with the right shape (context), they should produce the expected output.
For my solution to work, you need business buy-in, which I don't think will be a problem. Enterprise wants to know how tokens are being spent, so I can see them wanting structured analysis during code reviews.
What may also not be obvious is that the information is ultimately designed to live with your code. Lessons and notes are designed to be mapped to files, so if you want to know why a piece of code is implemented in a certain way, you can have the LLM filter by files to help find the needle in the haystack.
It is a hard problem, but the only missing piece is discipline, which I believe business leaders will not have an issue with enforcing since we are ultimately talking about eliminating/significantly reducing the bus factor in our code.
If you look at https://github.com/gitsense/smart-ripgrep, you can get a better sense of how context can be injected when it is needed.
By propery categorizing lessons and notes, it should make it easy to scrub and keep up to date.
I also suggest mapping lessons and notes to files when possible to make discovery and cleanup easier.
Also if context runs out you can just do "cat todo.md | agent" and you're off to the races again.
That is a sophisticated memory system though -- maybe not to you experienced humans!
They talk about memory only being useful when guided by a human, I think the proper solution is deeper than that, it probably involves feeding the entire codebase and every agent session into a finetuning of the model, though at that point you might want some guidance to avoid feeding certain sessions into the model. Or maybe not, maybe the bitter lesson applies.
It is like an annoying friend, who remembers something from a past conversation, that you have grown and developed from, but they still want to hold it against you.
I have Claude/Codex keep logs [1]. It's just prompted in my AGENTS.md [0].
> Every session must produce one of: a session log OR a plan, and end with a written summary appended to it. Default to a log; reserve plans for substantive design work.
It's incredibly valuable. For example today I started a few sessions off like this:
- What's the status of my work on Renovate?
- I was recently working on X, find that
- Did we fix the issue with backups? What are the next steps?
- This bug came up again. Didn't we fix it already?
[0]: https://github.com/shepherdjerred/monorepo/blob/main/AGENTS....
[1]: https://github.com/shepherdjerred/monorepo/tree/main/package...
Ideal outcome is this turns into a startup. I think there's a real need for team-oriented AI to avoid siloing of knowledge.
Now, I'll agree that this is probably the sort of thing I should put in the CLAUDE.md, but in this case it wasn't on my radar to put that in my CLAUDE.md, so it was nice that it surfaced that.
It does sometimes go awry though. Today I was asking about a problem I was having authenticating, and it said "you may be running into this trusted proxy setting because you put your apps behind an haproxy". That is true of 95% of our apps, so it was worth mentioning, but in this case it was not so I had to correct it. But, I'm glad it mentioned it because if we did have it proxied it could have saved me a lot of time.
I agree with other commenters here, if anything is worth being rememebered, it will be in code comments, git commit messages, CLAUDE.md or other formal documentation. The auto memory system just causes confusion and leaves stale and outdated information written down.
Its an interesting thought experiment as well, I originally thought that having the model write down memory files by itself would be a nice addition, but after playing around with it, it became clear to me that good as an idea turns out bad in practice because the model can't correctly gauge what deserves being stored as a memory.
So you told it don't go the fuck to sleep ;)
In a project my questions are usually revolved around the same topic. Having context carried across threads actually make a lot of sense.
In the general mode where I'm expecting models to be *stateless*, having memory is very annoying.
> Don't turn a one-off or area-specific comment into a durable memory without my explicit confirmation. You have a history of over-indexing on one-offs, and those memories end up getting cited to override well-tuned skills.
(Semi-serious question)
Its certainly true at the moment, but give it 10 years and we might have systems that are much cheaper and much better at context management than they are now.
(Apologies to anyone who is under the impression that we were very likely going to be at the singularity in 10 years time. Possible != very likely)
is the conclusion really that its just more important to create proper artifacts from any tricks that got the llm to understand the code better?
is the tool for searching the history just bad?
This is infuriatingly common wrt talking/writing about how to use AI effectively. All of the "this is how you write an AGENTS.md" and "you need to talk to it like X to optimize it". Like sure, you can believe that as much as you want but unless you provide some evidence you can keep your shitty CLAUDE.md to yourself and don't pollute the whole company's git repo, thanks.
> Don't start generating an auto-memory entry before asking me. Ask first, write only if I confirm — no speculative drafting.
No more crap after this.
Incidentally I don’t recall Opus 4.8 asking me once in the past few weeks. Older models did ask semi-frequently.
I found that every model will still manually check every file/function, they immediately assume that anything in context is stale.
That's sensible because often the user edits stuff while they're running.
What it does is save it from having to grep blindly about the codebase. But I think I'd get roughly the same benefit by just dumping the function headers then.
> I believed this so strongly that my company built an entire product around this concept. I used to tell folks that "session transcripts were the new oil," that they were more valuable than the code itself.
> […]
> We don't really write code by hand anymore.
Honestly, isn't this just influencer spam? What possible value is there in reading about people who used to have products, but no longer write their own code, complaining about the inscrutable prediction machine they have handed that job and their livelihoods to?
Like, if you have complaints about the thing, perhaps you should address them to your supplier directly. None of your readers can help, and nobody's magic folk solution to your problem is better than yours.
And there are so many of these sorts of posts. Are we not entirely cooked?
(I think I have concluded that if people writing about AI aren't writing about interesting things they have achieved with small, local LLMs — which for clarity I am fully interested in reading - then I'm done reading. This whole blogging-about-cloud-AI genre is just weird and irresponsible now)
It’s gonna be a living breathing world, you see. You’re going to be like “omg, this game even accurately captured the blog posts, woah”.
Edit:
This whole blogging-about-cloud-AI genre is just weird and irresponsible now)
I sincerely never considered it was a whole genre.
Something about this idea really resonates with certain personality types. I equate it to the Zettelkasten hype phase from several years ago. People (...like me..) got really wrapped up in the belief that the process was more important that the content. "Linking" was an "activity." Something good will happen as long as you (a) take notes on stuff and (b) link them to other notes on stuff.
You see the same thing with the session transcripts people. They're building ever more sophisticated setups of indexing and storing and cross referencing every conversation they've ever had on the (I would argue) mistaken belief that the transcripts are the valuable part, rather than the uncomfortable part where you go do something. A lot of it, I say from falling in the trap, is fancy procrastination.
(Although, I have found myself jealous on many occasions where their fancy system retrieves something they vaguely recall from a conversation they had 3 months ago. So, who knows.)
Like ancient people? Because "new oil" whilst I get what it might imply sounds bad to me. Oil has been superseded in many places so "new oil" is like going backwards still.
Reference: data is the new oil is a term coined in 2006.
We're in 2026. See what I mean.
I think you may just misunderstand the point of having / writing a personal blog. I write because it's fun! Whether the reader gets any value out of reading it is almost entirely beside the point.
(Also several comments here directly post a fix to the problem stated in the blog post, so readers can and do often help)
I used to blog, as it goes, and I have supported and enabled many more, so no, not really.
I'm trying to rebuild my life so I am in an experimenting and learning phase rather than a massive coding phase, and most of my code work is maintenance of things I have built. That which I do code, I am still coding by hand, though I am dealing with other people's Claude output and I am really unimpressed by it. It's often rather crass.
But I would say to you that if you personally don't write code now but you do have a dependency on one of two presumably unprofitable cloud AI providers, aren't you in trouble? How is this not a three-alarm fire for you?
Unfortunately the point of code is rarely to impress people (certainly not other engineers) or to avoid being "crass." 99.99% of code exists to achieve business outcomes, and velocity matters a lot in many contexts. A lot more than elegance or impressiveness.
The platform risk is a valid concern but alleviated by China's theft and redistribution of open models.
Sometimes it takes me a day or more to find the one line fix or abstraction necessary, while claude can hammer through a hundred line fix in under an hour.
Quick and cheap are two of the three fabled: "Fast, cheap, and good: choose two"
Same thing with hobby projects - I might ask ChatGPT or Gemini some questions about best practices in Swift for example, but writing code is done by hand.
As others said - if you don't use it, you'll lose it. And I'd rather keep my skills up to date.
And I don't think I'm unique. I see enough posts like https://news.ycombinator.com/item?id=48777257 pop up that I'm reasonably confident all the hype around LLMs saving so much time and increasing productivity so much is, well, just that: hype.
Sure, if you can't code at all and want to build something, an LLM is going to be great for you, even if you can't evaluate the code quality or determine if there are bugs just by looking at the code. But I've been coding professionally for 25 years, and as a hobby since I was like 8 years old. I like to code! It's a passion of mine. If the LLM isn't doing it faster or better (and most of the time it isn't), why wouldn't I write code myself?
I'll have the LLM write boilerplate stuff or do tedious refactoring, because I just don't feel like it (even if it does take longer). But for the real work? Of course I do most of it myself.
One area where the LLM shines for me is finding the root causes of bugs. It can generally do that much faster than I do. Often orders of magnitude faster (like minutes instead of hours or days). But when it comes to write the fix for the bug? It's usually faster and better if I do it myself.
More generally I am interested in burnout-avoidance tools; things that help me start, finish, things that write tests I guess, certainly code scaffolding.
But I am fully unconvinced that my burnout will be improved by ending up owning the responsibility for wobbly or inscrutable AI-generated code with potential landmines in it; that will keep me up at night just the same.
This is pretty funny because it's about the depth of understanding of every 'AI expert' on Linkedin. People who praise the context window as basically magic have no idea how any of this works.
"Spicy Autocomplete", I've heard it called.
Toggle it off and never think about it again.
I refuse to believe this is true. The ability for an agent to find information from before a compaction is incredibly useful. At compaction time it's impossible to know what exactly may be still needed.
Not that this isolated article is super damning or anything, but the accumulated set of all these reports has left me only empathetic, I think, of these other devs. Like, I just want to tell them, "it can be ok, it doesn't need to be like this.."
I think Opus might be on similar level for most of what I'm doing, but I haven't used it much recently, so I can't remember the difference. So I guess I'll find out on the 7th when they pull the plug again! (Free-ish trial of Fable ending.)
That being said, I tried using other frontier models to help with a Pong clone the other day and they were introducing new bugs at approximately the same rate as they were fixing it. On Pong!! I found that amusing because I couldn't think of a simpler game, so it didn't inspire confidence.
Fable's doing just fine on an online multiplayer game though. I have no idea how that works. (Maybe it would fail Pong too?? I haven't tested that!)
>> We don't really write code by hand anymore.
The software world is very close to building a super intelligent senior software developer. Companies like this will ask all the best things a software engineer does automatically. Now claude will add it into the coding agents itself.
Damn, I didn't see this coming.
Its first the build the intelligent builder. We will figure out what we want to build later.
Edit: Before more people take it seriously. This is sarcasm. I don't wish this.
Once the automator automates itself fast enough, we won't have the ability to opine what gets built. The LLM will decide. Just like right now sometimes LLMs delete tests so they pass, they could just delete humanity if humans get in their way.
Yeah. Two more weeks, as they say. Just need to iron out some kinks.
You can rely on it like 95% of the time but that means if you keep it running continuously the error rate rapidly approaches 100%. That's getting a little better with each release, and it might actually hit the point where you can more or less trust it indefinitely (on well defined workflows).
Or at least it would, if context window permitted...
Except Claude is more expensive than an actual senior software developer. Otherwise, why are many companies terrified of the usage bill that gets printed on the invoice?
The nonsense in "tokenmaxxing" was a complete marketing scam and illusion of cheap tokens which in reality were heavily subsidized.
The entire point is detecting bad code before it reaches production. [0] AI generated or not.
[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...