Discussion (74 Comments), from HackerNews
I can keep the same high level conversation going for an entire day over a million LOC+ codebase without ever hitting meaningful token limits. No compaction or summarization tricks needed. I can burn 50 million tokens in recursive calls and still not touch 100k tokens in my root conversation thread.
There is some rework needed to "bootstrap" the agent each time it has to descend back into Narnia, but this is still far more efficient than carrying around one big flat context that tries to cover everything all the time.
Recursion is very effective at controlling token use, but it can only go so far. I've not observed any uplift for recursive depth beyond 1. I have seen the agent attempt it a few times, but the practical performance is simply not there. External symbolic recursion does not appear to be something the frontier models have been trained for. They are fantastic at emulating recursion in context, but we don't want that if we are trying to achieve a reduction in token use.
I don't bother warning it in the system prompt anymore. It's pointless. I let it bump its head as required. A few hundred tokens and the agent is back on track each time.
The only tools permissible to root in my scheme are call() and return().
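A minimal sketch of that root/child split, to make the token accounting concrete. Everything here is an assumption for illustration: the `Agent` class, the word-count "tokenizer", and the fake exploration in `work()` are hypothetical, and `return_` stands in for return() because `return` is a Python keyword. The point is only that the child's context is thrown away and the root keeps a one-line summary.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: one token per whitespace-separated word.
    return len(text.split())


@dataclass
class Agent:
    depth: int = 0
    max_depth: int = 1  # the comment above reports no uplift beyond depth 1
    context: list[str] = field(default_factory=list)
    tokens_spent_below: int = 0

    def context_tokens(self) -> int:
        return sum(count_tokens(message) for message in self.context)

    def call(self, task: str) -> str:
        """Delegate a task to a fresh child; only its summary comes back."""
        if self.depth >= self.max_depth:
            raise RuntimeError("recursion depth limit reached")
        child = Agent(depth=self.depth + 1, max_depth=self.max_depth)
        summary = child.work(task)
        # The child's context is discarded; we only keep a running cost total.
        self.tokens_spent_below += child.context_tokens() + child.tokens_spent_below
        self.context.append(f"call({task!r}) -> {summary}")
        return summary

    def work(self, task: str) -> str:
        # Hypothetical stand-in for a model reading lots of code.
        self.context.append(task)
        self.context.append("file contents " * 5000)
        return self.return_(f"done: {task}")

    def return_(self, summary: str) -> str:
        # Hand a short result back to the caller.
        return summary


root = Agent()
root.call("find the auth bug")
print(root.context_tokens(), root.tokens_spent_below)
```

In this toy run the child burns about ten thousand "tokens" while the root thread grows by a single short line.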
(And I rarely fill the context window that far anyway when working on a single task, or a series of tasks that are related enough to warrant the same context; more typical is anywhere between 200k and 600k or so.)
I'm not saying that no one ever has this experience, but it's odd to me that some people see it so often that it warrants giving it a name.
Personally I consider < 60k to be the smart zone for Opus. This is worse for Opus 4.7 and 4.8 because of the more granular tokenizer.
60k isn't much bigger than the system prompt.
Do you have any old documentation that it's picking up and referencing? If you set all claude settings back to default do you see the same issue?
100k tokens "by lunch" is also not my finding; the newer models will hit that already in the initial exploratory phase.
- Work Mode - HITL/AFK
- Problem Statement
- Who It Affects - Primary / Secondary User
- User Stories
- Business Case
- Why Now
- Success Criteria
- In Scope/Out of Scope (Out of Scope is very important)
- Thinnest Slice (I've found this super valuable: it means you max out the amount of 'product' for your buck and avoid diminishing marginal returns or overbuilding. Often I will build this)
- Eigenfeature - What is the larger feature we _could_ build (but probably won't) that would solve for this use case and other stuff I might not have thought of
- Technical Notes
- Deps
- Schema Changes
- Risks
- Final Recommendation [go / no go, including on scope]
There's a note in my Claude / Agents MD which says no net new feature gets introduced without this, and I get it to move through a pipeline of folders (active, approved, shipped, proposed etc). It all runs in a system of MD files, and I have even created a little MD Kanban from the metadata!
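A rough sketch of how such an MD Kanban could be generated, assuming one folder per pipeline stage and one Markdown file per feature. The folder order in `STAGES`, the use of the first heading as the card title, and the function names are all assumptions for illustration, not the commenter's actual setup.

```python
from pathlib import Path

# Assumed stage folders, in an assumed pipeline order.
STAGES = ["proposed", "approved", "active", "shipped"]


def title_of(md_file: Path) -> str:
    # Use the first Markdown heading as the card title, else the file stem.
    for line in md_file.read_text(encoding="utf-8").splitlines():
        if line.startswith("#"):
            return line.lstrip("#").strip()
    return md_file.stem


def build_board(root: Path) -> dict[str, list[str]]:
    # Map each stage to the titles of the feature documents in its folder.
    board: dict[str, list[str]] = {}
    for stage in STAGES:
        folder = root / stage
        files = sorted(folder.glob("*.md")) if folder.is_dir() else []
        board[stage] = [title_of(f) for f in files]
    return board


def render(board: dict[str, list[str]]) -> str:
    # Emit the board itself as Markdown: one heading per stage, one bullet per card.
    lines: list[str] = []
    for stage, cards in board.items():
        lines.append(f"## {stage} ({len(cards)})")
        lines.extend(f"- {card}" for card in cards)
    return "\n".join(lines)
```

Calling `render(build_board(Path("features")))` would print the board for a hypothetical `features/` directory; moving a file between folders moves the card.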
I then start a fresh conversation, make it analyze the design document and code, and for larger changes, generate a high-level implementation document which includes concrete phases or steps. I review this plan and iterate if necessary.
Then for each phase I make it generate a detailed plan for that phase and save it alongside the other documents. Once the phase is over, I make it write a summary of what was done, the decisions made and the reasons for them. That is typically a good point to compact the model's context.
These documents give additional context for when I make another model do code review, and help illuminate drift or gaps from the main design document.
Maybe I could achieve better and quicker results with keeping the context in the proper zone, but trying it will have to wait until the next project.
But this is also why so-called "memory" systems are usually a mistake that makes the models dumber. They don't have memory, they only have context, and every irrelevant fact you shove into the context is less context for the problem. Fewer distractions, better results.
The way to have the agent remember things is to have it document its work, like a human developer would do if they wanted their project to be friendly to other developers working on it. Good developer docs with an index page and a good plan with checklists, in concise Markdown files, checked in to the repo is the ideal memory for models and the ideal docs you need to figure out WTF the model has been up to. Helps with code review, too, whether by humans or another model. There's no downside.
But, it does a good job following existing conventions in a codebase, as long as they're really consistent. So the more actively you enforce that consistency the more likely it is to do the right thing without memories or prompting.
I don't like "never do" or "always do" type rules in AGENTS.md or in memory, as it often over-interprets them and ties itself in knots trying to satisfy an impossible set of goals.
I build my own framework and spend a lot of time trying to debug this. It's not so much the context size in hard numbers but rather the probability that there is debris or wrong directions in the window drowning out the things the user thinks are important.
This manifests as the LLM repeatedly going back to the thing that failed when it tried it just before the last approach, etc. The frequency of things in the context window gives them weight even if they are the wrong things.
I have a lot of tricks like not giving the llm lots of tools but rather giving it a tool it can use to search for tools etc.
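One way the "tool that searches for tools" trick could look, as a hedged sketch: the registry contents, the keyword-overlap ranking, and the names `search_tools` and `use_tool` are hypothetical. The idea shown is that only these two functions need to be described to the model up front, instead of every tool schema.

```python
from typing import Callable

# Hypothetical registry: name -> (description, implementation).
TOOLS: dict[str, tuple[str, Callable[..., str]]] = {
    "read_file": ("Read a file from the repository", lambda path: f"<contents of {path}>"),
    "run_tests": ("Run the unit test suite", lambda: "ok"),
    "grep_code": ("Search source code for a text pattern", lambda pattern: f"<matches for {pattern}>"),
}


def search_tools(query: str, limit: int = 3) -> list[str]:
    """Rank registered tools by keyword overlap with the query."""
    words = set(query.lower().split())
    scored = []
    for name, (description, _) in TOOLS.items():
        haystack = set(description.lower().split()) | set(name.lower().split("_"))
        score = len(words & haystack)
        if score:
            scored.append((score, name))
    # Best score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [f"{name}: {TOOLS[name][0]}" for _, name in scored[:limit]]


def use_tool(name: str, *args: str) -> str:
    # Dispatch to a tool the model discovered via search_tools().
    return TOOLS[name][1](*args)


print(search_tools("search code"))
```

A real setup would likely use embeddings or a better ranker; plain keyword overlap is just the simplest thing that demonstrates the shape.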
But the bigger solution is in process where you use something like superpowers to force the llm through stages and you control the context that carries forward.
When it comes to source code, I feel like LLMs could just as well work with something like minified source code. If an LLM is trained well on programming, I think there's no reason why something like a variable should be represented by more than a single token. Comments can be discarded, etc. In fact, considering that embeddings for LLMs are very rich, I think common ops could be reduced to a single token.
Imo that's why LLMs are so good at reverse engineering. A lot of the time, assembly (with symbols) is pretty close to the source code, but compressed and encoded, and if you're familiar with the patterns of your compiler, reversing it is not that difficult.
Anyway, context engineering could be a huge boon to input token curation imo (and maybe it already is).
Personally, I already see LLMs and agents as blackboxes. I give each feature request to multiple LLMs and then compare the results. I don't manually use "sessions" at all. I just look at the outcome. When I dislike it, I "git reset --hard", change my prompts and restart the feature request.
To have an ongoing sense of which agents perform best, I keep a log and calculate an Elo score of which agents meet my demands best. This score is important to me, not so much how the agent achieves it.
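For anyone who wants to try the same thing, here is a small sketch of the standard Elo update applied to such a log. The log format (a list of winner/loser pairs, one per head-to-head comparison), the starting rating of 1500, and the K-factor of 32 are assumptions, not details from the comment above.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    # Standard Elo expectation: probability that A beats B.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def rank_agents(log: list[tuple[str, str]], start: float = 1500.0, k: float = 32.0) -> dict[str, float]:
    """Replay a log of (winner, loser) pairs and return the final ratings."""
    ratings: dict[str, float] = {}
    for winner, loser in log:
        ratings.setdefault(winner, start)
        ratings.setdefault(loser, start)
        # The winner gains exactly what the loser gives up.
        delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings


# Hypothetical agent names; each entry is one feature request where the
# first agent's result was preferred over the second's.
log = [("agent_a", "agent_b"), ("agent_a", "agent_c"), ("agent_c", "agent_b")]
for name, rating in sorted(rank_agents(log).items(), key=lambda item: -item[1]):
    print(f"{name}: {rating:.0f}")
```

When more than two agents attempt the same request, one option is to log every pairwise comparison implied by the ranking of the results.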
In an interactive session, adding "Fine, but make the button red" after the model generated a first solution more than doubles the tokens used, because the model now gets not only the original code and the feature request but also the updated code plus the change request as input tokens.
Sending a feature request to an LLM and then sending the feature request again with "The button shall be red" only doubles the tokens used.
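The arithmetic behind those two cases, as a quick sketch. It counts input tokens only, and the three sizes are made-up numbers for illustration: `base` is the original code plus the feature request, `generated` is the updated code from the first attempt, and `change` is the short follow-up instruction.

```python
def interactive_input_tokens(base: int, generated: int, change: int) -> int:
    # Turn 1 reads the base; turn 2 re-reads the base plus the first
    # attempt's output plus the change request.
    first_turn = base
    second_turn = base + generated + change
    return first_turn + second_turn


def restart_input_tokens(base: int, change: int) -> int:
    # Both runs start fresh; the second run just has the extra sentence.
    return base + (base + change)


base, generated, change = 20_000, 3_000, 10  # illustrative sizes only
print(interactive_input_tokens(base, generated, change))  # 43010
print(restart_input_tokens(base, change))                 # 40010
```

The gap between the two is exactly the size of the first attempt's output, which the interactive session has to read back in.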
It seems obvious. Moreover, in a simple model, it seems like whatever tokens you do add have to have MORE information than the average in the existing window.
In a non-trivial model (and this is the model I would choose), since you are adding them to the end, they likely have to have MUCH more information.
Proof, as always, is left as an exercise for the reader.
[1] https://pi.dev/
Admittedly I have been doing this as a precaution, based on anecdotal evidence, not because I had bad experiences with longer context deterioration myself.
In the brief time I had access to Fable 5, it went on long running tasks (>45 mins) into the 30-40% zone without apparent context coherence problems.
Not really tho right? Since we got to 1m context in mid 2025 nearly no one has gone higher.
In essence, we run many short agent loops, generating their prompts dynamically from structured data. Each loop advances the state in a small step towards the final goal.
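A toy sketch of that pattern, assuming the structured data is a simple goal plus todo/done lists. The `State` class, the prompt layout, and `fake_model` (a stand-in for a real model call that always succeeds) are hypothetical. What it shows is that each loop's prompt is rebuilt from the state, so no conversation history is carried forward.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    goal: str
    todo: list[str]
    done: list[str] = field(default_factory=list)


def build_prompt(state: State) -> str:
    # Each prompt is regenerated from the state; no chat history is carried.
    finished = ", ".join(state.done) or "nothing yet"
    return f"Goal: {state.goal}\nFinished: {finished}\nNext step: {state.todo[0]}"


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; always reports success.
    return "ok"


def run(state: State, model=fake_model, max_loops: int = 20) -> list[str]:
    """Run one short loop per step and return the prompts that were sent."""
    prompts: list[str] = []
    for _ in range(max_loops):
        if not state.todo:
            break
        prompt = build_prompt(state)
        prompts.append(prompt)
        if model(prompt) == "ok":
            # Advance the state by one small step.
            state.done.append(state.todo.pop(0))
    return prompts


state = State("ship the red button", ["write test", "change colour", "run tests"])
for prompt in run(state):
    print(prompt, end="\n\n")
```

Each prompt stays a few lines long no matter how many steps have already run, because only the summary in `done` grows.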
For example, it may be the case that a long context full of useful information relevant to the task is completely fine, perhaps even beneficial. And if the context contains a bunch of unrelated tangents and conflicting instructions, then it will be detrimental.
Have there been studies on what makes models get dumber? To what extent is context length to blame vs context quality?
https://arxiv.org/abs/2506.00069