RU version is available. Content is displayed in original English for accuracy.
Advertisement
Advertisement
⚡ Community Insights
Discussion Sentiment
74% Positive
Analyzed from 1511 words in the discussion.
Trending Topics
#context#models#coding#model#https#local#run#prompt#more#using

Discussion (43 Comments)Read Original on HackerNews
I feel like there's potential to improve that part of just cutting a tunable amount (τ) of tokens off the tail end of those traces, because you may potentially lose valuable insight earlier in the trace? They did train the model (in SFT) to put the relevant information into the tail (τ) of the trace, but I'm not sure this is the best possible way.
0. https://arxiv.org/pdf/2509.26626
1. https://arxiv.org/pdf/2510.06557
I think this is very important to eventually become a viable replacement for coding models. Because most of the time coding harnesses are leveraging tool calls to gather the context and then write a solution.
I am hopeful, that one day we can replace Claude and OpenAI models with local SOTA LLMs
It is more finicky than Claude but if you hand hold it a bit it's crazy.
So yeah, while it's true that qwen3.6 is good for agentic coding, it's not very good for exploring the codebase and coming up with plans. You need to pair it today with a model capable of ingesting the whole context and providing a detailed plan, and even then the implementation might take 10x the amount of time it'd take for sonnet or Gemini 3 to crunch through the plan.
EDIT:
My setup is really as simple as possible. I run ollama on a remote server on my local network. In my laptop I set OLLAMA_HOST and do `ollama pull qwen3.6:27b`, which then becomes available to the agent harnesses. I am not sure now how I set the context, but I think it was directly in oh-my-pi. So server config- and quantization-wise, it's the defaults.
[1]: https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-...
I also go outside for fresh air while I wait for a session to run.
They're also pretty terrible at summarization. Pretty much always some file read or write in the middle of the task would cross the context margin and it would mark it as completed in the summary. I think leaving the first prompt as well as the last few turns intact would improve this issue quite a lot, but at low context sizes thats pretty much the whole context ...
But as soon as you go below Q8, the models get stuck in repeating loops, get the tool calling syntax wrong or just starts outputting gibberish after a short while.
EDIT: just found this recipe repo, may wanna give it a go: https://github.com/noonghunna/club-3090
EDIT-2: this can also shave off a lot of context need for tool calling -> https://github.com/rtk-ai/rtk
These are two realworld experiments, whose results are disappointing for those expecting levels of performance comparable to cloud services:
- https://deploy.live/blog/running-local-llms-offline-on-a-ten...
- https://betweentheprompts.com/40000-feet/
The first is even the 35b version of qwen3.6.
On a real GPU using 27b with the latest quants the experience is better. It's still not the same as opus running on a subsidized GPU farm. Well it is better for privacy at least.
I find it interesting how 2 people can read the same thing and come to very different conclusions.
Is that 128gb RAM or VRAM?
it did it successfully, but it did need a follow up correction prompt, overally pretty impressive for a model with 760M active parameters, but definitely not deepseek-r1 level
that being said, if something with 760M active parameters can be this good then, there's a good chance it is likely that api-based models are likely to get cheaper in the future
Prompt ------
``` can you write me some js code (that i can put in the console for about:blank) which will basically create a timer for for me that i can start, stop, and store current values for (or rather lap)
so i want it to create buttons (start, stop, lap buttons) on the page for me with labels and divs and other elements that accordingly record the current information and display the current information, and can accordingly start, stop and lap :)
the js code that i copy paste automatically creates the html buttons and divs and other elements that can manage the timer and accordingly the timer works with them ```
LM studio doesn't let me actually run this yet though: "Unsupported safetensors format: null"
No I am not saying this model is a drop in Claude replacement. But I think in 2 years we might be really surprised what can be done in a desktop with commodity hardware, no connection to the internet, and a few models that span a subset of tasks.
Really happy to see amd put their hat in the ring. It's a good day for amd investors. I know a lot of AI bros will scoff at this, but having your first training run is a big deal for a new lab. AMD is on their way despite Nvidia having years of runway
same thing with smol local LLMs versus the big ones in the sky. your smol local LLM will only be able to tackle projects which are not comercially valuable anymore, because people expect 100x scope and features. which is fine as a hobby/art project
yes, we'll do amazing things with local LLMs in 2 years, but the big LLMs will do things beyond imagination (assembly vs C)
I think we are going to see a surge in software claiming to do everything and becoming bloated and unsustainable.
I already see 1gpu local models 1 shotting games via vibe coding. I see people doing agentic programming, granted more slowly and cheaply than 12 Claude sessions.
The difference isn't as big as it was 2 months ago. In the past 45 days so many model releases have happened. Meanwhile frontier performance has stagnated and degraded. If it's a taste of what is to come I welcome it.