Discussion (39 Comments)
I consider Gemma 4 31B (dense / no MoE) the new baseline for local models. It's obviously worse than the frontier models, but it feels less like a science experiment than any previous local model I've run, including GPT OSS 120B and Nemotron Super 120B.
On my M5 Max with 128 GB of RAM and the full 256K context window, I see RAM use spike to about 70 GB, with something like 14 GB of system overhead. A 64 GB Panther Lake machine with the full Arc B390, or a 48 GB Snapdragon X2 Elite machine, could probably run it with a 128K to 256K context window. Maybe you can squeeze it into 32 GB (27.5 GB usable) with a 32K context window?
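For a rough sanity check on those figures, here is a back-of-envelope sketch in Python. The parameter count, quantization width, and attention geometry are all assumptions rather than published specs, so treat it as an illustration of where the memory goes, not a measurement.

    # Rough memory estimate for a dense local model: quantized weights plus KV cache.
    # Every architecture number below is an assumption chosen for illustration only.
    def estimate_gb(params_b=31,        # assumed parameter count, in billions
                    weight_bits=4.5,    # ~4-bit quantization with some overhead
                    n_layers=48, n_kv_heads=8, head_dim=128,  # hypothetical geometry
                    kv_bytes=2,         # fp16 KV-cache entries
                    context=256 * 1024):
        weights_gb = params_b * 1e9 * (weight_bits / 8) / 1e9
        # K and V, per layer, per token:
        kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes
        kv_gb = kv_per_token * context / 1e9
        return weights_gb, kv_gb

    w, kv = estimate_gb()
    print(f"weights ~{w:.0f} GB + KV cache ~{kv:.0f} GB = ~{w + kv:.0f} GB")
    # With these made-up numbers: roughly 17 GB + 52 GB = ~69 GB, which is in the
    # same ballpark as the ~70 GB spike reported above.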
Even last year, seeing this kinda performance on a mainstream-ish/plus configuration would have seemed like a pipe dream.
Same laptop, and my contrived test was having it fix 50 or so lint errors in a small vibe-coded C++ repo. I wanted it to be able to handle a bunch of small tasks without getting stuck too often.
GPT OSS 20B was usable but slow, and actually frequently made mistakes like adding or duplicating statements unnecessarily, listing things as fixed without editing the code, and so on.
Qwen 3.5 9B with Opencode was much faster and was actually able to work through a majority of the lint warnings without getting stuck, even through compaction, and it fixed every warning with a correct edit.
I tried 4-bit MLX quants of Qwen 3.5 9B, but they would eventually crash due to insufficient memory. I switched to GGUF, which I run with llama.cpp, and it runs without crashing.
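A minimal sketch of that kind of setup, assuming the llama-cpp-python bindings and a placeholder GGUF filename; the main point is that n_ctx caps the context (and therefore the KV cache) to something the machine can actually hold.

    # Minimal llama.cpp setup via the llama-cpp-python bindings.
    # The GGUF filename and context size are placeholders, not recommendations.
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen3.5-9b-q4_k_m.gguf",  # hypothetical local GGUF file
        n_ctx=32_768,       # cap the context so the KV cache fits in RAM
        n_gpu_layers=-1,    # offload all layers to Metal/GPU when available
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user",
                   "content": "Fix the unused-variable warning in this function: ..."}],
        max_tokens=512,
    )
    print(out["choices"][0]["message"]["content"])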
It is absolutely not comparable to frontier models. It's way slower, gets basic info wrong, and really can't handle non-trivial tasks in one go. I asked it for an architecture summary of the project and it claimed use of a library that isn't present anywhere in the repo. So YMMV, but it's still nice to have, and hopefully the local LLM story can get much better on modest hardware over time.
This is not said often enough.
Yes, local LLMs are great! But reading most HN posts on the subject, you'd think they're within reach of Opus 4.7.
There is a very small, very vocal, very passionate crowd that dramatically overstates the capabilities of local LLMs on HN.
As much fun as it is to run these things locally, don't forget that your time is not free. I am slowly migrating my use cases to OpenRouter and run the largest Qwen model for under $2-3/day with serious use on personal projects.
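For anyone curious what that migration looks like in practice: OpenRouter exposes an OpenAI-compatible endpoint, so the client-side change can be as small as the sketch below. The model slug is a placeholder, not necessarily the model the parent comment uses.

    # OpenRouter is OpenAI-API compatible; only the base_url and key change.
    # The model slug below is a placeholder; check their catalog for the one you want.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    resp = client.chat.completions.create(
        model="qwen/qwen-2.5-72b-instruct",   # placeholder slug, not an endorsement
        messages=[{"role": "user", "content": "Summarize this diff: ..."}],
    )
    print(resp.choices[0].message.content)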
I have a brand-new M5 MacBook Pro, top end with all the specs, and I've tried local models; they're barely functional.
The case for local: 1) control, 2) privacy, 3) a transparent cost model.
Cloud has tremendous value for speed, plug-and-play setup, and performance. You need to decide how those compete with the benefits of local, both today and a year from now.
But they are interesting and fun to play with! I do a LOT of work on local agent harnesses etc., mostly for fun.
My current project is a zero install agent: https://gemma-agent-explainer.nicklothian.com/ - Python, SQL and React all run completely in browser. Gemma E4B is recommended for the best experience!
This is under heavy development and needs Chrome for both HTML5 Filesystem API support and LiteRT (although most Chromium-based browsers can be made to work with it).
It's different to most agents because it is zero install: the model runs in the browser using LiteRT/LiteLLM (which gives better performance than Transformers.js), and Filesystem API gives it optional sandbox access to a directory to read from.
It is self-documenting: you can ask questions like "How is the system prompt used?" in the live help pane, and it has access to its own source code.
There's quite a lot there: press "Tour" to see it all.
Will be open source next week.
Vibe coders out here thinking all software development is solved because they made an (ugly and unoriginal) dashboard for their SaaS clone and a single-column landing page with a 3x3 feature-card grid that's identical to every other vibe coder's "startup".
If you are spending $800/month on tokens you are likely to notice degradation for local models compared to near-frontier models. The models I can run locally are consistently worse than Claude Sonnet 4.6 (again for the work I give them), although Qwen3.6 does feel almost like magic for its size because it can do a lot. The really big open-weight models should be better, but they want 200+GB RAM, which will need a correspondingly expensive CPU.
How long do people realistically expect a laptop to stay competitive with SOTA local models? Especially in a space where model sizes, context windows, and inference requirements keep moving every year.
And even if the hardware lasts, the local experience usually doesn’t. A heavily quantized local model running at tolerable speeds on consumer hardware is still nowhere near frontier hosted models in reasoning, coding, multimodal capability, tool use, or reliability.
The economics just don’t make sense to me unless you specifically need offline inference, privacy guarantees, or low latency for a niche workflow. Otherwise you’re tying up $10k upfront to run an approximation of what you can already access through a subscription that continuously improves over time.
You could literally put the difference into index funds and probably cover the subscription indefinitely from the returns alone, even accounting for gradual price increases.
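As a rough illustration of that claim (the hardware price, return rate, and subscription cost below are all assumptions):

    # Back-of-envelope: returns on an un-spent hardware budget vs. a subscription.
    # Every figure here is an assumption chosen purely for illustration.
    hardware_cost = 10_000      # dollars up front for a local-inference machine
    real_return = 0.04          # assumed long-run real return on an index fund
    subscription = 20 * 12      # dollars per year for a hypothetical $20/month plan

    yearly_return = hardware_cost * real_return
    print(f"~${yearly_return:.0f}/yr from the invested difference "
          f"vs ~${subscription}/yr for the subscription")
    # With these assumptions, ~$400/yr of returns covers a ~$240/yr plan,
    # though nowhere near the $800/month of token spend mentioned upthread.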