Discussion (28 Comments)

sourc3•about 1 hour ago
I am running a quantized Qwen 3.6 9B model on my M4 Pro 48GB and it is barely useful for some basic pi.dev/cc-driven development. I think 128GB desktops are the sweet spot to actually get meaningful work done. However, getting your hands on one of these machines is difficult at the moment.

As much fun as it is to run these things locally, don’t forget that your time is not free. I am slowly migrating my use cases to OpenRouter, where I run the largest Qwen model for under $2-3/day with serious use on personal projects.
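For reference, the OpenRouter setup described above can be driven with the standard openai Python client, since OpenRouter exposes an OpenAI-compatible endpoint; the model slug below is an assumption, so check openrouter.ai for current names:

```python
# A sketch of the OpenRouter setup described above, using the standard
# openai Python client against OpenRouter's OpenAI-compatible endpoint.
# The model slug is an assumption; check openrouter.ai for current names.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

resp = client.chat.completions.create(
    model="qwen/qwen3-coder",  # assumed slug for "the largest Qwen model"
    messages=[{"role": "user", "content": "Refactor this function to be iterative."}],
)
print(resp.choices[0].message.content)
```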

sjones671•about 1 hour ago
Thanks for saying this. There's so much nonsense out there online about local models being better than Opus 4.7 and the like. It's just not true for regular users.

I have a brand new M5 MacBook Pro - top end with all the specs and I've tried local models and they're barely functional.

SecretDreams•1 minute ago
The main benefits for local are:

1) control
2) privacy
3) transparent cost model

Cloud has tremendous value for speed, plug-and-play, and performance. You need to decide how those compete with the benefits of local, both today and a year from now.

Yukonv•36 minutes ago
What models and quantizations have you been trying? I've had great success with the larger Qwen 3.x models at 6-bit levels. 6-bit quantization is really the bare minimum to give local models a fair shot at agentic flows; once you start pushing below that, the models get "dumber" from the limited bit space.
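A minimal sketch of what picking a quant level looks like with llama-cpp-python; the GGUF filename here is hypothetical, but the quant level is conventionally encoded in the name (Q6_K is roughly 6-bit, Q4_K_M roughly 4-bit):

```python
# A sketch with llama-cpp-python; the GGUF filename is hypothetical.
# Quant level is conventionally encoded in the filename:
# Q6_K ~ 6-bit (recommended above), Q4_K_M ~ 4-bit.
from llama_cpp import Llama

llm = Llama(
    model_path="models/Qwen3-30B-A3B-Q6_K.gguf",  # hypothetical file
    n_ctx=32768,      # agentic flows need a generous context window
    n_gpu_layers=-1,  # offload every layer to GPU/Metal if memory allows
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Add type hints to: def f(x): return x * 2"}],
)
print(out["choices"][0]["message"]["content"])
```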
carbocation•about 1 hour ago
Was the choice of such a small model driven by a desire for high tok/sec? I ask because an m4 pro 48gb machine can run larger models (if model intelligence is the thing that would make it more useful).
sourc3•about 1 hour ago
Yes that was my goal. Also noticed a huge performance gain going from ollama to mlx. Your mileage may vary.
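For anyone curious about the MLX path mentioned above, a minimal mlx-lm sketch; the mlx-community repo name is an assumption (they publish pre-quantized MLX conversions on Hugging Face):

```python
# A minimal mlx-lm sketch; the model repo name is an assumption.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-Coder-7B-Instruct-4bit")  # assumed repo
text = generate(
    model,
    tokenizer,
    prompt="Write a Python function that flattens a nested list.",
    max_tokens=256,
)
print(text)
```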
elij•about 1 hour ago
I'm using the 30B MoE model on the same spec with a 65k-token context as a sub-agent with tooling, and it absolutely writes decent code. The dense 9B, I agree, wasn't great.
hparadiz•about 1 hour ago
How does it (the openrouter version) compare to ChatGPT 5.5 or Claude Opus 4.6?
sourc3•about 1 hour ago
Good enough. It gets 60-70% of the work I need done for a lot less $ (keep in mind I am using these for personal projects that don’t generate revenue). If I was using it with the hopes of making money, I think I would just use Codex at this point.
BoredPositron•about 1 hour ago
Use the small models for small tasks: CLI autocomplete, file sorting, small scripts, config files, setting up tooling, grammar, simple translations. There is so much use in them.
nl•about 1 hour ago
I think it's useful to be realistic about what you can do with a local model, especially something as small as the 9B the author is using. A 9B model is around the level of Sonnet 3.6 - it can do autocomplete and small functions but it loses track trying to understand large problems.

But they are interesting and fun to play with! I do a LOT of work on local agent harnesses etc., mostly for fun.

My current project is a zero install agent: https://gemma-agent-explainer.nicklothian.com/ - Python, SQL and React all run completely in browser. Gemma E4B is recommended for the best experience!

This is under heavy development and needs Chrome for both HTML5 Filesystem API support and LiteRT (although most Chromium-based browsers can be made to work with it).

It's different to most agents because it is zero install: the model runs in the browser using LiteRT/LiteLLM (which gives better performance than Transformers.js), and Filesystem API gives it optional sandbox access to a directory to read from.

It is self documenting - you can ask questions like "How is the system prompt used" in the live help pane and it has access to its own source code.

There's quite a lot there: press "Tour" to see it all.

Will be open source next week.

rtpg•about 1 hour ago
What kinda harness do people use with these local models? I am quite happy with the Claude Code permission model and interface in general for coding stuff (For chat-y interfaces I have no real opinion)
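One common answer: most local servers (llama.cpp's llama-server, LM Studio, Ollama) expose an OpenAI-compatible endpoint, so any harness that accepts a custom base URL can drive a local model. A minimal sketch, with the port and model name as assumptions:

```python
# Pointing the standard openai client at a local OpenAI-compatible server.
# Port and model name are assumptions; 8080 is llama-server's default.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

resp = client.chat.completions.create(
    model="local-model",  # many local servers ignore or loosely match this field
    messages=[
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Explain what this regex does: ^\\d{3}-\\d{4}$"},
    ],
)
print(resp.choices[0].message.content)
```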
nu11ptr•about 1 hour ago
Still trying to understand if a Macbook Pro M5 Max with 128GB is likely going to be able to run coding models well enough that I can cancel my Codex, or even go down to the $20/month plan.
guessmyname•about 1 hour ago
A 128GiB MacBook Pro in Canada is what, north of CAD $11k after tax? That’s around USD $7k. At $20/month for a cloud AI subscription, you’re looking at almost 30 years of service for the same money.

How long do people realistically expect a laptop to stay competitive with SOTA local models? Especially in a space where model sizes, context windows, and inference requirements keep moving every year.

And even if the hardware lasts, the local experience usually doesn’t. A heavily quantized local model running at tolerable speeds on consumer hardware is still nowhere near frontier hosted models in reasoning, coding, multimodal capability, tool use, or reliability.

The economics just don’t make sense to me unless you specifically need offline inference, privacy guarantees, or low latency for a niche workflow. Otherwise you’re tying up $10k upfront to run an approximation of what you can already access through a subscription that continuously improves over time.

You could literally put the difference into index funds and probably cover the subscription indefinitely from the returns alone, even accounting for gradual price increases.
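A quick back-of-the-envelope check of those numbers (the hardware price, subscription cost, and index-fund yield are all assumptions from the comment above):

```python
# Back-of-the-envelope check of the parent comment's numbers.
hardware_cost = 7_000  # USD, approximate after-tax price cited above
subscription = 20      # USD per month

months = hardware_cost / subscription
print(f"{months:.0f} months ≈ {months / 12:.0f} years of subscription")  # 350 months ≈ 29 years

annual_yield = 0.04    # assumed long-run index-fund return
monthly_return = hardware_cost * annual_yield / 12
print(f"≈${monthly_return:.0f}/month in returns vs ${subscription}/month")  # ≈$23 vs $20
```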

nu11ptr•18 minutes ago
You are assuming I'd only get it for that. That would probably just be the straw that broke the camel's back, but I'm already thinking about a purchase even if that doesn't work out.
Yukonv•about 1 hour ago
Have been using Qwen 3.6 27B recently, along with various other models over the last month, and it is capable enough at writing code that I haven't needed a subscription for 95% of what I throw at it. As one example, I've been using it to write extensions for Pi to expand its tool kit without much fuss. Is it as fast or SOTA? No, but you can't ignore how functional it is on hardware you own. Where it can begin to struggle is with overly open-ended prompts or investigating complex technical issues. At that level its knowledge is not deep enough to solve those problems on its own.
canpan•about 2 hours ago
Recent models (Qwen 3.6 and Gemma) can really do coding locally. Feels like SOTA from maybe a year ago? But you would want about 32-40GB of total memory; 24GB is just a bit short of that. A gaming PC with a 16GB graphics card and 32GB of RAM brings you very close to a usable coding system.
ai_fry_ur_brain•about 1 hour ago
"Coding system" "can really do coding locally"

Vibe coders out here thinking all software development is solved by because they made an (ugly and unoriginal) dashboard for their SaaS clone and their single column with 3x3 feature card landing page thats identical to every other vibe coders "startup"

DrBenCarson•about 1 hour ago
How are you using that RAM with the GPU?
canpan•about 1 hour ago
Llama.cpp with automatic offload to main memory. You can also use Ollama, which is easier but slower.
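A sketch of what that partial offload looks like with llama-cpp-python: keep as many layers as fit in the 16GB card's VRAM and let the rest run from system RAM. The model file and layer count are assumptions to tune for your GPU:

```python
# Partial GPU offload with llama-cpp-python; file and layer count are
# assumptions to tune until the model fits in VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="models/Qwen3-30B-A3B-Q4_K_M.gguf",  # hypothetical file
    n_gpu_layers=28,  # layers resident on the GPU; the remainder stays in RAM
    n_ctx=16384,
)

out = llm("Q: Why is partial GPU offload slower than full offload?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```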
NBJack•about 2 hours ago
I'm puzzled. The M4, as far as I know, doesn't have 24GB. Did the author mean an M40?
tra3•about 2 hours ago
There’s definitely an option with 24 gigs of ram: https://support.apple.com/en-ca/121552
sertsa•about 2 hours ago
M4 Mac Mini w/24GB sitting right here on my desk.
spoonyvoid7•about 2 hours ago
M4 = M4 MacBook Pro
teaearlgraycold•about 1 hour ago
Or Air
sbassi•about 2 hours ago
A useful piece of data to know about this setup is how many tokens/sec it generates.
JBorrow•about 2 hours ago
It’s stated in TFA
NDlurker•about 2 hours ago
You can't expect someone to read 4 paragraphs into an article before commenting
kennywinker•about 2 hours ago
@grok is this true?