ES version is available. Content is displayed in original English for accuracy.
Advertisement
Advertisement
⚡ Community Insights
Discussion Sentiment
80% Positive
Analyzed from 10047 words in the discussion.
Trending Topics
#models#run#model#qwen#local#more#ram#don#running#macbook

Discussion (302 Comments)Read Original on HackerNews
BUT DO NOT buy this MacBook if you plan on doing serious coding using local LLMs with it. The reason is simple: your fingers will burn and your head will explode from the noise.
Running any kind of sophisticated job on the very laptop you are using is just not viable. Sure you can use it in clamshell mode, but forget touching it while working with AI coding or agents.
If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk. Connect to it over LAN or Tailscale. The MacMini will also cost you almost 1/3 of the MacBook Pro.
Thank me later.
Still, I don't agree. I think this machine is meant to use local models. You just have to wear pants if you want to keep it directly on your lap. I rarely use it that way anyway. I prefer it plugged into an external display and comfortably sitting on a laptop stand.
After about 1 minute the entire machine basically bricked and I had to hard reset :D
You would have to get a third party reseller/scalper or refurbished mac mini to get 64gb of ram ever since apple stopped selling it.
I can’t figure out when it makes sense to pay 10k up front for a quantized Llama 3.1 but it’s an interesting option
But man, I have never purchased a computer which is more expensive than a decent family car.
Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.
[0]: https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...
I don't know how much serious hands-free agentic coding I will ever do on my MacBook alone, but I do know that I would not have got so far into understanding this without tinkering with local models, llama.cpp, LM Studio, and LM Studio and all that.
I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus.
Until, that is, I could poke around with setting it up on my own (secondhand) machine, watching the API calls, understanding some of the terminology. I didn't even buy the machine for that; it's just adequate to the task.
The Neo is too small to really get much benefit from this opportunity to make it more visceral and knowable.
I don't recall any previous tech stack that was barfed onto the scene with so little background or reference material, going from zero to endless undefined jargon... and no primer in sight.
For people who demand an understanding of their tools, it's a lot of work. I recognize the value of "AI" in performing the tasks I'd have to do manually; for example, keeping the data structures of my front- and back-ends in sync in a project. But do I want to interrupt my development and take weeks off to digest all of these tools?
And if I do, I want to run the show and fully understand it. And like you, I think that's best done locally.
Cloud models still feel ‘magic’, like you send a request off and get something back, like it’s something ‘special’. I used to joke that ChatGPT might be some kind of mechanical turk underneath.
Watching a model run local on your own machine hits different — you realise that yes, it IS just a computer program. Which for me actually makes me appreciate the leap we’ve made MORE, not less. From an information-theoretic point of view, LLMs really are something special.
The fact that they are just programs, that I’ve now experienced first-hand that they’re just programs, makes all those questions around consciousness and intelligence much more interesting.
Cloud models are (much) faster, they don't consume so much power/generate heat, they have much bigger (LLM) context, they're much more precise and they have a much wider (engineering) context of the given problem.
Except privacy and use cases that are blocked by cloud models (e.g. reverse engineering), local LLMs are currently an expensive toy.
When I try to program with a local LLM (I'm on a 32/128 GB system), I end up wasting time compared to a cloud LLM.
And I can't say that I won't switch to openrouter (even just for the same models) at some point.
But one of the things I have found about my own process learning is that some lessons only come to you when you make yourself available to them. And if that means doing things the difficult way, that is what you should do.
The interesting question is whether that gap will narrow, and if so, how much, and on what timescale.
The exact answer to this question is not knowable, but if you are the kind of person who comes to a site called "hacker news", and you think there is a nonzero chance that the answer is that yes, the gap will narrow and this won't always be an expensive toy, then now seems like a pretty great time to get in the game and start exploring the capabilities.
I do realize the cloud is just someone else’s computer right? Power goes in, tokens and heat come out - just in another place
That's never the point of keeping local alternatives though.
Agree having a powerful machine is really worth it in general for professionals, but strong disagree that running local LLMs has anything to do with it. It's hard enough as it is getting a good ROI on your time/money prompting/wrangling with frontier models. IMO leaning on the comparatively limited capabilities of local LLMs is best avoided in favor of keeping your own personal coding skills fresh and continuing to learn new ones.
To me, "how do contemporary AI systems work and interact with contemporary hardware and how can I best take advantage of their capabilities?" is the set of skills that are worth learning at this moment.
What else is there? New / additional programming languages? New / additional database systems? frameworks? orchestrators? cloud provider / infra tooling? architectural patterns?
I dunno, all of this seems really boring and "been there done that" to me at this moment in time!
I needed to do this, this way, in my own time, to put my brain back together. It has worked for me, which is why I recommend it.
YMMV.
I found LM studio to be a nice starting point. Frindlier and more featureful than Ollama and not as intimidating as llama.cpp (though you will want to use that eventually)
I tried Ollama but I've settled on Unsloth Studio generally; once things really settle down I'll just run the llama-server UI, which is pretty nice.
A friend is tinkering with LLMs for amusement on a 16GB Raspberry Pi 5, and when I explained that llama.cpp now had a typical web chat interface he was so happy — it's amazing what the "table stakes" are now.
- opencode with it's webui
- deer-flow with it's research/powered front end
They both run websites so you don't have to baby sit them (eg, keep your mac open). I've build a pdf compressor over a few days by first having deer flow try and research the frameworks and pipeline. It stalls out because its not really a fluid programmer. Once it stalls out, I transferred it (manually for now) to opencode and it's refactoring it because it's just a collective bundle of sticks and it needs a lot of testing to tweak out the limited scop context. LLMs can't really hold large scopes (locally anyway, from what I've read from HN, it's possible with longer context).
It'll complete in a few days with maybe 3-4 hours of full attention interaction, but it's running 3x that without my attention. Obviously, if I paid more attention it'd run quicker, but since it's local, it's not pumping out large volumes of code, it's mostly looping over tests and capabilities as observed.
It's running Qwen3.6 35B MoE on a AMD 128GB strix halo. If I switched to the dense models, perhaps it'd be smarter, but the trade off seems to be much slower gen.
Have you tried Paseo?
I have opencode in a VM, and the paseo daemon running in the VM, and then the Paseo Mac app. Really nice.
(You can also use the Opencode GUI to frame a remote opencode web interface)
I have a pretty deep, maybe paranoid need to be confident I have an intrinsic understanding, and I have found in my life that lessons come to you when you make yourself open to learning.
So I need to build on top of what I know, taking as much of the hard way as I can bear to take at any one time — it has to be not quite difficult enough to put me off.
I can't really explain what I have learned this way that is different, but I feel it in a way that I wouldn't if I'd simply pushed a button.
For the same reason, I have a really basic 3D printer that I've set up myself, set up Klipper, configured how I want it, learned how to calibrate, all that. And now I can say that I feel I have an understanding of 3D printing. I could hold my head above water in a discussion with a real expert, maybe find work in an adjacent field where my insights would keep me grounded.
I can afford a really good printer that has all that set up, and more, has no problems. But I'd just be someone who has a 3D printer.
(Also who am I kidding about the existence of a printer with no problems)
And that's how skills die.
In theory you can also get 48GB of VRAM with, say, two 3090s, but it will take up a lot of space and generate a lot of heat compared to the Macbook Pro and GB10.
[1] https://x.com/MiaAI_lab/status/2070859135399182444
[2] https://github.com/MiaAI-Lab/Qwen3.6-27B-NVFP4-vLLM
So like... $2000+ just for the used GPUs? Plus I assume it's considerably more effort to get it working.
Nah, not really. It is a little annoying in terms of space and power, though. Not every case and motherboard can support cards that big.
The real sweet spot for Qwen 27B is getting it on something like a Dual 3090 system or some other config where it can blaze at 50-80 t/s and that costs well under 6K currently. It is a surprisingly capable model. Using something like GLM for orchestration, specs, task farming and then letting Qwen churn is relatively inexpensive.
Overall I recommend people try models of this class out using OpenCode and some for pay service to experiment with them and understand how they work. I find they are very useful.
Long term, I am convinced enough that if I wanted to use local models for any number of reasons I would be okay investing in a dual GPU box. The Mac is not fast enough for me and M5 Max is just too expensive relative to GPU linux box. Still, it is nice to have the models local ON the laptop and it is useful for what I care about locally.
The limited context is problematic. I’m not exactly sure what it’s got available but hermes was hit and miss on a prospecting job.
It does seem to be doing useful work but it’s not API call level quality
Gemma 4 is the only model series at this parameter scale I've seen correctly answer some of these. One of the answers even made me re-evaluate what I thought the correct answer was, which I did not expect.
When I look at the Artificial Analysis numbers, I can see that some things about Qwen 3.6 look inflated as a result of either metrics that weren't measured yet for Gemma 4 31B, or for metrics that just aren't going to be relevant in a lot of the essential tasks. In a lot of the relevant metrics, Gemma 4 is either better or on par.
Then once it's all quantized all those benchmark results will be hurt, and Gemma 4 QAT has better quantized performance. I think it's more competitive unquantized than people give it credit for and way better quantized than people give it credit for.
Qwen 3.6 clearly isn't legitimately bad and maybe it's quite nice at fp16, but it was a disaster quantized in a 24GB scenario by comparison.
If you want to run unquantized, you definitely need 128GB.
Even that isn't strictly necessary - you can get perfectly acceptable performance by splitting a model between multiple older 12 or 16 GB cards.
Edit: it’s not just “data privacy”, when you are using Claude, you are shipping EVERYTHING to Anthropic. It’s crazy.
If the cost doubles, or 4x, which is seems to need to for them to go profitable, what then?
$5000 in US Treasuries (currently at 4.89%) yields $244.5/yr. That's more than enough to cover the annual Claude Pro subscription ($200/yr) which includes Claude Code with lots of Sonnet usage (far better than Qwen 3.6)
Qwen3.6-27B would be faster on a 3090 that costs around $1000-1200 though so I don't think it's a good counter-argument.
Op just happened to have that MacBook, but it doesn't mean it's necessary to run the model.
https://github.com/noonghunna/qwen36-27b-single-3090
Flies though (50-70tps is impressive for a model this smart)
I went through roughly the same process to get it working on my M2 Macbook Pro... at awful speeds of course, since models like this one are mostly bound by memory bandwidth.
I did find a few useful parameter settings I've already discovered using my single 3090 and ollama.
I'm just remarking that the LLMs overwhelm me with minutiae, especially as I'm working on code design. I frequently ask it to restate concisely, and that helps.
[edited to mention ollama as a nice alt]
The 3090's TPD is 350W, but given that LLM's token generation isn't compute bound, people usually undervolt these cards to reduce power consumption. IIRC you can get as low as 200-250W without any degradation. Caveat these figures are without speculative decoding and at batch size =1.
I use my MBP essentially as my workstation, it's almost always plugged in. I have a MBA (M4, 24GB RAM) that I picked up for ~A$1500 or so, and that's an amazing daily driver. I don't do local LLM inference on that unit, I can just hit my own APIs (via LM Studio) on the MBP over Tailscale.
I think you might be a little to into the stew here.
I haven't tried it with https://lemonade-server.ai/ yet but I just might give it a shot.
In the open model space an insane amount of effort goes into getting more powerful models to run with the same or less RAM. For example in the diffusion world many things that could not be run on easily under 24GB of VRAM actually run much better today with much less VRAM than they did a few years ago. You can do many things today with 8-16GB of VRAM that would not have been possible. At the same time the most advanced open models, like LTX 2.3 for video gen, still seem to respect 24GB of VRAM as the upper bound.
Similarly the standard "big" but localish open model for LLMs back in the day was Llama 3 70B, this was both a much worse and much larger model than Qwen 3.6 27B
So in two different spaces I've witnessed the "RAM required to run the best" decreasing or at least remaining stable, while the performance being achieved in both areas is astounding (LTX 2.3 is faster, better and more capable than the Wan 2.2 model that held popularity before it).
The biggest thing to watch out for is not just RAM/VRAM but memory bandwidth. You can try to "future proof" yourself with lots of RAM, but if it's 400 GB/S you're still constrained to smaller models.
at 128GB, you can find almost it's entire context for Qwen3.6 35B MoE.
Again, I think you have too much faith in extrapolation. It's like you got a baby at 0 months, then measured it at 12 months and expect it to be a giant.
... but, the models that WILL run on 128GB (or 64GB or even 32GB) models today are a huge improvement on the best models that would run in the same amount of memory six months ago.
i've watched friends try that route; i've been through this before. taking a downgrade is never fun: if it's a thing you're likely to care about in the future, then sometimes it's better to place yourself in the right ecosystem early.
From what I understand, for a developer, $5000/month is maybe the high end, but $5000/year is fairly standard. (Is that accurate?) So if it pays back in 15 months, that's pretty decent. If it pays back in two months, that's spectacular.
Disclaimer: There's a 35% sale from Alibaba right now. And I'm not accounting for input tokens going faster than output tokens.
Do you know how much VRAM/unified is needed for the 27B model, which is generally regarded as better between the two compared in the article, is needed with little to no KLD loss and at 256k context?
Also, once you worked out how much memory is needed for that, maybe tell us how much a non-Apple system that you can run that (probably similarly or faster) would cost?
And when you have answered that, can you tell us how much privacy costs? Maybe also tell us how private OpenRouter is?
Edit: looking at other replies that are basically pointing out the same thing I did, I guess it's my wording. It's frustrating that people who misinform others in some nicely packaged ways or just simply uninformed get to keep doing that if they sound nice. Thanks.
Ryzen AI Max 395+ with 128GB of unified memory can be found around $3-4k.
But 27B isn't that large, either, especially if you are ok with the quantized models. So this laptop choice seems to more be a "because they had it" rather than "this is what's necessary for this particular workflow"
The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.
Not really germane to your comment but I hope I don’t sound old when I say I remember a time when spinning up a PoC was a week of work, and a statement like yours was pure science fiction.
If I start prompting away the core of a new project I lose interest in the entire thing almost straight away. I hate it. The next day I could care less about it. In fact it just makes me lazy, like a fat person who drives everywhere.
I love typing code and thinking for myself. Im going to continue to do that. I still dont know anyone who's shipped anything truly useful with this garbage tech, let alone with a local 30b param model. So much cope in these comments.
Spending 6k on hardware to run the worlds most mediocre model truly does make you an incredibly stupid person, so Im not really suprised by these comments of people saying these tiny models are helping them so much.
Its like a special needs kid all of sudden got the ability to code, of course they'd be impressed by basically all the code it produces.
I’ve used Qwen 3.6 27B for many things at work, and I’m regularly able use it for reasonably scoped tasks.
I’m not saying these models are perfect.
But you are complaining about people on the extreme, while at the same shouting from the opposite extreme.
This is an underrated consideration when evaluating the small models: The further you deviate from standard example code, the more their weaknesses show.
My experience is that Qwen3.6 produced some amazing results for a small model when I tried it with simple apps that are widely reproduced everywhere. If you want a React TODO app or to set up a little boilerplate app with shadcn and other popular tools, it will produce something that looks not too bad.
Then when I started straying outside of common tasks and into some of my more niche work, it would spin for hours and go in circles before finally producing some groan-inducing output that wasn't usable.
If you're looking for a model to help with simple refactoring or small tasks where you provide very explicit instructions for exactly what you want, but you don't want to do all of the typing yourself, they can do a lot of good work, though. But you're right that once you get into long context sessions involving topics off the beaten path, the weaknesses are very apparent.
The quantizations that are popular for making these models fit on smaller hardware make the problems worse. When you read it about online there is almost a consensus that 4-bit quants are lossless and that you can use q8_0/q8_0 kv cache quantization without any real loss, but in my experience with real projects there's a substantial degradation in long context performance with any of these quants.
Never go below an fp16 kv cache unless you've already tested it in advance with your model on a verified task that you know it can successfully complete. People should also test the difference using the exact same seed value so they can see how the tokens diverge. If you have memory constraints, sometimes you can still use an fp16 kv cache and use storage for an agentic buffer to work your task with mixed abstractions rather than having everything in memory.
For 4-bit weight quants, Gemma 4 31B QAT is where people should be looking instead of Qwen 3.6.
Modifying existing code is way easier if you don't expect it to be smart about it. Don't say "add X feature" and let it explore the codebase and build its own understanding. Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines". Now you've done the hardest part of making the decisions and it just has to follow instructions while coloring within the lines.
Is that not how you would work with any model, local or not? I wouldn't trust it to make the right decisions unattended. I just know the moment I look away it's going to do something utterly braindead.
https://github.com/verdverm/pge-jax
1. Maybe you should tell us what those limited experiments are.
2. Maybe you should actually try 3.6 because it's huge difference in most cases. Don't forget to tell us quants and don't forget to tell us scope.
3. Maybe actually show us data compared to frontier models instead of this... vibe comment. Pretty tired of this kind of comments on HN that doesn't require logic or evidence. Just vibes. Like the pelican riding a bicycle crap that everyone has taken for granted but has no objective way of assessing goodness.
I ran those throu opus saking if it was good advice and was not impressed:
I read the actual qr_scanner.ino. Short answer: partially, but I'd push back on most of it. That review reads like generic ESP boilerplate advice written against an imagined version of your code — several of its "fixes" are already in your file, and its headline "critical" claim misreads what the code does. Going point by point:...
(I'm aware the price is, in absolute terms, more expensive where I live compared to the USA. That reinforces what I think, because anyone sane that would've bought one of those in another country would sell them as soon as they landed here and save that money.)
Imagine its value if war broke out over Taiwan / Greater China, or really any of the dark scenarios with global connectivity or the truthiness of commercially available models. It is a very, very difficult piece of equipment to make at any other moment in history. I wish I could have purchased more. I saw the signs and price trends and out of stocks as they unfolded. No doubt others with the means are stockpiling.
You get fewer tokens per second, but at some point the balance between quality and quantity makes the large model size worth the spend.
When you're spending this kind of money, you may as well treat yourself to a pretty screen and some decent speakers. Nothing the competition doesn't offer these days, but you get them for free with the car-priced RAM upgrade so why go for less.
Yes. Back in the my days at $faang in europe it was not uncommon to hear people getting 120-160 k€/year in compensation and we were “poor” compared to us engineers at the same faang (4-500 k$/year total compensation) with a bit of seniority…
Yes. Your people earn an order of magnitude less income than Americans.
On M5 128GB one can make use of the ram and use sparse MoE. For example, DeepSeek-V4-Flash will fit, served by DwarfStar (https://github.com/antirez/ds4). One will probably improve 2x the token/sec speed, given DS4F 13B activated params in the MoE are ~1/2 of the ~27B of the dense Qwen.
27B Of the Qwen fit even on a cheaper 24GB card, e.g. amd 7900xtx (<$1K?) or slightly dearer nvidia 3090 (with cuda). With ~900 GB/s bandwidth they will likely be ~50% faster than the M5 with 600 GB/s.
I just got a B70 with 32GB RAM for the equivalent of $1200 (incl. sales tax and import duties to my non-US location, so presumably it could be cheaper elsewhere). The memory bandwidth is 608 GB/s. For M5 Max (32-core GPU) it's 460 GB/s and for M5 Max (40-core GPU) it's 614 GB/s. A 3090 is still faster at ~900 GB/s but you're getting 32GB VRAM for a lot less than equivalent Nvidia cards. It's about 1/3 the bandwidth of a 5090 for 1/3 the cost, but with the same 32GB VRAM. If you're interested in being able to run bigger quants with some context and stay on a lower budget then it's an appealing trade off.
I'm still exploring using these local models so don't want to spend the equivalent of $5 000 - $10 000 just to test it out. I don't mind slightly slower perf to do some experimentation more affordably.
I actually got an B50 16GB (with meager 70w TDP!) first to test an Intel card with my stack - it worked easily with Ubuntu & Vulkan. I'd read a lot about hassles and people writing them off as unusable but it seems like these are often with SYCL which doesn't even seem to outperform vulkan and so why bother? (The B50 was just $370 inclusive tax and duties). Literally `apt install` the vulkan libraries and it worked with default xe driver in 26.04 and the vulkan build of llama.cpp. The SR-IOV PF/VF also just works with qemu/kvm, no tricks required. Since I got it fwupdmgr has updated the firmware twice so Intel is presumably actually trying to support these products.
QAT, MTP, 128k context.
I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.
It does about 30 tok/s which is enough for me. It's about half what the online models do, but it's enough.
I've heard their 9B models are also good, but they aren't much faster if you have the ram and a nice cpu.
These qwen3.6 models are the first ones I find can do much. GPT OSS was good, and Gemma4 is better. Gemma knows more facts, but qwen3.6 is smarter.
Local development for who? How many of y'all are rocking 128GB of memory? Am I reading Apple's site correctly that it's a $10,000 laptop?
I’m not having it build whole features from scratch, though. I give it pretty explicit instructions closer to the class or function level, and it still saves me an immense amount of time, while I’m very connected to the code that’s written.
Definitely the sweet spot for me.
For 24GB VRAM cards (e.g. 4090) you can use Q6_K (22.5GB) or Q5_K_M (19.5GB) quants, possibly offloading some of the weights to RAM.
Unsloth recommends 18GB of RAM for Qwen3.6-27B (for their version of the model).
https://unsloth.ai/docs/models/qwen3.6
Sent from my 8gb M2 Mac mini.
Progress marches without mercy.
https://github.com/ikawrakow/ik_llama.cpp
Edit: it's gonna be slow if you're not using any VRAM. But it's possible. Software isn't going to speed that up anytime soon, it's just a hardware bandwidth limit.
https://unsloth.ai/docs/models/qwen3.6
Qwen 3.6 27B will run in full offload with a 4-bit quantisation in 64GB on an M1 Max. It is quite slow.
I don't know about 48GB but 64GB should be enough.
It got rather tangled up when I tried it with one of my coding tests, which is a simple wordpress plugin, but I frustrate the model by asking it to write code for older PHP, break WP coding conventions and use a rather bespoke method for arranging code in objects. So it is sort of a hybrid of a green field and brown field task; a bit muddy.
It did not do as well as Qwen 3.6 35B, but the way it worked through its thoughts was interesting.
TBH I struggled to understand what DeepReinforce are doing that is materially different; the explanation of their training technique goes over my head at this point.
So for example I'd favour a used M1 Max over a used M2 Pro, at least based on my naïve understanding. Not quite sure where the balance changes.
There appear to be some hardware improvements with the M3 and up regarding the Apple Neural Engine which I'd hope would show up in MLX performance; I remember seeing some optimisations in image generation models that are only possible on later hardware.
The GPU cores are progressively better I believe, but the memory bandwidth is lower. Though perhaps the M4 can get closer to actually saturating said bandwidth.
(And I must reiterate that my understanding of this stuff is pretty naïve.)
Jackrong has a few different ones available depending on what you're trying to do: https://huggingface.co/Jackrong
For anything else local, including writing some automation scripts and such, it works great.
Was just trying to see how small I could go and get acceptable results, but yeah, larger Qwen 3.6 with MTP is going to be better. Cant wait to see how AI model (unsloth/local-llm/heretic/reaper/etc communities) are tweaking/engineering quality down into smaller models. Lots of new things coming out.
I do not have a crazy rig, a modest gaming one at that, but in trying to understand more about agents and their capabilities, I am SOL with my 16 GB of RAM and 8GB of VRAM. I can get most small, non tool calling models to perform well, but I've had major issues with anything over 9B doing anything more than reasoning (egregiously slow at higher parameter counts).
And so far, I cant get even Pi to extend itself or do any meaningful work with any of the models I currently can get to run.
You should look at gemma-4-26B-A4B. 16+8=24gb and Q4 is about 16GB. Not much context left, but might run.
For you, you could try gemma-4-26B-A4B
I very much appreciate the frank response, as it makes me feel less defeated at knowing my understanding of how it should work is not the full issue, hahaha
But certainly seems like we are a few years away from that, sadly.
Am I also screwed in being able to train my own small model or adjust another one with such a non-workhorse PC?
>
> --jinja for tool calling support
Pretty sure this flag hasn't done anything for a while. It's enabled by default since ~November of last year
How does that work? They have negative GPUs now!
However, text-to-speech, speech-to-text, and non-code LLM use cases are so useful to have local, and don't require big hardware.
Having a universal reliable inference engine interface, I think, is the big unlock that needs to happen before app devs can ship these features.
Personal concrete use case: meeting recording app. This uses Parakeet + Qwen to create local transcriptions and post-cleanup, respectively.
Right now this app has to download and manage all these models, then bundle an inference engine to run them. It's a lot of code that probably should belong to the OS, or at least a standard interface.
While apps can offload some of this to llama.cpp or a similar process over http, that's another set of setup for the user to do before they can have a useful app.
Anyway, if you're getting started on a Mac, I'd suggest trying out oMLX (https://github.com/jundot/omlx) before messing with llama.cpp. In particular they have community benchmarks so you can see what kind of performance you're likely to get: https://omlx.ai/benchmarks. I wished each one had more configuration details though.
Certainly this is falsifiable easily by any of us doing it on a regular basis
> Qwen stuck in thought loops
This does happen when context is not managed effectively; creating plans, using subagents and compactions strategically resolves this
> creating plans, using subagents and compactions
Yes, these are all things that Claude Code does for you. However, for the thought loop issue, these are not the fixes. The canonical fix is to limit the number of thought tokens (llama.cpp's `--reasoning-budget`) or try to mess with the various penalty parameters. In any case, it's not a solved problem as far as I can tell.
don't get me wrong, the frontier models are leaps and bounds ahead of what qwen/kimikgemma are doing - but i don't need to drive a ferrari to the grocery store everytime either.
https://pi-local-coding-bench.dev
I haven't seen anyone make an argument they are as good as SotA (OpenAI, Anthropic). It's just they are approaching state where they are "as good" for some _limited_ set of use cases. Which will allow us to resolve 2 primary issues with these SotA models: privacy and vendor lock-in. Plus, they're very useful for education purposes, you get to explore what things looks like under the hood, play with various models, tools, maybe put something simple together yourself.
You get Macbook - great. You got gaming rig with a decent GPU - great (set it up as a dedicated server that you connect to through simple REST).
What exactly is wrong with any of that?
Personally I prefer the 35B MoE model, which is fast enough to be interactively useful, and capable, but I would probably use the 27B if I wanted to generate whole applications like that.
I am unconvinced that most "local" AI applications need anything much more powerful than the Gemma 4 12B model. Local agentic coding is a small niche, but there are plenty of ways a local model can help with development tasks.
I would really like to see a 12B or 16B Qwen 3.6.
I am currently playing with Ornith 1.0 in the MoE configuration, which is based on the 35B variant of Qwen 3.5; I am not sure if it is better than the 3.6 version.
Benchmarks say it is; my own silly tests either suggest otherwise or suggest that I have to talk to it a bit differently.
I really want to have a model that i can run locally on my 24gb m4 pro mbp for when i don't have internet to connect to my 3090 running the qwen, and i love how gemma 4 models 'feel', but i can't make them be competent. I am in the middle of finetuning both qwen3.5 9B and gemma 4 12B just to try and make those bridge closer to 27B for coding/agentic tasks (and am trying to ternarize and DQT 27B so that it fits in ~9gb pre-KV).
How do you run the gemma? What do you use it for (and in what harness), maybe llama.cpp and pi-mono just aren't for this model and that's what i'm doing wrong.
I am still mostly tinkering/learning rather than spilling out code, and I feel quite slow on it. So it doesn't matter too much to me if it is really slow. More the journey than the destination if that makes sense. I'm stubborn.
I have tried the Gemma 4 12B model (Unsloth's QAT version) with search/browse tools in LM Studio and Unsloth Studio, when I am trying to understand a new thing.
Basically I get it to write introductory starter documentation for me to absorb, because my big personal problem, these days, is focussing enough to start a project and then digging in; I need the help.
I have found its limits on obscure packages (that it sometimes makes up) but before that it's a bit like stumbling on a blog post that happens to be really right for your particular need. Good enough to work through.
It's stuff I could ask Perplexity to do, or ChatGPT, to be fair, I just like LM Studio for this and have the inquisitiveness to want to run it locally.
In your case: I don't believe it's the quant. I'm sure it's the model — it has good coding knowledge but it's clearly not specialised. It might be good enough at writing Python/PHP/JavaScript at a novice level. It is also quite good on WordPress tooling and functions.
But I wouldn't bother with it for agentic coding if you've got experience elsewhere. Might be interesting to see what you can do with the 9B Ornith model?
Qwen 3.6 MoE in its Unsloth version is another matter. Impressive and I am trying to find ways to support my old brain doing what I've done before.
I've seen sites here and there but they feel like quick little toys that don't get updated, so they always suggest old models.
Is there any way to use MLX and GPU at the same time? Or does memory become a big problem?
TBH, I never understood Apple hyping these neural cores because I didn't think anyone actually uses them except maybe certain photo/video editing software.
If I can generate voice at the same time as video, that would be useful.
The neural cores aren't suitable for LLMs/transformers and isn't used in LLM inference. On the M5 and later chips, it comes with neural accelerators, aka Tensor Cores, which speed up the 'prefill' (i.e. processing your context window) part, but don't do anything for inference.
The MLX vs GGUF debate is mostly irrelevant. The GGUF pathways are optimised for apple silicon to the extent of practically identical performance to MLX. MLX is just one way of using Apple GPUs, it comes with many optimisations in the box, but they're not hard and they're no longer MLX-exclusive.
tweaking sampler might help
I'm running the NVFP4 alongside Gemma4 at the same quant on an OEM Spark
also i like that if i drop more sophisticated tools into my harness (e.g. any of the NLP/RAG-based search tools in place of grep/rg), the agent will actually reach for them and make progress faster; previous models have been reluctant to embrace new tools.
Lora if effective could be a great reason to run local models.
Qwen on the other hand got straight to work with astonishing competency on the same system.
From what I read llama3 needs beefier compute to reliably invoke tools, which I presume relates to it focussing more on simulating AGI rather than being a useful tool.
https://arena.ai/leaderboard/code/webdev/pareto?license=open...
https://arena.ai/leaderboard/text/pareto?license=open-source
200k @ K : Q5_0 V: 4_1 (which is a bit of a sweet spot)