ES version is available. Content is displayed in original English for accuracy.
Advertisement
Advertisement
⚡ Community Insights
Discussion Sentiment
84% Positive
Analyzed from 3108 words in the discussion.
Trending Topics
#models#model#local#apple#more#core#https#run#still#qwen

Discussion (105 Comments)Read Original on HackerNews
but i maintain https://github.com/Arthur-Ficial/apfel so i might be biased
Here's what you get when you run it... https://gist.github.com/robgough/7893602895e7580117475076198...
but i definitely feel flattered, either my little project inspired them or that I reached the same conclusion at a similar time as a team at apple that "hey, this is totally missing"
chat completion is openai's api surface name.
but only when it is actually available we will see if it's a clean drop-in vs. just "chat-completions-ish".
one of my learnings from apfel is that is is very easy to get a kinda openAI api compatible server, and a lot of work to get it really totally compatible. sometimes i wonder if even the openai implementation of openai's api is openai api compatible to the core....
I have been fairly much pissed off at the "hype in hyperscaler" AI growth (data center environmental and other societal costs) and I support anything we can do to promote local and private AI.
I also really want to hear more about their containerisation/seatbelt strategy now that they are offering MCP support. Not seen any news about Darwin inside their containers system.
(Apfel is a cool project; it’s been the only thing tempting me to upgrade to Tahoe)
Meet Core AI - https://developer.apple.com/videos/play/wwdc2026/324/
Dive into Core AI model authoring and optimization - https://developer.apple.com/videos/play/wwdc2026/325/
Integrate on-device AI models into your app using Core AI - https://developer.apple.com/videos/play/wwdc2026/326/
Does this completely replace the previous API, CoreML? [1]
"If your app uses model types other than neural networks, such as decision trees or tabular feature engineering, see Core ML."
- https://maderix.substack.com/p/inside-the-m4-apple-neural-en...
- https://news.ycombinator.com/item?id=47257931
- Core ML narrows to classic, non-neural ML (its own docs now point you there for "decision trees or tabular feature engineering")
- Core AI takes neural nets and transformers (the new .aimodel format, the new profiler)
- MLX stays the separate bring-your-own-weights track (its WWDC sessions draw no line back to Core AI at all)
coreai-opt is the successor to coremltools on the optimization side.
- Core ML is for models designed only for Apple platforms
- MLX is for models that don't need to be fast
- Core AI is for models that run everywhere already and also need to be fast
MLX is not for end user deployment.
https://developer.apple.com/private-cloud-compute/
Apple keeps MLX (bring-your-own-weights) separate from Foundation Models / Core AI.
The AI future will be clearly... what it will be. Probably bouncing back and forth from local to centralized. However, if there are money to be made by selling things that people run locally, it seems that centralizing creates more power and hence more money.
It doesn't matter how good the model is if it doesn't have context from data sources.
I've been on claude's opus 4.5/6/7 for work for a couple months, and I finally got back to running Qwen A3B 35B... it's incredibly performant and quite capable on semi-reasonable local hardware.
I get ~150 tokens/s on dual nvidia RTX 3090s and can fit the whole 300k context into gpu on a UD-Q4-K-XL quant gguf.
Combined with Pi as a harness, and I'm surprised to find that it feels about as capable as claude did 8 months ago (their 3.x models).
It's not Opus 4.5 levels yet, but it's good enough for a LOT of basic work. I actually downgraded my personal anthropic subscription because Qwen is absolutely fine for implementation work. I still let a better model write a plan, but then I can just switch over to Qwen to implement.
I don't think we're 10 years away from opus 4.5 levels running on cheap consumer hardware. I think we're probably closer to 18 months away, and I suspect it'll be in the 30-60b range, not the 256b range.
PC manufacturers also seem to be betting on local, with a LOT of focus on 64 to 128gb unified RAM machines.
I am a fully-burned-out freelancer (in the last couple of years so severely and totally that I thought I had early onset dementia, and I am still not sure I don't). I don't really have an off-ramp to anything else yet, but the sea-change in the industry has been contributing to my feeling that I should knock it on the head.
I must get past broad understanding of AI to deep understanding, but I have to find a way to do this which sits well with freelancer ethics (sustainability, stability, control of destiny).
So I decided I would start out with that operating principle that ultimately this stuff is just going to be local: models will eventually hit some level of practicality for most tasks and technological progress guarantees that they will eventually run on desktops.
I decided to learn how to run models locally properly, see how far I get with opencode (and Pi and Zed experiments), and grow outwards from there to metered models (opencode go, openrouter etc.)
Knowledge first; what can I do that meaningfully changes my outcomes and confidence with no cost and no exposure to sudden change?
I have a secondhand M1 Max (excellent GPU bandwidth), and I am really shocked to find that arguably that level of practicality is already here.
Qwen 3.6 35B can really do a lot. And — not sure if you have tested it — but in some ways I think the Gemma 4 26B is better. Particularly for more commonplace dev tech — it is very knowledgeable about the sort of low-end web dev stack that is most common (Wordpress, PHP, MySQL).
I have been getting 75 tokens/sec with (GGUF) Gemma-4 26B QAT and MTP. (Can't get anywhere close with MLX, for some reason.)
A similar sort of speed with an MLX Qwen 3.6 35B. I have a sneaking suspicion that maybe llama.cpp is now faster than MLX on this older kit so I might try seeing what llama.cpp can do there, too.
Not blazing fast, but fast enough that there are plenty of experiments and small jobs I can do before I even get to using Big Pickle!
Local is a pipe dream . If you can run it cheap occasionally why commercial companies can’t run it cheaper 24/7 and lower the costs ? The answer is simple. Use cases are more demanding and hence you need more from model not less .
Sure if you task is to do a narrow labeling task on 1m records small optimized model is good . If you want to do complex things , it shifts with models advancements
Qwen 3.5 was released 3/2/2026. It includes models up to a 397B-A17B model
https://huggingface.co/collections/Qwen/qwen35
A day afterwards, a high-up technical leader working on Qwen was let go
https://techcrunch.com/2026/03/03/alibabas-qwen-tech-lead-st...
The more recent Qwen 3.6 was released on 4/16
https://huggingface.co/collections/Qwen/qwen36
This does not include any particularly large models. But the models it contains (Qwen3.6 27B and Qwen3.6 35B-A3B) are the local models people have been very excited about lately. So they didn't release any larger models, and the models people praise so much are from this most recent release.
I think what people didn't realize was, just because the GPT-4.5 model didn't get better on the benchmarks, didn't mean the model wasn't different than the earlier models. It was being compared to thinking models that were being developed at the same time.
The GPT 4.5 model still has some of the most "human" like abilities in communication even though it isn't particularly good a problem solving. It hadn't under gone the same type of reinforcement training.
I still use GPT 4.5 sometimes, in creative exercises it can be surprisingly effective. The model is still available.
I use small models exclusively. They aren't a replacement for large models. You need decent hardware to run those models efficiently, as smaller parameter models plain suck and are still slow on macbooks. And affordability of higher end hardware is very limited.
Even at non VC subsidized $/token prices, its still much cheaper to run cloud based models.
On a price-per-wattage level, this is not true, people have done the math on /r/LocalLLaMA many times over[1]. Local models, while not as good as premier models (GPT 5.5, etc.), are like ~80%+ of the way there, and often converge to a similar solution after a few dead ends.
[1] https://www.reddit.com/r/LocalLLM/comments/1kshq4f/electrici...
Where do these improvement curves go? Does the gap close, do they intersect for practical purposes (factoring in cost etc)? Or is the local curve always just a translation of the hosted, lagging behind, or indeed does hosted just pull ahead?
Nobody knows, but it's a very open question I feel, and it certainly appears like the answer might quite reasonably be that yes they intersect on that kind of short-ish term time horizon.
Nowhere.
Large models haven't seen that much improvement, just small unique tasks performance which is all special cased RLed to game metrics
For local models, its the same story. You can download Gemma 3 QAT from last year, and it will be just as good as Gemma:31b on the average. Qwen also boasts that its better, because again, they RLed it to game some metrics. Its better in coding then Gemma, but Gemma is better in more creative thinking (again, all RL)
Fundamentally, you need detail in the gradients for the models to pick up on the smaller details. If you don't have those, your output is gonna suck. No amount of clever architecture is going to fix this.
The only way to improve local models by training them to fetch context, and then their job becomes much simpler because all they need to do is reinterpret the fetched content and provide an answer. But fundamentally, if you are trying to keep things in house for advertising purposes like what all companies do with search, you want them to go to your service, which means running on your servers. And its not really that much extra per invocation (i.e excluding initial hardware costs) to instead just offer a large model as a service, which will be way better than any small models.
I expect I'll probably keep paying for whatever badass high IQ model is running on inference servers at that point
I haven't seen any sign that the framework fragmentation problem is going away anytime soon. NVIDIA wants everyone to do all training and inference with CUDA and to deny that NPUs have any usefulness. Everybody making an NPU has a different framework tailored to their architecture and the limitations they inherited from hardware designed before LLMs existed, and most of them have a another framework for targeting a GPU. And the OS vendor has one or two frameworks they would prefer you use rather than something hardware-specific.