Discussion (49 Comments)

kgeist about 1 hour ago
Heh, I made something very similar for the Qwen3 models a while back. It only runs Qwen3, supports only some quants, loads from GGUF, and has inference optimized by Claude (in a loop). The whole thing is compact (just a couple of files) and easy to reason about. I made it for my students so they could tinker with it and learn (add different decoding strategies, add abliteration, etc.). Popular frameworks are large, complex, and harder to hack on, while educational projects usually focus on something outdated like GPT-2.

Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and getting harder to come by every day. If you remove enough abstractions and code directly against the exact hardware/model, you can probably optimize things quite a lot (I hope). Maybe run an agent that tries to optimize inference in a loop (like autoresearch), empirically testing speed/quality.

The only problem with this is that once a model becomes outdated, you have to do it all again from scratch.
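A minimal sketch of the optimize-in-a-loop idea described above: try candidate engine configurations against a fixed benchmark and keep the fastest one that clears a quality floor. Everything here (the search space, run_benchmark, the quality threshold) is hypothetical, not taken from the project.

```python
# Hypothetical autotuning loop for a fixed GPU+model pair: benchmark each
# candidate configuration and keep the fastest one above a quality floor.
import itertools
import random

def run_benchmark(config):
    """Placeholder: a real harness would launch the engine with `config`,
    run a fixed prompt set, and return (tokens_per_sec, accuracy)."""
    random.seed(hash(tuple(sorted(config.items()))))
    return random.uniform(20, 60), random.uniform(0.80, 0.95)

search_space = {
    "block_size": [64, 128, 256],         # matmul tile size
    "kv_cache_dtype": ["f16", "q8"],      # whether to quantize the KV cache
    "speculative_draft": [None, "0.5b"],  # optional small draft model
}

best = None
for values in itertools.product(*search_space.values()):
    config = dict(zip(search_space, values))
    tps, accuracy = run_benchmark(config)
    if accuracy >= 0.85 and (best is None or tps > best[0]):
        best = (tps, config)

print("best config:", best)
```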

xtracto 10 minutes ago
This takes me back to the famous high-performance FizzBuzz code-golf answer [1]. If we could implement optimizations like that for inference, maybe we could increase speeds 10x or more.

[1] https://codegolf.stackexchange.com/questions/215216/high-thr...

mirsadm 21 minutes ago
I've built something like this. One issue is that LLMs are actually terrible at writing good shaders. I've spent way too much time trying to get them not to be so awful at it.
joshmarlow about 1 hour ago
Another suggestion for optimizing local inference: the Hermes team talks a lot on X about how much better results are when you use custom parsers tuned to the nuances of each model. Some models like to emit a trailing `,` in JSON output, some don't, so if your parser can handle the quirks of the specific model, you get more reliable function calling.
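As a toy illustration of the kind of model-specific parser tolerance being described here, a sketch that strips the trailing-comma quirk before handing the output to a strict JSON parser (the example payload is made up):

```python
import json
import re

def parse_model_json(text: str) -> dict:
    """Accept JSON with trailing commas before `}` or `]`, a quirk some
    models emit; note the regex is naive and would also touch commas
    inside string values, so a real parser should be more careful."""
    cleaned = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(cleaned)

# Made-up tool-call output with the trailing-comma quirk:
print(parse_model_json('{"tool": "search", "args": ["rust", ], }'))
```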
brcmthrowaway 2 minutes ago
How does this compare with oMLX?
antirez about 2 hours ago
A random, funny, interesting and telling data point: my MacBook M3 Max, while DS4 is generating tokens at full speed, peaks at 50W of power draw...
losvedir 4 minutes ago
It's so interesting to think about how much power it takes these machines to "think". I think I had a vague notion that it was "a lot" but it's good to put a number on it.

If DS4 Flash peaks at 50W and is 280B parameters, does that mean DS4 Pro at 1.6T parameters would likely be 300W or so? And the latest GPT 5 and Opus which feel maybe comparable-ish around 500W? Is it fair to say that when I'm using Claude Code and it's "autofellating" or whatever I'm burning 500W in a datacenter somewhere during that time?
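For what it's worth, naive linear scaling from the numbers in this comment lands right around that guess; a back-of-envelope sketch (it ignores that these are MoE models where only a fraction of parameters are active per token, so treat it loosely):

```python
# Back-of-envelope: scale power linearly with parameter count.
# 50 W and 280B are the figures quoted upthread; MoE routing and memory
# bandwidth mean real numbers could differ a lot.
flash_watts = 50
flash_params = 280e9
pro_params = 1.6e12

print(flash_watts * pro_params / flash_params)  # ~285.7 W, i.e. "300W or so"
```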

minimaxir about 2 hours ago
"Data centers for LLMs are technically more energy efficient per-user than self-hosting LLM models due to economies-of-scale" is a data point the internet isn't ready for.
cortesoft 21 minutes ago
I thought this was a pretty generally accepted fact?
Onavo about 2 hours ago
There's a bunch of companies doing garage GPU datacenters now. Probably can act as a heat source during winter too if you have a heat pump.
Lalabadie about 1 hour ago
Using only this dimension in a vacuum, it sounds like an easy choice, but we're extremely early in this market, and the big providers are already a mess of pricing choices, pricing changes, and sudden quota adjustments for consumers.

Plus, a Mac that's not running inference idles down to 1-5W, only drawing power when it needs to. Datacenters must maximize usage, individuals and their devices don't have to.

A Mac is also the rest of the personal computer!

j_maffe 20 minutes ago
But it's simply an economic fact that EoS will be more efficient with a task that's so easy to offload somewhere else.
jwr 40 minutes ago
Not everybody might realize this, but this is a truly excellent and very impressive result. Most models on my M4 Max run at around 150W of power consumption.
bertili about 2 hours ago
equals 2 or 3 human brains in power usage. Amazing work!
antirez about 2 hours ago
True quantitatively, not qualitatively. DeepSeek V4 is not capable of doing what a human brain can do, of course, but the tasks it can do, it does at a speed that is completely impossible for a human, so comparing the two requires some normalization for speed.
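One rough way to do that normalization, using the 50 W figure above and the ~30 tokens/s decode speed mentioned elsewhere in the thread (both are ballpark numbers, not measurements):

```python
# Energy per generated token at the ballpark figures from the thread.
watts = 50            # peak power while generating (antirez's figure)
tokens_per_sec = 30   # decode speed quoted elsewhere in the discussion

print(watts / tokens_per_sec, "J per token")  # ~1.7 J per generated token
# For comparison, a ~20 W brain producing a couple of words per second
# spends roughly 10 J per word, so per unit of output the gap narrows.
```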
scotty79 about 1 hour ago
I'm sure human brain, at least my present brain, is incapable of many things DeepSeek V4 can do. Qualitatively.
Hamuko about 2 hours ago
I think I've seen about 60 watts of total system power whenever I've used a local model on a MacBook Pro or a Mac Studio. Baseline for the Mac Studio is like 10 W and like 6 W for the MacBook Pro.
maherbeg about 4 hours ago
This is so sick. I'm really curious to see what focused effort on optimizing a single open-source model can look like over many months. Not only on the inference-serving side, but also on the harness-optimization side, and in building custom workflows to narrow the gap between what frontier models can infer and deduce and what open-source models natively lack due to size, training, etc.
dakolli about 3 hours ago
There will always be a huge gap between frontier models and open-source models (unless you're very rich). This whole industry makes no sense; everyone is ignoring the unit economics. It costs $20k a month to run Kimi 2.6 at decent tok/s, and to sell those tokens at a profit you'd need your hardware costs to be less than $1k a month.

Everyone who's betting their competency on the generosity of billionaires selling tokens at 1/10th-1/20th of cost, or on a delusional future where capable OS models fit on consumer-grade hardware, is actually cooked.

bensyverson about 2 hours ago
If you look at a graph of GPU power in consumer hardware and model capability per billion parameters over time, it seems inevitable that in the next few years a "good enough" model will run on entry-level hardware.

Of course there will always be larger flagship models, but if you can count on decent on-device inference, it materially changes what you can build.

physicsguy about 2 hours ago
It also massively changes the value economics of the frontier models. In a lot of cases, you really don't need a general-purpose intelligence model anyway.
dakolli about 2 hours ago
No offense, this is a crazy delusional statement.
liuliu about 2 hours ago
I am not sure where this comment is coming from (possibly without looking at this project?). This project is running a quasi-frontier model at reasonable tps (~30) with reasonable prefill performance (~500 tps) on a high-end laptop. People are simply projecting what they see in this project onto what you can optimistically expect.

You can argue whether the projection is too optimistic or not, but this project definitely made me a little bit optimistic on that end.

amunozo about 2 hours ago
Most tasks do not require frontier models, so as long as these models cover 95-99 per cent of the tasks, closed frontier models can be left for niche and specialized cases that are harder.
dakolli 44 minutes ago
Frontier models can hardly do the tasks I want them to; I simply cannot buy into this notion.
otabdeveloper4 about 2 hours ago
> a delusional future where capable OS models fit on consumer grade hardware

48 GB is enough for a capable LLM.

Doing that on consumer grade hardware is entirely possible. The bottleneck is CUDA and other intellectual property moats.
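A rough sense of why 48 GB can be enough: weight footprint is roughly parameter count times bits per weight, divided by 8, plus KV cache and runtime overhead. A quick sketch (the model sizes are just illustrative):

```python
# Approximate weight footprint in GB: params (in billions) * bits_per_weight / 8.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

print(weight_gb(70, 4))   # ~35 GB: a 4-bit 70B model leaves room for KV cache
print(weight_gb(32, 8))   # ~32 GB: an 8-bit 32B model also fits in 48 GB
```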

amunozo about 2 hours ago
I am curious about it producing fewer tokens except in max mode. I love DeepSeek V4 Flash and I use it extensively; it's so cheap I can use it all day and still not use up my $10 OpenCode Go subscription. I always use it in max mode because of this, but now I wonder whether I should use high instead.
PhilippGille about 1 hour ago
On max it uses more than twice as many tokens as on high when running the ArtificialAnalysis benchmark suite, and there it's indeed the model with the highest token usage (among the current top-tier models). See the "Intelligence vs. Token Use" chart here:

https://artificialanalysis.ai/models?models=gpt-5-5%2Cgpt-5-...

amunozo 36 minutes ago
Wow, the difference is quite considerable and the gain in intelligence is not that much. I might try to use high and just iterate more often. I am working with hobby stuff so I don't have to worry whether it breaks things or not.
unshavedyak about 2 hours ago
What do you use it for? I tend to just stick to SOTA (Claude 4.7 Max thinking) and put up with the slow request/response. I'm not sure what type of work I'd trust to a less capable thinking model, as my intuition is built around what Claude vSOTA Max can handle.

Nonetheless, eventually I want to build an at-home system. I imagine some smaller local model could handle metadata assignment quite well.

edit: Though TIL Mac Studio doesn't offer 512GB anymore... DRAM shortage lol. Rough.

amunozo 40 minutes ago
I am experimenting with some game development and the Beamer slides for my thesis. I have a $20 Codex account, and I use GPT-5.5 for planning and DeepSeek for executing in OpenCode. This makes my Codex 5-hour tokens last more than 10 minutes.
syntaxing about 2 hours ago
How has opencode go been for you? Worth changing over from Claude pro?
amunozo 38 minutes ago
Given the price, I'm extremely satisfied, especially thanks to DeepSeek V4 Flash, which makes it last forever. I use it on top of my $20 Codex, which is great, but its tokens run out in no time.
DefineOutside about 1 hour ago
I've found that opencode and codex are the two subscriptions that still seem to subsidize usage. DeepSeek V4 has been the most powerful model in opencode IMO; I trust it with problems where I can validate the solution, such as debugging an issue, but I only trust the proprietary GPT-5.5 and Claude Opus 4.7 models for writing code that matters.
visarga about 1 hour ago
Large LLMs on a MacBook produce tokens at an acceptable speed, but the problem is reading context. Not incremental reading as in a chat session, since that reuses the KV cache, but bulk reading, like when you paste a big file. It can take minutes.
antirez 37 minutes ago
DS4 can process 460 prompt tokens per second. Not stellar, but not that slow. On an M3 Max. See the benchmarks in the README.
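At that prefill rate, the "it can take minutes" complaint above is easy to quantify; a quick sketch with a few illustrative prompt sizes:

```python
# Prefill latency at ~460 prompt tokens/s (the figure quoted above).
prefill_tps = 460
for prompt_tokens in (8_000, 32_000, 100_000):
    minutes = prompt_tokens / prefill_tps / 60
    print(f"{prompt_tokens} tokens -> {minutes:.1f} min")
# 8k ≈ 0.3 min, 32k ≈ 1.2 min, 100k ≈ 3.6 min of prompt processing
```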
bel8 about 1 hour ago
And unless I'm mistaken, the repo is about running it with 2-bit quantization.

This is probably far from the raw intelligence provided by cloud providers.

Still, this shines more light on local LLMs for agentic workflows.

sourcecodeplz about 1 hour ago
Great project!

This is also a fine example of a vibe-coded project with purpose, as you acknowledged.

nazgulsenpai about 1 hour ago
I keep seeing DS4, and in order, my brain interprets it as Dark Souls 4 (sadface), DualShock 4, Deep Seek 4.
happyPersonR about 2 hours ago
So just gonna ask a question, probably will get downvoted

I know this is flash, but….

But other than this guy, did our whole society seriously never flamegraph this stuff before we started requesting nuclear reactors colocated at data centers and, like, more than 10% of GDP?

Someone needs to answer, because this isn't even an M4 or M5… WHAT THE FUCK

Sidenote: shout out to antirez, love my Redis :)

AlotOfReading about 2 hours ago
This is built atop a tower of stuff people built with profiling and performance-oriented design.

That said, I've found that most corporate environments are unintentionally hostile to this kind of optimization work. It's hard to justify until the work is already done. That means you often need people with the skills, means, and motivation to do this who are outside normal corporate constraints. There aren't many of those.

happyPersonR about 1 hour ago
Building this into agentic dev workflows (subject to token/time constraints) is something I spent a lot of time doing at work. I actually am kind of proud of that hahah

But you’re right I agree

In the corporate world they sadly don’t take kindly to performance profiling as a first class citizen

Granted I will say optimization without requirements may not be beneficial but at least profiling itself seems worthy if you have use cases.

A lot of us have been working in the network packet-pushing software, distributed systems, and distributed storage space.

I’m happy to see more stuff like this :)

TL;DR: I've not seen a lot of flamegraphs of LLMs end to end… idk if anyone else has?

wmf about 1 hour ago
Every lab has a bunch of people doing nothing but optimizing.
liuliu about 2 hours ago
DSv4 generates much faster on NVIDIA class hardware. It is just a very efficient model.
fgfarben about 2 hours ago
The world is not China.