
Discussion (49 Comments)
Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and harder to get by the day. If you remove enough abstractions and code directly against the exact hardware/model, you can probably optimize things quite a lot (I hope). Maybe run an agent that tries to optimize inference in a loop (like autoresearch), empirically testing speed and quality.
The only problem with this is that once a model becomes outdated, you have to redo it all from scratch.
[1] https://codegolf.stackexchange.com/questions/215216/high-thr...
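A minimal sketch of what that optimize-in-a-loop agent might look like. Everything here is hypothetical scaffolding: propose_variant, benchmark, and eval_quality are stand-ins for a real mutation strategy, a tokens/sec measurement on the actual target GPU, and a small quality eval, not any existing library's API.

```python
import random

# Toy search space for one fixed GPU+model pair. A real loop would
# mutate kernel tile sizes, quantization, KV-cache layout, etc.
def propose_variant(config: dict) -> dict:
    variant = dict(config)
    variant["tile_size"] = random.choice([32, 64, 128, 256])
    return variant

def benchmark(config: dict) -> float:
    # Stand-in for tokens/sec measured on the actual target hardware.
    return 100.0 + config["tile_size"] * random.uniform(0.05, 0.10)

def eval_quality(config: dict) -> float:
    # Stand-in for a score on a small eval set, guarding regressions.
    return 1.0

best = {"tile_size": 64}
best_speed = benchmark(best)
quality_floor = eval_quality(best)

for _ in range(100):  # the autoresearch-style loop
    candidate = propose_variant(best)
    if eval_quality(candidate) < quality_floor:
        continue  # reject anything that degrades quality
    speed = benchmark(candidate)
    if speed > best_speed:
        best, best_speed = candidate, speed

print(best, f"{best_speed:.1f} tok/s (simulated)")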
If DS4 Flash peaks at 50W and is 280B parameters, does that mean DS4 Pro at 1.6T parameters would likely be 300W or so? And the latest GPT-5 and Opus, which feel maybe comparable, around 500W? Is it fair to say that when I'm using Claude Code and it's "autofellating" or whatever, I'm burning 500W in a datacenter somewhere during that time?
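The arithmetic behind that guess, assuming power scales roughly linearly with parameter count. All figures are the commenter's estimates, not measurements, and the model ignores MoE sparsity, batching, and hardware differences:

```python
flash_watts = 50            # claimed peak for the 280B "DS4 Flash"
flash_params = 280e9
pro_params = 1.6e12         # "DS4 Pro"

# Linear scaling with parameter count (a crude assumption).
pro_watts = flash_watts * (pro_params / flash_params)
print(f"~{pro_watts:.0f} W")  # ~286 W, consistent with the ~300 W guess
```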
Plus, a Mac that's not running inference idles down to 1-5W, only drawing power when it needs to. Datacenters must maximize usage, individuals and their devices don't have to.
A Mac is also the rest of the personal computer!
Everyone who's betting their competency on the generosity of billionaires selling tokens at 1/10th-1/20th of their cost, or on a delusional future where capable open-source models fit on consumer-grade hardware, is actually cooked.
Of course there will always be larger flagship models, but if you can count on decent on-device inference, it materially changes what you can build.
You can argue whether the projection is too optimistic or not, but this project definitely made me a little bit optimistic on that end.
48 GB is enough for a capable LLM.
Doing that on consumer-grade hardware is entirely possible. The bottleneck is CUDA and other intellectual-property moats.
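Rough weights-only VRAM math behind the "48 GB is enough" claim, as a back-of-the-envelope sketch; KV cache and activations add overhead on top, and the model sizes chosen here are just illustrative:

```python
# Weights-only memory estimate for a given quantization level.
def weights_gb(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for params, bits in [(70, 4), (70, 8), (120, 4)]:
    print(f"{params}B @ {bits}-bit ~ {weights_gb(params, bits):.0f} GB")
# 70B  @ 4-bit ~ 35 GB  -> fits in 48 GB with headroom for KV cache
# 70B  @ 8-bit ~ 70 GB  -> does not fit
# 120B @ 4-bit ~ 60 GB  -> does not fit
```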
https://artificialanalysis.ai/models?models=gpt-5-5%2Cgpt-5-...
Nonetheless, eventually I want to build an at-home system. I imagine some smaller local model could handle metadata assignment quite well.
edit: Though TIL Mac Studio doesn't offer 512GB anymore... DRAM shortage lol. Rough.
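A minimal sketch of that metadata-assignment idea. It assumes an OpenAI-compatible local server (e.g. llama.cpp's llama-server) listening on localhost:8080; the URL, model name, and JSON schema are all placeholders to adjust for a real setup:

```python
import json
import urllib.request

def assign_metadata(text: str) -> dict:
    payload = {
        "model": "local",  # often ignored by single-model local servers
        "messages": [
            {"role": "system",
             "content": "Return only JSON with keys: title, tags, summary."},
            {"role": "user", "content": text},
        ],
    }
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Parse the model's JSON reply into a metadata dict.
    return json.loads(body["choices"][0]["message"]["content"])

print(assign_metadata("Photo of a receipt from a hardware store, dated 2024."))
```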
This is probably far from the raw intelligence provided by cloud providers.
Still, this sheds more light on local LLMs for agentic workflows.
This is also a fine example of a vibe-coded project with purpose, as you acknowledged.
I know this is Flash, but…
But other than this guy, did our whole society seriously never flamegraph this stuff before we started requesting nuclear reactors co-located at data centers and, like, more than 10% of GDP?
Someone needs to answer, because this isn't even an M4 or M5… WHAT THE FUCK
Sidenote: shout out antirez love my redis :)
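For what it's worth, a minimal sketch of profiling an inference call end to end in Python. `generate` is a hypothetical placeholder for whatever inference entry point is in play, and the py-spy invocation in the trailing comment is one common way to turn a running process into an actual flamegraph:

```python
import cProfile
import pstats

# Placeholder: swap in a real inference call (llama-cpp-python,
# transformers, or an HTTP request to a local server).
def generate(prompt: str) -> str:
    return prompt[::-1]

# Profile the call and print the hottest functions by cumulative time.
with cProfile.Profile() as prof:
    generate("why does inference cost 500W?")

pstats.Stats(prof).sort_stats("cumulative").print_stats(20)

# For an actual flamegraph rather than a flat table, a sampling
# profiler like py-spy can record the same workload:
#   py-spy record -o inference_flamegraph.svg -- python run_inference.py
```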
That said, I've found that most corporate environments are unintentionally hostile to this kind of optimization work. It's hard to justify until the work is already done. That means you often need people with the skills, means, and motivation to do this that are outside normal corporate constraints. There aren't many of those.
But you're right, I agree.
In the corporate world they sadly don't treat performance profiling as a first-class citizen.
Granted, I will say optimization without requirements may not be beneficial, but at least profiling itself seems worthwhile if you have use cases.
A lot of us have been working in the network packet-pushing software, distributed systems, and distributed storage space.
I’m happy to see more stuff like this :)
TL;DR: I've not seen a lot of flamegraphs of LLMs end to end… idk if anyone else has?