Discussion (202 Comments)
Qwen3.6 35b a3b is still my local champion but I may use this for auto complete and small tasks. Granite has recent training data which is nice. If the other small models got fine tuned on recent data I don't know if I would use this at all, but that alone makes it pretty decent.
The 4b they released was not good for my needs but could probably handle tool calls or something
I second this! Using the Unsloth Q6 (I forgot the exact name). Currently using it with forgecode (with zsh) on my Strix Halo, and it's surprisingly good. I would say it's somewhat similar to Haiku 4.5, plus additional privacy, minus speed. It's surprisingly fast for the hardware, given the speculative decoding, though PP is still on the slow side.
Can you share some of the parameters you use to enable tool calling and agentic usage?
Or, higher level, some philosophies on what approaches you are using for tuning to get better tool calling and/or agentic usage?
I'm having surprisingly good success with unsloth/Qwen3.6-27B-GGUF:Q4_K_M (love unsloth guys) on my RTX3090/24GB using opencode as the orchestrator.
It concocts some misleading paths, but the code often compiles, and I consider that a victory.
You have to watch it like you would watch a 14 year old boy who says he is doing his homework but you hear the sound effects of explosions.
The difference basically boils down to Gemma 4 making more assumptions and Qwen 3.6 sticking closer to the prompt. If your prompt is bad or leaves things up to the imagination, Gemma will do a better job; if you need strict prompt adherence, Qwen is better. Since local models are "dumb" I think it makes sense to prefer prompt adherence, but there are complex tasks that Gemma will complete much, much faster than Qwen because it makes the right assumptions the first time and, as a result, requires way fewer turns even with slower inference.
My speculation is that this comes from Google having a much better strategy for filtering their training data. I think this also shows up in the shape of the models' world knowledge. Gemma's world knowledge seems deeper even though the models are of roughly equivalent size to their Qwen counterparts, so it's most likely just concentrated in places that are more relevant to my queries.
Most notably in my testing, Gemma 4 31b is the ONLY local model that will tell me the significance of 1738 correctly. Even most flagship/cloud models answer with some hallucinatory nonsense.
Now, is this the usual use case? No, it's a benchmark I created specifically in order to put LLMs in situations where they can't just blast out their bash commands without having to interface with something else and adapt.
The Qwen models are quite solid though.
Maybe it could be fun to hook them up via a2a protocol as left and right brain agents operating in tandem.
Not for creative writing or NLP.
I have not benchmarked Qwen3.5 vs. Qwen3.6 for the same task, nor trialed Gemma4-26B. Guess it's time for some testing!
The 4b was okay. It didn't get all of my small math questions right and it didn't know about some of the libraries I use, but it was able to do some basic auto complete type stuff. For microscopic models I like the llama 3.2 3b more right now: it's a little faster and seems a little stronger for what I do. But everyone is different, and I don't think I'll use it anymore. This past month has been crazy for local model releases.
curious how people are leveraging these models
Using an 8B LLM for auto complete seems kind of like overkill. Couldn't a much smaller model handle that? IIRC there's a Qwen 1B model.
Original article on IBM research
Hugging face weights: https://huggingface.co/collections/ibm-granite/granite-41-la...
https://huggingface.co/ibm-granite/granite-speech-4.1-2b
designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST) for English, French, German, Spanish, Portuguese and Japanese.
Training purpose-specific miniature models lets you have a lot of tasks you can run with high confidence on consumer hardware.
I don’t know how many different little models this uses under the hood, but I was shocked at how good it was at the couple of document extraction tasks I threw at it.
Regardless, the kind of pruning people did in the 80s to fit programs on small devices is likely happening now. I'd bet most of the Chinese firms are doing it because of the US's silly GPU games, among other constraints.
If costs are high, they might reserve a certain percentage for big business at market prices (or just under) to cover the chip's mask costs.
After DDR5+ RAM, then GDDR5-6 RAM for use with AI accelerators. They might try to jump right in on a HBM alternative. That could be the percentage for AI buyers I just mentioned. Especially if they could put 40-80GB on accelerators like Intel ARC's.
If successful enough, they license MIPS' gaming GPU's to combine with this stuff with full, open-source stack and RTOS support for military sales.
- A lot of people suggesting llama-server's web ui, but that requires you use local AI (llama.cpp), it's persisting content into your browser rather than the server (so you can lose your chats), and it doesn't support much functionality.
- There are some pure-browser chat interfaces that are like llama-server but you can use remote LLMs. This is closer to what you want, but everything is stored in the browser, so backup is harder.
- There's LocalAI, which is like the llama-server option, but more stuff is built in and it persists data to disk. It's flashy and very easy if all you want to do is local AI.
- There's LM Studio, which is another thing like LocalAI, but a desktop app.
- There's OpenWebUI, where it's like LocalAI, except you don't do local inference, you use remote LLMs. It sucks to be honest, just stops working a lot of the time, UX is terrible, lots of weird bugs.
- There's OpenHands, which is more like Codex/Claude Code web UI. You run it locally and connect to remote LLMs. Kinda clunky, limited, poor design. Like most coding agents, it doesn't support all the features you would want, like LocalAI/OpenWebUI do.
- There's OpenCode's web UI, which is like OpenHands, but less crappy.
- There's Jan, which is probably what you want. It's a desktop app rather than a web UI.
Unfortunately it is pretty buggy, so I am maintaining a fork matching my personal needs with bugfixes and a few extra features.
LM Studio is nice in that it makes it easy to add tools, like search. Qwen 3.6 is such a small model that it lacks a lot of knowledge of the world (so it can hallucinate at an uncomfortable rate, which is a common failure mode of very small models), but it can use tools, so being able to search lets it research before answering. It has pretty good reasoning and tool calling, so it's actually pretty effective. I've been comparing Gemma 4 (31B at 8-bit, also very good with tools and reasoning for its size) and Qwen 3.6 (27B at 8-bit) against Claude Opus and Gemini Pro lately. Obviously the frontier models are better, but most of the time I find the tiny models are fine. I'm still not quite at the point where I'd be willing to code with local models, as the time wasted on hallucinations, logic bugs, and sloppy coding practices is much higher, as is the cost of security bugs that make it past review.
* https://docs.ollama.com/integrations/claude-code
Quick vibe check of it (8B @ Q6): seems promising. Bit of a clinical tone, but I can see that being useful for data processing and similar. You don't really want an LLM that spams you with emojis sometimes...
But yea dislike that style where each heading and bullet point gets an emoji
The article makes some good points about model design (how different size models within a family can get similar results, how to filter out hallucination, math result reinforcement), so that's worth understanding. It's analyzing a paper, which only discussed 3 sizes of the same model family. But what the article doesn't say is that, compared to other model families, Granite 4.1 8B sucks. The only benchmarks it does well on compared to other models are non-hallucination and instruction following. Qwen 3.5 4B (among other models) easily outclasses it on every other metric.
This article teaches a valuable lesson about reading articles in general. You can take useful information away from them (yes, despite being written by LLM). But you should also use critical thinking skills and be proactive to see if the article missed anything you might find relevant.
I'm using Gemini 3.1 Pro to help me research my thesis. Even with search enabled and in pro mode, it still invents entire papers that don't exist and lies about the contents of existing papers to relate them to the context or to appease me. If I submitted an LLM-written article based on the results it's given me, 80% of the article would be lies.
Commenting to complain that the article is LLM written is helpful too since some people aren't able to distinguish
The exact same thing is true of Human speech. You have no idea if anything a human says is true until you fact check it. But you don't fact check everything every person says, do you?
So what do you do instead? You use heuristics. Simple - and quite flawed - subconscious rules to stop worrying about things. You find a person you like, and you classify them "trustworthy", and believe almost all of what they say, not considering if any of it might be false. But of course, humans are fallible, and many of them receive "poisoned" input, and even hallucinate (making up information). They then spread that false information around. Yes, even the people you trust.
And when you're faced with something untrue, said by someone you trust, you rationalize it. "Oh, they just made a mistake." And you completely ignore that the person you trust told you a falsehood. Life is hard enough without having to question if everything we hear is false. So we just accept falsehoods from some people, and not others.
LLMs are likely more factual and knowledgeable today than humans are, thanks to their constant improvements via reinforcement. They're going to keep getting better too. But they'll never be perfect. Rather than rejecting anything they produce, my suggestion would be to do what you do with humans: trust them a little, verify big things, let the little things go, accept that there will be errors, and move on with life.
For sparse knowledge tasks, where you know that the model can't possibly have much training because even humans themselves don't have much knowledge there, use it as a brainstorming partner, not as a source. Or put relevant papers in its context to help you evaluate those papers in relation to your work. But it's just going to hurt itself in confusion trying to tie fuzzy ideas to sparse sources embedded in pages upon pages of mildly related Google search results.
Anti-AI people like to bring up hallucination as if everything AI generates is false.
I can write pages of text, with my own content, and then use AI to improve my writing and clarity. Then I review and edit. It might have some LLM markers in there, which I remove sometimes because it's distracting. But the final, AI assisted writing is easier to read and better organized. But all the ideas are mine. Hallucinations are not remotely a problem in this case.
You're complaining about facts that have been true since words have been written on paper. If you read the article with the same criticality you read any other article, you won't have the problem you complain about.
The reality is, you're only complaining because you hate AI. Cool, but don't dress it up and resort to name calling to browbeat the other guy.
If it has AI tells then I won't bother to continue reading, because it was either written by an AI or it was written by someone who can't tell the difference.
Either way it's probably a poor piece of writing.
I think instruction following is going to be the most useful thing these models do. Add a voice interface and access to a bunch of simple, straight-forward devices or APIs and you have a mildly useful assistant. If that can be done in 8B parameters it will soon run on edge devices. That's solid usefulness.
It's mind-boggling how bad current voice assistants sometimes are when you prompt them some fairly easy questions.
Maybe my point is something on the lines of "Just send me the prompt"[0]
[0] https://blog.gpkb.org/posts/just-send-me-the-prompt/
1) articles generated with context data that's trivial to find (or even embedded into the model)
2) articles generated with context data that's hard to find or not publicly available
But how can I tell if those are good points or not?
I don't want to invest time in reading something if the presence of those "good points" depends on a roll of the dice.
The problem is that in the past it took multiple times more effort and hours to write something than it took to read. That served two purposes:
1. Lazy people just looking for an audience were effectively gatekept from drowning the world with their every vapid thought.
2. Because supply was many times slower than consumption it was viable to give most articles a chance: the author could not drown me in a deluge even if they wanted to.
Having the criterion now that the author should spend at least as much effort creating the piece as they expect the reader to expend reading it is a damn useful bar: instead of reading 1000 AI articles just to find the one good one, I can simply read 10 human-authored articles and be certain that 9 of them have something worthwhile.
No, they aren't.
You are comparing writing produced with little to no effort to writing produced with the minimal effort required to communicate.
It's reasonable for people to complain that they are presented material that not even the author thought was worth the effort.
I already assume some comments here are LLM written.
I assume some people here have never programmed a single useful thing even once in their lives.
Right. This just says that Granite 4.1 8B is better than a previous version, Granite 4.0-H-Small, which has 32B, 9B active.
So, they made a less bad model than before. But that doesn't tell you anything about how it compares with other models.
I'm not sure it's pride so much as people voicing displeasure with the uncertainty about what went into the LLM prompt. This may have been a one-sentence prompt, or it may have been well-researched background that the LLM simply reformatted. Why waste minutes to hours on verifying it if it's possible someone spent 10 seconds on it? It's very easy to see their point.
People seem to indicate people they disagree with voicing their opinion about anything lately is some auto-fellatio, I wonder what causes them to think this way.
Why people don't edit out obvious sloppification and expect to still have readers left
I hear this sort of thing all the time now on YouTube from media/news personalities:
“And that’s the part nobody seems to be talking about.”
"And here's what keeps me up at night."
“This is where the story gets complicated.”
“Here’s the piece that doesn’t quite fit.”
“And this is where the usual explanation starts to break down.”
“Here’s what I can’t stop thinking about.”
“The part that should worry us is not the obvious one.”
“And that’s where the real problem begins.”
“But the more interesting question is the one no one is asking.”
“And this is where things stop being simple.”
It doesn't really worry me, but I think it's interesting that LLM speak sounds so distinctive, and how willing these media personalities are to be so obvious in reading out on TV what the LLM spat out.
I've never studied what LLMs say in depth, but it is interesting that my brain recognises the speech pattern so easily.
A writing teacher once excoriated me for saying that something was important. “Don’t tell me it’s important, show me, and let me decide, and if you do your job I’ll agree”
I don’t know how a completion can tell when it needs to do this. Mostly so far it doesn’t seem capable
BuzzFeed and Upworthy etc pioneered this for web 'news stories', then it got used in linkedin, twitter, and everywhere where views are more important than the content.
It's also exactly the Mr beast playbook, and got him to the largest channel on YouTube.
Any system attempting to capture human attention will use these techniques, nothing LLM-specific here at all.
No point creating busywork for yourself just shuffling words around when the information is there, no?
I guess it depends on what you want out of the article. Substance, or style?
Corporate announcements were never the places that literature and art were pushing the envelope. They were slop before, and they're slop now.
I ran it in LM Studio and got a pleasingly abstract pelican on a bicycle (genuinely not bad for a tiny 3B model - it can at least output valid SVG): https://gist.github.com/simonw/5f2df6093885a04c9573cf5756d34...
I have been using it with their Chunkless RAG concept and it is fitting very well! (for curious https://github.com/scub-france/Docling-Studio)
I'm convinced that SLMs are a real part of the solution for truly integrated AI in processes...
It is not the researchers' fault that some slop got posted here instead.
The gap that still matters most isn't intelligence — it's consistency on structured output. When you chain 5+ tool calls in sequence, even a small per-call reliability difference compounds fast. Would love to see Granite 4.1 benchmarked specifically on multi-step function calling rather than just general benchmarks.
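To put rough numbers on the compounding point above, here is a minimal sketch. It assumes each tool call succeeds independently with the same probability, which is a simplification (real failures are often correlated), and the per-call rates are made up for illustration, not measured for Granite or any other model.

```python
def chain_success(per_call: float, steps: int) -> float:
    """Probability that every call in a chain succeeds.

    Assumes the calls are independent and share one success rate;
    both are illustrative assumptions, not benchmark results.
    """
    return per_call ** steps


if __name__ == "__main__":
    # Hypothetical per-call reliabilities, chosen only to show the trend.
    for p in (0.99, 0.95, 0.90):
        print(
            f"per-call {p:.2f}: "
            f"5 calls -> {chain_success(p, 5):.3f}, "
            f"10 calls -> {chain_success(p, 10):.3f}"
        )
```

Under these assumptions a 99% vs. 90% per-call gap turns into roughly 95% vs. 59% over five chained calls, which is why a multi-step function calling benchmark would be more telling than single-call scores.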
But I don’t think it necessarily saved training cost; if it did, I’d be interested to learn how!
https://arxiv.org/abs/2101.03961
I doubt MoE is actually worth it, given how complicated high-performance expert routing and training is. But who knows, I don't.
Link to HF collection: https://huggingface.co/collections/ibm-granite/granite-41-la...
If techniques existed to shift from "guess the next highly probable" token to "guess the best next logical step", as some interpreted said research, should not that be the foremost objective?
https://huggingface.co/collections/ibm-granite/granite-embed...
311M and 97M versions.
Granite Vision 4.1; Granite Speech 4.1; Granite Guardian 4.1; Granite Embedding Multilingual R2 - with, of course, the "Small Language Models"
https://research.ibm.com/blog/granite-4-1-ai-foundation-mode...
edit: I just realised they do actually have a 30b release alongside this. Haven't tried it yet.
An interesting choice
> While reasoning models have grown in popularity in recent years, their abilities aren’t always the most efficient way to get a result. In enterprise settings, token costs and speed are often as important as performance. That is why turning to less expensive, non-reasoning models with similar benchmark performance for select tasks like instruction following and tool calling makes sense for enterprise users.
I guess they currently don't have the ability to do proper RLVR.
Incidentally: I am trying to spend some time researching the progress in this area (the jump from parroting, to inconsistent apparent reasoning, to reliable reasoning).
Then something broke. The RLHF stage, while improving chat quality, caused math benchmark scores to drop. GSM8K and DeepMind-Math both regressed."
Observation: Math (which, when fully decomposed, results in Logic) is at the core of how computers (traditional/older, non-LLM programming languages) work. If an LLM gets Math training wrong at any stage for any reason, then, in my opinion, that should be viewed as something that needs to be fixed at a lower level, not a higher one; not a later training level...
I think it would be interesting exercise to train an LLM that only deals in simple Math, simple English, and only the ability to compute simple equations (+,-,x,/)... like, what's the absolute minimum in terms of text and layers necessary to train a model like that?
I think some interesting understandings could potentially be had by experimentation like that...
I myself would love a pure (simplest, smallest possible) Text-to-Math only LLM (TTMLLM, TTMSLM?), along with all of the necessary corpuses (which would ideally be as small as possible) and instructions necessary to train such an LLM...
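For what it's worth, the corpus side of that experiment is easy to sketch. The following is only an illustration of what a tiny synthetic "simple equations" dataset could look like; the function names, the single-digit operand range, and the `a op b = c` text format are all assumptions of mine, and it says nothing about how many layers or tokens a model would actually need.

```python
import random


def make_example(rng: random.Random, max_operand: int = 9) -> str:
    """Return one line such as '6 x 7 = 42' using +, -, x or /."""
    op = rng.choice("+-x/")
    a = rng.randint(0, max_operand)
    b = rng.randint(1, max_operand)  # never zero, so division is defined
    if op == "+":
        result = a + b
    elif op == "-":
        result = a - b
    elif op == "x":
        result = a * b
    else:
        # Keep division exact: build the dividend from the quotient.
        a, result = a * b, a
    return f"{a} {op} {b} = {result}"


def make_corpus(n: int, seed: int = 0) -> list[str]:
    """Generate n example lines deterministically from a seed."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]


if __name__ == "__main__":
    for line in make_corpus(5):
        print(line)
```

Because every line is generated from the operation itself, the corpus is correct by construction, so any arithmetic errors a model trained on it makes would come from the model and not from noisy data.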
https://arxiv.org/pdf/2401.16818
With those results, I would've already done that in any models I got to train. There's also the principle that the LLM's are often better at what they saw last in their training set. That also justifies putting more logic, code, and math in at the end for an analytical or coding model. So, a few precedents for that technique already.
Granite 4.1 8B: "I apologize for the confusion earlier. Let's clarify the derivation of the VIDEO__MASK_00b based on the description "Bits 5-6 indicate the video mode" from the equipment_list section.
Given that bits 5 and 6 are used for video mode, we need to create a mask that isolates these two bits:
Bit 5 is the least significant bit of the group (2^5 = 32 decimal, 0b00100000 in 8-bit form). Bit 6 is the next bit up (2^6 = 64 decimal, 0b01000000 in 8-bit form). Combining these bits: ...
Therefore, the correct mask for VIDEO_MASK_00b should be 0x00C0"
Errors on top of errors when converting a description into binary numbers. It's hopeless for basic tasks like parsing/generating headers :(
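For reference, the arithmetic the model fumbled can be checked in a few lines. This sketch assumes the usual convention that bit 0 is the least significant bit (which matches the model's own "2^5 = 32" step); the names `bit_mask`, `VIDEO_MODE_MASK` and `video_mode` are hypothetical and not taken from the header being discussed.

```python
def bit_mask(low: int, high: int) -> int:
    """Mask with bits low..high set (inclusive, bit 0 = least significant)."""
    width = high - low + 1
    return ((1 << width) - 1) << low


# Bits 5-6: 0b00100000 | 0b01000000 = 0b01100000
VIDEO_MODE_MASK = bit_mask(5, 6)


def video_mode(equipment_byte: int) -> int:
    """Extract the 2-bit video mode field from the low byte."""
    return (equipment_byte & VIDEO_MODE_MASK) >> 5


if __name__ == "__main__":
    print(hex(VIDEO_MODE_MASK))  # 0x60
    print(hex(bit_mask(6, 7)))   # 0xc0, i.e. bits 6-7, not 5-6
```

So 32 + 64 = 96 = 0x60. The 0x00C0 in the model's answer would select bits 6 and 7 instead, even though the individual bit values it listed were right.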
show me.
> Apache 2.0 across the board, so commercial use is clean.
Did you just stop when you saw open source and come post this here because you couldn't be bothered to... look at the project and see it's cleanly and clearly listed.
Edit: Like. I get it. It's fine to question open source. But this isn't hidden. It's repeated and made clear multiple times. They even link to the license: https://www.apache.org/licenses/LICENSE-2.0
It wasn't hidden, it wasn't in some weird, out-of-the-way place. In fact, I found it so easily that I genuinely questioned whether it was real because of your comment. Like, why would anyone post what you posted if it was this easy to find?
NOPE! It was right there.
If you check HF you will see it's Apache 2.0, and the datasets were also permissive.
It's one of the few models on the market where the creator indemnifies it against copyright claims.
https://research.ibm.com/blog/granite-ethical-ai
https://allenai.org/olmo
I'm just giving it as an example. I haven't looked at Granite's repos.