Discussion (208 Comments)

simonw · about 2 hours ago
The pelican is excellent for a 16.8GB quantized local model: https://simonwillison.net/2026/Apr/22/qwen36-27b/

I ran it on an M5 Pro with 128GB of RAM, but it only needs ~20GB of that. I expect it will run OK on a 32GB machine.

Performance numbers:

  Reading: 20 tokens, 0.4s, 54.32 tokens/s
  Generation: 4,444 tokens, 2min 53s, 25.57 tokens/s
I like it better than the pelican I got from Opus 4.7 the other day: https://simonwillison.net/2026/Apr/16/qwen-beats-opus/
throwaw12 · about 2 hours ago
I feel like this time it is indeed in the training set, because it is too good to be true.

Can you run your other tests and see the difference?

simonw · about 2 hours ago
It went pretty wild with "Generate an SVG of a NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER":

https://gist.github.com/simonw/95735fe5e76e6fdf1753e6dcce360...

throwaw12 · about 2 hours ago
compared to your test with GLM 5.1, this indeed looks off

https://xcancel.com/simonw/status/2041646779553476801

m3kw9 · about 1 hour ago
If they cook these in, I wonder what else was cooked in there to make it look good.
zargon · about 1 hour ago
Everything is benchmaxxed. Whack-a-mole training is at least as representative of what is getting added to models as more general training advances.
ahoog42 · about 1 hour ago
at what point do model providers optimize for the "pelican riding a bicycle" test so they place well on Simon's influential benchmark? :-)
hansonkd · about 1 hour ago
They almost certainly are, even if unknowingly, because HN and all blogs get piped continuously into all models' training corpus.
anonzzziesabout 4 hours ago
I wish that all announcements of models would show what (consumer) hardware you can run this on today, costs and tok/s.
Aurornis · about 3 hours ago
The 27B model they release directly would require significant hardware to run natively at 16-bit: A Mac or Strix Halo 128GB system, multiple high memory consumer GPUs, or an RTX 6000 workstation card.

This is why they don’t advertise which consumer hardware it can run on: Their direct release that delivers these results cannot fit on your average consumer system.

Most consumers don’t run the model they release directly. They run a quantized model that uses a lower number of bits per weight.

The quantizations come with tradeoffs. You will not get the exact results they advertise using a quantized version, but you can fit it on smaller hardware.

The previous 27B Qwen3.5 model had reasonable performance down to Q5 or Q4, depending on your threshold for quality loss. This was usable on a unified memory system (Mac, Strix Halo) with 32GB of extra RAM, so generally a 64GB Mac. They could also be run on an Nvidia 5090 with 32GB of VRAM, or a pair of 16GB or 24GB GPUs, which would not run as fast due to the split.

Watch out for some of the claims about running these models on iPhones or smaller systems. You can use a lot of tricks and heavy quantization to run it on very small systems, but the quality of output will not be usable. There is a trend of posting “I ran this model on this small hardware” repos for social media bragging rights, but the output isn’t actually good.

ryandrake · about 3 hours ago
Yea, this is currently the confusing part of running local models for newbies: Even after you have decided which model you want to run, and which org's quantizations to use (let's just assume Unsloth's for example), there are often dozens of quantizations offered, and choosing among them is confusing.

Say you have a GPU with 20GB of VRAM. You're probably going to be able to run all the 3-bit quantizations with no problem, but which one do you choose? Unsloth offers[1] four of them: UD-IQ3_XXS, Q3_K_S, Q3_K_M, UD-Q3_K_XL. Will they differ significantly? What are each of them good at? The 4-bit quantizations will be a "tight squeeze" on your 20GB GPU. Again, Unsloth steps up to the plate with seven(!!) choices: IQ4_XS, Q4_K_S, IQ4_NL, Q4_0, Q4_1, Q4_K_M, UD-Q4_K_XL. Holy shit where do I even begin? You can try each of them to see what fits on your GPU, but that's a lot of downloading, and then...

Once you [guess and] commit to one of the quantizations and do a gigantic download, you're not done fiddling. You need to decide at the very least how big a context window you need, and this is going to be trial and error. Choose a value, try to load the model, if it fails, you chose too large. Rinse and repeat.

Then finally, you're still not done. Don't forget the parameters: temperature, top_p, top_k, and so on. It's bewildering!

1: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF
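
For a rough first cut before downloading anything, the quant name mostly encodes bits per weight, so a back-of-envelope fit check is possible. A minimal sketch; the bits-per-weight figures below are approximate averages, not official numbers:

```python
# Back-of-envelope GGUF sizing: params * bits-per-weight / 8, plus ~5% overhead
# for embeddings and metadata. BPW values are rough averages for llama.cpp
# quant types, not official figures.
APPROX_BPW = {
    "Q3_K_S": 3.5, "Q3_K_M": 3.9,
    "IQ4_XS": 4.3, "Q4_K_S": 4.6, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q8_0": 8.5,
}

def est_size_gb(params_b: float, quant: str, overhead: float = 1.05) -> float:
    """Estimated file/VRAM size in GB for params_b billion parameters."""
    return params_b * APPROX_BPW[quant] / 8 * overhead

def fits(params_b: float, quant: str, vram_gb: float, reserve_gb: float = 3.0) -> bool:
    """Reserve a few GB for KV cache and activations on top of the weights."""
    return est_size_gb(params_b, quant) + reserve_gb <= vram_gb

for q in ("Q3_K_M", "Q4_K_M", "Q8_0"):
    print(f"{q}: ~{est_size_gb(27, q):.1f} GB, fits in 20 GB: {fits(27, q, 20)}")
```

On a 27B model this puts Q3_K_M comfortably inside 20GB and Q4_K_M right at the edge, which matches the "tight squeeze" above; actual GGUF files vary a bit from these estimates.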

danielhanchen · about 3 hours ago
We made Unsloth Studio which should help :)

1. Auto-sets the best official parameters for all models

2. Auto determines the largest quant that can fit on your PC / Mac etc

3. Auto determines max context length

4. Auto heals tool calls, provides python & bash + web search :)

Aurornis · about 3 hours ago
> Say you have a GPU with 20GB of VRAM. You're probably going to be able to run all the 3-bit quantizations with no problem, but which one do you choose? Unsloth offers[1] four of them: UD-IQ3_XXS, Q3_K_S, Q3_K_M, UD-Q3_K_XL

There are actually two problems with this:

First, the 3-bit quants are where the quality loss really becomes obvious. You can get it to run, but you’re not getting the quality you expected. The errors compound over longer sessions.

Second, you need room for context. If you have become familiar with the long 200K contexts you get with SOTA models, you will not be happy with the minimal context you can fit into a card with 16-20GB of RAM.

The challenge for newbies is learning to identify the difference between being able to get a model to run, and being able to run it with useful quality and context.
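
The context cost is easy to underestimate because it scales linearly with tokens. A rough sketch of KV cache size; the 27B-class architecture numbers here (layers, KV heads, head dim) are hypothetical placeholders, not the real model config:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """K and V tensors, one entry per layer per token (fp16 cache by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 1e9

# Hypothetical 27B-class config: 60 layers, 8 KV heads (GQA), head dim 128.
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gb(60, 8, 128, ctx):.1f} GB of KV cache")
```

Under these assumptions an 8K context costs about 2 GB but a 128K context costs over 30 GB on its own, which is why long SOTA-style contexts don't fit next to the weights on a 16-20GB card.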

ndriscoll · about 3 hours ago
Note that you could also run them on AMD (and presumably Intel) dGPUs. e.g. I have a 32GB R9700, which is much cheaper than a 5090, and runs 27B dense models at ~20 t/s (or MoE models with 3-4B active at ~80t/s). I expect an Arc B70 would also work soon if it doesn't already, and would likely be the price/perf sweet spot right now.

My R9700 does seem to have an annoying firmware or driver bug[0] that causes the fan to usually be spinning at 100% regardless of temperature, which is very noisy and wastes like 20+ W, but I just moved my main desktop to my basement and use an almost silent N150 minipc as my daily driver now.

[0] Or manufacturing defect? I haven't seen anyone discussing it online, but I don't know how many owners are out there. It's a Sapphire fwiw. It does sometimes spin down, the reported temperatures are fine, and IIRC it reports the fan speed as maxed out, so I assume software bug where it's just not obeying the fan curve

acrispino · about 1 hour ago
I have 2x ASRock R9700. One of them was noticeably noisier than the other and eventually developed an annoying vibration while in the middle of its fan curve. ASRock replaced it under RMA.
zozbot234 · about 2 hours ago
Yup, I suppose that these smaller, dense models are in the lead wrt. fast inference with consumer dGPUs (or iGPUs depending on total RAM) with just enough VRAM to contain the full model and context. That won't give you anywhere near SOTA results compared to larger MoE models with a similar amount of active parameters, but it will be quite fast.
muyuu · about 3 hours ago
i have a Strix Halo machine

typically those dense models are too slow on Strix Halo to be practical, expect 5-7 tps

you can get an idea by looking at other dense benchmarks here: https://strixhalo.zurkowski.net/experiments - i'd expect this model to be tested here soon, i don't think i will personally bother

hedgehog · about 2 hours ago
This one is around 250 t/s prefill and 12.4 generation in my testing.
wuschel · about 2 hours ago
> but the quality of output will not be usable

Making the right model pick is one of the key problems for a local user. Do you have any references where one can see a mapping from problem query to model response quality?

alex7o · about 1 hour ago
Because when you pay for a subscription they don't silently quantize the model a few weeks after release, and you can no longer get the full model running.

Otherwise there's no need for full fp16; int8 works 99% as well for half the memory, and the lower you go, the more quality you pay for the quants. But int8 is super safe imo.

Oras · about 2 hours ago
If these models reach the quality of Opus 4.5, then a DGX could be a good alternative for serious dev teams to run local models. It's not that expensive, and the time to ROI is short.
varispeed · 24 minutes ago
Is it the same idea that when you go to luxury store you don't see prices on display?

Seems like nobody wants to admit they exclude working class from the ride.

anonym29 · about 2 hours ago
You absolutely do not need to run at full BF16. The quality loss between BF16 (55.65 GB in GGUF) and Q8_0 (30.44 GB in GGUF) is essentially zero - think on the order of magnitude of +0.01-0.03 perplexity, or ~0.1-0.3% relative PPL increase. The quality loss between BF16 and Q4_K_M (18.66 GB in GGUF) is close to imperceptible, with perplexity changes in the +0.1-0.3 ballpark, or ~1-3% relative PPL increase. This would correlate to a 0-2% drop on downstream tasks like MMLU/GSM8K/HellaSwag: essentially indistinguishable.
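
To make the "relative PPL increase" arithmetic concrete, here is the conversion; the 6.0 baseline perplexity is a hypothetical figure for illustration only:

```python
def rel_ppl_increase(base_ppl: float, quant_ppl: float) -> float:
    """Relative perplexity increase of a quant over the BF16 baseline."""
    return (quant_ppl - base_ppl) / base_ppl

base = 6.0  # hypothetical BF16 baseline PPL, for illustration only
print(f"Q8_0-like (+0.02 PPL):   +{rel_ppl_increase(base, base + 0.02):.1%}")
print(f"Q4_K_M-like (+0.15 PPL): +{rel_ppl_increase(base, base + 0.15):.1%}")
```

So an absolute bump of +0.02 on a baseline of 6.0 is ~0.3% relative, and +0.15 is ~2.5%, which is where the "essentially zero" vs "close to imperceptible" distinction comes from.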

You absolutely do NOT need a $3000 Strix Halo rig or a $4000 Mac or a $9000 RTX 6000 or "multiple high memory consumer GPUs" to run this model at extremely high accuracy. I say this as a huge Strix Halo fanboy (Beelink GTR 9 Pro), mind you. Where Strix Halo is more necessary (and actually offers much better performance) are larger but sparse MoE models - think Qwen 3.5 122B A10B - which offers the total knowledge (and memory requirements) of a 122B model, with processing and generation speed more akin to a 10B dense model, which is a big deal with the limited MBW we get in the land of Strix Halo (256 GB/s theoretical, ~220 GB/s real-world) and DGX Spark (273 GB/s theoretical - not familiar with real-world numbers specifically off the top of my head).

I would make the argument, as a Strix Halo owner, that 27B dense models are actually not particularly pleasant or snappy to run on Strix Halo, and you're much better off with those larger but sparse MoE models with far fewer active parameters on such systems. I'd much rather have an RTX 5090, an Arc B70 Pro, or an AMD AI PRO R9700 (dGPUs with 32GB of GDDR6/7) for 27B dense models specifically.

zozbot234 · about 2 hours ago
I'm all for running large MoE models on unified memory systems, but developers of inference engines should do a better job of figuring out how to run larger-than-total-RAM models on such systems, streaming in sparse weights from SSD but leveraging the large unified memory as cache. This is easily supported with pure-CPU inference via mmap, but there is no obvious equivalent when using the GPU for inference.
p_stuart82 · about 1 hour ago
tbh ~1-3% PPL hit from Q4_K_M stopped being the bottleneck a while ago. the bottleneck is the 48 hours of guessing llama.cpp flags and chat template bugs before the ecosystem catches up. you are doing unpaid QA.
benob · about 3 hours ago
I get ~5 tokens/s on an M4 with 32G of RAM, using:

  llama-server \
   -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M \
   --no-mmproj \
   --fit on \
   -np 1 \
   -c 65536 \
   --cache-ram 4096 -ctxcp 2 \
   --jinja \
   --temp 0.6 \
   --top-p 0.95 \
   --top-k 20 \
   --min-p 0.0 \
   --presence-penalty 0.0 \
   --repeat-penalty 1.0 \
   --reasoning on \
   --chat-template-kwargs '{"preserve_thinking": true}'
35B-A3B model is at ~25 t/s. For comparison, on an A100 (~RTX 3090 with more memory) they fare respectively at 41 t/s and 97 t/s.

I haven't tested the 27B model yet, but 35B-A3B often goes off the rails after 15k-20k tokens of context. You can have it do basic things reliably, but certainly not at the level of "frontier" models.

danielhanchen · about 3 hours ago
We also made some dynamic MLX ones if they help - it might be faster for Macs, but llama-server definitely is improving at a fast pace.

https://huggingface.co/unsloth/Qwen3.6-27B-UD-MLX-4bit

DarmokJalad1701 · about 1 hour ago
What exactly does the .sh file install? How does it compare to running the same model in, say, omlx?
dunb · about 3 hours ago
Why use --fit on on an M4? My understanding was that given the unified memory, you should push all layers to the GPU with --n-gpu-layers all. Setting --flash-attn on and --no-mmap may also get you better results.
kpw94 · about 2 hours ago
When you say tok/s here are you describing the prefill (prompt eval) token/s or the output generation tok/s?

(Btw I believe the "--jinja" flag is by default true since sometime late 2025, so not needed anymore)

zargon · about 2 hours ago
If someone doesn't specifically say prefill then they always mean decode speed. I have never seen an exception. Most people just ignore prefill.
wuschel · about 1 hour ago
How is the quality of model answers to your queries? Are they stable over time?

I am wondering how to measure that anyway.

cyanydeez · 38 minutes ago
Using opencode and Qwen-Coder-Next I get it reliably up to about 85k before it takes too long to respond.

I tried the other qwen models and the reasoning stuff seems to do more harm than good.

bityard · about 3 hours ago
There are infinite combinations of CPU/GPU capable of running LLMs locally. What most people do is buy the system they can afford and roughly meets their goals and then ball-park VRAM usage by looking at the model size and quantization.

For a more detailed analysis, there are several online VRAM calculators. Here's one: https://smcleod.net/vram-estimator/

If you have a huggingface account, you can set your system configuration and then you get little icons next to each quant in the sidebar. (Green: will likely fit, Yellow: Tight fit, Red: will not fit)

Further, t/s depends greatly on a lot of different factors; the best you might get is a guess based on context size.

One thing about running local LLMs right now, is that there are tradeoffs literally everywhere and you have to choose what to optimize for down to the individual task.

proxysna · about 3 hours ago
Qwen3.5-27B with a 4-bit quant can be run on a 24G card with no problem. With 2 Nvidia L4 cards and some additional vLLM flags, I am serving 10 developers at 20-25 tok/sec; off-peak is around 40 tok/sec. Developers are OK with that performance, but of course they requested more GPUs for added throughput.
tandr · about 3 hours ago
What would be these additional vllm flags, if you don't mind sharing?
PcChip · about 2 hours ago
question: why not use something like Claude? is it for security reasons?
lambda · about 1 hour ago
Some people would rather not hand over all of their ability to think to a single SaaS company that arbitrarily bans people, changes token limits, tweaks harnesses and prompts in ways that cause it to consume too many tokens, or too few to complete the task, etc.

I don't use any non-FLOSS dev tools; why would I suddenly pay for a subscription to a single SaaS provider with a proprietary client that acts in opaque and user hostile ways?

UncleOxidant · about 3 hours ago
For Qwen3.5-27b I'm getting in the 20 to 25 tok/sec range on a 128GB Strix Halo box (Framework Desktop). That's with the 8-bit quant. It's definitely usable, but sometimes you're waiting a bit, though I'm not finding it problematic for the most part. I can run the Qwen3-coder-next (80b MoE) at 36tok/sec - hoping they release a Qwen3.6-coder soon.
lambda · about 1 hour ago
That sounds high for a Strix Halo with a dense 27B model. Are you talking about prefill (prompt eval, which can happen in parallel) or generation (decode) when you quote tokens per second? Usually if people quote only one number they're quoting generation speed, and I would be surprised if you got that for generation speed on a Strix Halo.
bityard · about 2 hours ago
I have a Framework Desktop too and 20-25 t/s is a lot better than I was expecting for such a large dense model. I'll have to try it out tonight. Are you using llama.cpp?
UncleOxidant · about 2 hours ago
LMStudio, but it uses llama.cpp to run inference, so yeah. This is with the vulkan backend, not ROCm.
petu · about 2 hours ago
> Qwen3.5-27b 8-bit quant 20 to 25 tok/sec

Is that with some kind of speculative decoding? Or total throughput for parallel requests?

ekojs · about 3 hours ago
As this is a dense model and it's pretty sizable, 4-bit quantization can be nearly lossless. With that, you can run this on a 3090/4090/5090. You can probably even go FP8 with 5090 (though there will be tradeoffs). Probably ~70 tok/s on a 5090 and roughly half that on a 4090/3090. With speculative decoding, you can get even faster (2-3x I'd say). Pretty amazing what you can get locally.
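
For intuition on the speculative decoding multiplier, the standard analysis (from the Leviathan et al. speculative sampling paper) gives the expected tokens accepted per target-model forward pass; this sketch ignores the draft model's own cost, and the acceptance rates are illustrative:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass when drafting
    k tokens with per-token acceptance probability alpha (draft cost ignored)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With a 4-token draft, a range of plausible acceptance rates:
for alpha in (0.6, 0.8, 0.9):
    print(f"alpha={alpha}: ~{expected_tokens_per_pass(alpha, 4):.1f} tokens/pass")
```

Acceptance rates in the 0.6-0.9 range give roughly 2-4 tokens per pass under these assumptions, which lines up with the 2-3x real-world speedups claimed above once draft overhead is subtracted.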
Aurornis · about 3 hours ago
> As this is a dense model and it's pretty sizable, 4-bit quantization can be nearly lossless

The 4-bit quants are far from lossless. The effects show up more on longer context problems.

> You can probably even go FP8 with 5090 (though there will be tradeoffs)

You cannot run these models at 8-bit on a 32GB card because you need space for context. Typically it would be Q5 on a 32GB card to fit context lengths needed for anything other than short answers.

alex7o · about 1 hour ago
Turboquant on 4-bit helps a lot as well for keeping context in VRAM, but int4 is definitely not lossless. It all depends, though; for some people it's sufficient.
ekojs · about 3 hours ago
> You cannot run these models at 8-bit on a 32GB card because you need space for context

You probably can actually. Not saying that it would be ideal but it can fit entirely in VRAM (if you make sure to quantize the attention layers). KV cache quantization and not loading the vision tower would help quite a bit. Not ideal for long context, but it should be very much possible.

I addressed the lossless claim in another reply but I guess it really depends on what the model is used for. For my usecases, it's nearly lossless I'd say.

zozbot234 · about 3 hours ago
4-bit quantization is almost never lossless especially for agentic work, it's the lowest end of what's reasonable. It's advocated as preferable to a model with fewer parameters that's been quantized with more precision.
ekojs · about 3 hours ago
Yeah, figure the 'nearly lossless' claim is the most controversial thing. But in my defense, ~97% recovery in benchmarks is what I consider 'nearly lossless'. When quantized with calibration data for a specialized domain, the difference in my internal benchmark is pretty much indistinguishable. But for agentic work, 4-bit quants can indeed fall a bit short in long-context usecase, especially if you quantize the attention layers.
binary132 · about 3 hours ago
That seems awfully speculative without at least some anecdata to back it up.
arcanemachiner · about 3 hours ago
Sure, go get some.

This isn't the first open-weight LLM to be released. People tend to get a feel for this stuff over time.

Let me give you some more baseless speculation: Based on the quality of the 3.5 27B and the 3.6 35B models, this model is going to absolutely crush it.

ekojs · about 3 hours ago
Not at all, I actually run ~30B dense models for production and have tested out 5090/3090 for that. There are gotchas of course, but the speed/quality claims should be roughly there.
xngbuilds · about 3 hours ago
For Apple Mac, there is https://omlx.ai/benchmarks
chrsw · about 3 hours ago
These might help if the provider doesn't offer the same details themselves. Of course, we have to wait for the newly released models to get added to these sites.

https://llmfit.io/

https://modelfit.io/

rubiquity · about 3 hours ago
At 8-bit quantization (q8_0) I get 20 tokens per second on a Radeon R9700.
esskay · about 1 hour ago
CaniRun's not a great tool: look how long it's been since it was updated. It doesn't have any of the Qwen3.6 models on the list, nor the new Kimi one. In fact it's missing many of the "popular" models.
jjcm · about 3 hours ago
Fwiw, huggingface does this on the page where you download the weights. Slightly different format though - you put all the hardware you have, and it shows which quants you can run.
arcanemachiner · about 3 hours ago
Divide the value before the B by 2, and there's your answer if you get a Q4_K_M quant. Plus a bit of room for KV cache.

TLDR: If you have 14GB of VRAM, you can try out this model with a 4-bit quant.

Tokens per second is an unreasonable ask, since every card is different: are you using GGUF or not; CUDA, ROCm, Vulkan, or MLX; what optimizations are in your version of the inference software; what flags you are running; etc.

Note that it's a dense model (the Qwen models have another value at the end of the MoE model names, e.g. A3B) so it will not run very well in RAM, whereas with a MoE model, you can spill over into RAM if you don't have enough VRAM, and still have reasonable performance.

Using these models requires some technical know-how, and there's no getting around that.

neonstatic · about 1 hour ago
I just ran Unsloth's Q4_K_M quant on a MacBook Pro M1 Max (2021), 32 GB RAM. Getting about 9 tok/s. Reported RAM usage: 15.9 GB. It's okay, but a bit slow for my taste.

An example of a query and a response is below. It was done with 9.09 tok/s, 893 tokens, 7.56 seconds to first token, thought for 1 minute and 9 seconds.

Question: I gave the model this truth table and asked it to tell me what boolean function would produce the output given A, B as inputs.

  | A | B | Output |
  | :--- | --- | ---: |
  | (T) | (F)  | (T) |
  | (T) | (T) | (F) |
  | (F) | (F) | (F) |
  | (F) | (T) | (F) |
Answer: (note this is part of a larger conversation, where I was getting it to help me with some boolean logic):

  The boolean condition for your table is simply:

  ```rust
  A && !B
  ```

  ### Why?
  Your table shows the output is `true` **only** when `A` is true and `B` is  false. That's exactly what `A && !B` expresses.

  | A | B | `A && !B` | Output |
  |---|---|-----------|--------|
  |  (T) |  (F) | `T && T = T` | (T) |
  |  (T) |  (T) | `T && F = F` | (F) |
  |  (F) |  (F) | `F && T = F` | (F) |
  |  (F) |  (T) | `F && F = F` | (F) |

  ### In Rust:
  ```rust
  if A && !B {
      // output is true
  } else {
      // output is false
  }
  ```

  This is the most direct and idiomatic way to express that logic. Let me know if you need it adapted back to your `live_ticks` / `on_disconnect` context!
scosman · about 1 hour ago
this is what the unsloth post is for
CamperBob2 · about 2 hours ago
27B dense is not a screamer, even on an RTX 6000, but it will run at full precision with (more than) enough room for context at the model's own capacity. You can expect about 30 tokens/second after prompt processing. Quants will likely run similarly well on the 16/24/32 GB consumer GPUs.

The 3.5 27B model was a strong and capable reasoner, so I have high hopes for this one. Thanks to the team at Qwen for keeping competition in this space alive.

underlines · about 4 hours ago
depends on format, compute type, quantization and kv cache size.
mottosso · about 4 hours ago
Specs for whatever they used to achieve the benchmarks would be a good start.
bityard · about 2 hours ago
The benchmarks in the model card are purported to be measurements of model quality (ability to perform tasks with few errors), not speed.

They almost certainly run these benchmarks on their own cloud infrastructure (Alibaba afaik), which is typically not hardware that even the most enthusiastic homelab hobbyist can afford.

Aurornis · about 3 hours ago
The benchmarks are from the unquantized model they release.

This will only run on server hardware, some workstation GPUs, or some 128GB unified memory systems.

It’s a situation where if you have to ask, you can’t run the exact model they released. You have to wait for quantizations to smaller sizes, which come in a lot of varieties and have quality tradeoffs.

jauntywundrkind · about 3 hours ago
I would detest the time/words it takes to hand-hold through such a review, teaching folks the basics about LLMs like this.

It's also a section that, with hope, becomes obsolete sometime semi soon-ish.

syntaxing · about 2 hours ago
Been using Qwen 3.6 35B and Gemma 4 26B on my M4 MBP, and while it’s no Opus, it does 95% of what I need which is already crazy since everything runs fully local.
FuckButtons · about 1 hour ago
It’s good enough that I’ve been having codex automate itself out of a job by delegating more and more to it.

Very excited for the 122b version as the throughput is significantly better for that vs the dense 27b on my m4.

throwaw12 · about 2 hours ago
can you expand more on what you mean by 95%?

There are 2 aspects I am interested in:

1. accuracy - is it 95% accuracy of Opus in terms of output quality (4.5 or 4.6)?

2. capability-wise - 95% accuracy when calling your tools and performing agentic work compared to Opus - e.g. trip planning?

syntaxing · about 2 hours ago
1. What do you mean by accuracy? Like the facts and information? If so, I use a Wikipedia/Kiwix MCP server. Or do you mean tool call accuracy?

2. 3.6 is noticeably better than 3.5 for agentic uses (I have yet to use the dense model). The downside is that there’s so little personality, you’ll find more entertainment talking to a wall. Anything for creative use like writing or talking, I use Gemma 4. I also use Gemma 4 as a “chat” bot only, no agents. One amazing thing about the Gemma models is the vision capabilities. I was able to pipe in some handwritten notes and it converted into markdown flawlessly. But my handwriting is much better than the typical engineer’s chicken scratch.

throwaw12 · about 2 hours ago
By accuracy I meant how close the output is to your expectations. For example, if you ask an 8B model to write a C compiler in C and it outputs theory of how to write a compiler plus pseudocode in Python, that's off on 2 measures: (1) I haven't asked for theory, (2) I haven't asked for it in Python.

Or if you want to put it differently, if your prompt is super clear about the actions you want it to do, is it following it exactly as you said or going off the rails occasionally

jameson · about 3 hours ago
What competitive advantage do OpenAI/Anthropic have when companies like Qwen/Minimax/etc. are open-sourcing models that show similar (though below OpenAI/Anthropic) benchmark results?

Also, the token prices of these open source models are at a fraction of Anthropic's Opus 4.6[1]

[1]: https://artificialanalysis.ai/models/#pricing

fnordpiglet · about 3 hours ago
For coding, quality at the margin is often crucial, even at a premium. It's not the same as cranking out spam emails or HN posts at scale. This is why the marginal comp difference between your median engineer and your P99 engineer is substantial, while the marginal comp difference between your median pick-and-packer and your P99 pick-and-packer isn't.

I'd also say that keeping the frontier shops competitive, while costing them R&D in the present, is beneficial to them by forcing them to make a better and better product, especially in the value-add space.

Finally, particularly for Anthropic, they are going for being the more trustworthy shop. Even Alibaba is hosting paid frontier models for service revenue, but if you're not a Chinese shop, would you really host your production code development workload on a Chinese-hosted provider? OpenAI is sketchy enough, but even there I have marginal confidence they aren't just wholesale mining data for trade secrets, even if they are using it for model training. Anthropic I trust slightly more. Hence the premium. No one really believes at face value that a Chinese-hosted firm isn't mass trolling every competitive advantage possible and handing it back to the government and other competing firms; even if they aren't, the historical precedent is so well established and known that everyone prices it in.

bigbadfeline · about 2 hours ago
> For coding often quality at the margin is crucial even at a premium

That's a cryptic way to say "Only for vibe-coding quality at the margin matters". Obviously, quality is determined first and foremost by the skills of the human operating the LLM.

> No one really believes at face value a Chinese hosted firm isn’t mass trolling every competitive advantage possible

That's much easier to believe than the same but applied to a huge global corp that operates in your own market and has both the power and the desire to eat your market share for breakfast, before the markets open, so "growth" can be reported the same day.

Besides, open models are hosted by many small providers in the US too, you don't have to use foreign providers per se.

fnordpiglet · about 1 hour ago
1) Model provider choices don't obviate the need to make other good choices.

2) I think there is a special case for Chinese providers due to the philosophical differences in what constitutes fair markets; the regulatory and civil legal structure outside China generally makes such things existentially dangerous to do, so while it might happen, it is extraordinarily ill advised, while in China it is implicitly the way things work. However, my point is that Alibaba has their own hosted versions of Qwen models operating on the frontier that are, at minimum, hosted exclusively before being released. There's no reason to believe they won't at some point exclusively host some frontier or fine-tuned variants for commercial reasons. This is part of why they had recent turnover.

rmacqueen · 16 minutes ago
> This is why the marginal difference between your median engineer and your P99 engineer is comp is substantial, while the marginal comp difference between your median pick and packer vs your P99 pick and packer isn’t.

That's an interesting analogy.

donmcronald · about 2 hours ago
Given the very limited experience I have where I've been trying out a few different models, the quality of the context I can build seems to be much more of an issue than the model itself.

If I build a super high quality context for something I'm really good at, I can get great results. If I'm trying to learn something new and have it help me, it's very hit and miss. I can see where the frontier models would be useful for the latter, but they don't seem to make as much difference for the former, at least in my experience.

The biggest issue I have is that if I don't know a topic, my inquiries seem to poison the context. For some reason, my questions are treated like fact. I've also seen the same behavior with Claude getting information from the web. Specifically, I had it take a question about a possible workaround from a bug report and present it as a de-facto solution to my problem. I'm talking disconnect a remote site from the internet levels of wrong.

From what I've seen, I think the future value is in context engineering. I think the value is going to come from systems and tools that let experts "train" a context, which is really just a search problem IMO, and a marketplace or standard for sharing that context building knowledge.

The cynic in me thinks that things like cornering the RAM market are more about depriving everyone else than needing the resources. Whoever usurps the most high quality context from those P99 engineers is going to have a better product because they have better inputs. They don't want to let anyone catch up because the whole thing has properties similar to network effects. The "best" model, even if it's really just the best tooling and context engineering, is going to attract the best users which will improve the model.

It makes me wonder if the self-reinforced learning is really just context theft.

rohansood15 · about 2 hours ago
Most code is not P99 though.

Also, have you considered that your trust in Anthropic and distrust in China may not be shared by many outside the US? There's a reason why Huawei is the largest supplier of 5G hardware globally.

runjakeabout 2 hours ago
You're right, but perspective is important, and that's because China and the US are engaged in economic warfare (even before the current US regime), vying for the dubious title of "superpower".
fnordpigletabout 1 hour ago
I find it hard to believe anyone who has ever done business inside China doesn’t know that the structure of Chinese business is built around massive IP theft and repurposing on a statewide, systematic level. It’s not a nationalism point, it’s an objective and easily verified truth.

Most code is not P99, but companies pay a premium to produce code that is. That’s my point.

AJ007about 2 hours ago
Not sure how your last point matters if a 27B model can run on consumer hardware, or be hosted by any company the user could certainly trust more than Anthropic.

OpenAI & Anthropic are just lying to everyone right now because if they can't raise enough money they are dead. Intelligence is a commodity, the semiconductor supply chain is not.

datadrivenangelabout 1 hour ago
The challenge is token speed. I did some local coding yesterday with qwen3.6 35b, and getting 10-40 tokens per second means the wall time is much longer. 20 tokens per second is a bit over a thousand tokens per minute, which is slower than the experience you get with Claude Code or the Opus models.

Slower and worse is still useful, but not as good in two important dimensions.
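As a sanity check on the tok/s-to-wall-time math, a quick sketch (the response length is an illustrative assumption, not a figure from the thread):

```shell
# Wall time to generate N tokens at a given decode speed.
tps=20             # tokens per second, from the comment
tokens=4000        # assumed length of a long agent response
secs=$(( tokens / tps ))
echo "${secs}s = $(( secs / 60 ))min $(( secs % 60 ))s"
```

At 20 tok/s, a 4,000-token response alone takes over three minutes of wall time, before any tool calls or retries.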

otabdeveloper4about 2 hours ago
> For coding often quality at the margin is crucial even at a premium.

For coding, quality is not measurable and is based entirely on feels (er, sorry, "vibes").

Employers paying for SOTA models is nothing but a lifestyle status perk for employees, like ping-pong tables or fancy lunch snacks.

fnordpiglet42 minutes ago
I’m building my own company and I consider model choice crucial to my marginal ability to produce a higher quality product I don’t regret having built. Every higher end dev shop I’ve worked at over the last few years perceives things the same way. There are measurable outcomes from software built well and software built poorly, even if the code itself isn’t easily measurable. I would rather pay a few thousand more per year for a better overall outcome with less developer struggle against bad model decisions than end up with an inferior end product and have expensive developers spinning their wheels contending with a dumb-as-a-brick model. But everyone’s career experiences are different, and I’d feel sad to work at a place where SOTA is a lifestyle choice rather than a rational engineering and business choice.
Aurornisabout 3 hours ago
I use Opus and the Qwen models. The gap between them is much larger than the benchmark charts show.

If you want to compare to a hosted model, look toward the GLM hosted model. It’s closest to the big players right now. They were selling it at very low prices but have started raising the price recently.

mchusmaabout 2 hours ago
I like both GLM and Kimi 2.6, but honestly for me they don't have quite the cost advantage that I would like, partly because they use more tokens, so they end up being maybe Sonnet-level intelligence at Haiku-level cost. Good, but not quite as extreme as some people would make them out to be. For my use cases, running the much cheaper Gemma 4 for things where I don't need max intelligence, and running Sonnet or Opus for things where I need the intelligence and can't really make the trade-off, has been generally good, and it just doesn't seem worth it to cost-cut a little bit. Plus when you combine prompt caching and sub-agents using Gemma 4, the costs to run Sonnet or even Opus are not that extreme.

For coding, the $200/month plan is such a good value from Anthropic that it's not even worth considering anything else, except for uptime issues.

But competition is great. I hope to see Anthropic put out a competitor in the 1/3 to 1/5 of Haiku's pricing range, bump Haiku's performance closer to Sonnet level, and close the gap here.

syntaxingabout 2 hours ago
Yes and no. Are you using OpenRouter or local? Are the models as good as Opus? No. But 99% of the time, local models are terrible because of user error. This is especially true for MoE: even though perplexity only drops minimally at Q4 with q4_0 for the KV cache, the models get noticeably worse.
acidtechno303about 2 hours ago
Sounds like you're accusing a professional of holding their tool incorrectly. Not impossible, but not likely either.
Frannkyabout 3 hours ago
If these results are because of vampire attacks, the results will stop being so good once the closed labs figure out how to pollute the answers being sucked out of them.

Also, they are not exactly as good when you use them in your daily flow; maybe for shallow reasoning, but not for coding and more difficult stuff. Or at least I haven't found an open one as good as the closed ones. I would love to, so if you have some cool settings, please share.

jstummbilligabout 1 hour ago
> yet below than OpenAI/Anthropic

This is the competitive advantage. Being better.

mmmoreabout 2 hours ago
The token prices being high for Opus undermines your argument, because it shows people are willing to pay more for the model.

The thing is the new OpenAI/Anthropic models are noticeably better than open source. Open source is not unusable, but the frontier is definitely better and likely will remain so. With SWE time costing over $1/min, if a convo costs me $10 but saves me 10 minutes it's probably worth it. And with code, often the time saved by marginally better quality is significant.
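The break-even arithmetic in that last paragraph, written out explicitly (all figures are the comment's own):

```shell
# Is a pricier convo worth it? Compare dev time saved against convo cost.
rate_per_min=1     # SWE time, dollars per minute (from the comment)
minutes_saved=10
convo_cost=10      # dollars per conversation
value=$(( minutes_saved * rate_per_min ))
if [ "$value" -ge "$convo_cost" ]; then
    echo "worth it: \$${value} of time saved >= \$${convo_cost} spent"
fi
```

The margin only widens as the convo cost drops or the dev time saved grows.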

2001zhaozhao33 minutes ago
I'm kind of interested in a setup where one buys local hardware specifically to run a crap ton of small-to-medium LLMs locally 24/7 at high throughput. These models might now be smart enough to make all kinds of autonomous agent workflows viable at a cheap price, with a good queue prioritization system for queries to fully utilize the hardware.
sietsietnoacabout 3 hours ago
Generate an SVG of a pelican riding a bicycle: https://codepen.io/chdskndyq11546/pen/yyaWGJx

Generate an SVG of a dragon eating a hotdog while driving a car: https://codepen.io/chdskndyq11546/pen/xbENmgK

Far from perfect, but it really shows how powerful these models can get

tlnabout 3 hours ago
The dragon image has issues like one eye, weird tail etc, but the pelican is imo perfect -- the best I've seen!
vunderbaabout 1 hour ago
Yeah the dragon one is just a complete mess. The car is sideways but the WHEEL is oriented in a first-person perspective.

Seems like a case of overfitting with regard to the thousands of pelican bike SVG samples on the internet already.

yrds96about 2 hours ago
I wonder if this became a so well known "benchmark" that models already got trained for it.
HotHotLavaabout 2 hours ago
Given that the pelican looks way better than the dragon, it almost seems like a certainty.
sietsietnoacabout 2 hours ago
Given the likeness of the sky between the 2 examples, the overall similarities, and the fact that the pelican is so well done, there is zero doubt that the benchmark is in the training data of these models by now

That doesn't make it any less of an achievement given the model size or the time it took to get the results

If anything, it shows there's still much to discover in this field and things to improve upon, which is really interesting to watch unfold

Marciplanabout 2 hours ago
every model release Simon comes with his Pelican and then this comment follows.

Can we stop both? its so boring

refulgentisabout 2 hours ago
I really appreciate you speaking up. Happened yesterday on GPT Image 2, bit my tongue b/c people would see it as fun policing, and same thing today. And it happens on every. single. LLM. release. thread.

It's disruptive to the commons, doesn't add anything to knowledge of a model at this point, and it's way out of hand when people are not only engaging with the original and creating screenfuls to wade through before on-topic content, but now people are creating the thread before it exists to pattern-match on the engagement they see for the real thing. So now we have 2x.

xrd10 minutes ago
I'm experimenting with this on my RTX 3090 and opencode. It is pretty impressive so far.
vibe42about 3 hours ago
Q4-Q5 quants of this model run well on gaming laptops with 24GB VRAM and 64GB RAM. You can get one of those for around $3,500.

Interesting pros/cons vs the new Macbook Pros depending on your prefs.

And Linux runs better than ever on such machines.

doixabout 3 hours ago
What laptop has that much VRAM and RAM for $3500 with good/okay-ish Linux support? I was looking to upgrade my asus zephyrus g14 from 2021 and things were looking very expensive. Decided to just keep it chugging along for another year.

Then again, I was looking in the UK, maybe prices are extra inflated there.

kroatonabout 3 hours ago
A3B-35B is better suited for laptops with enough VRAM/RAM. This dense model however will be bandwidth limited on most cards.

The mobile RTX 5090 sits at 896GB/s, as opposed to the 1.8TB/s of the desktop 5090, and most mobile chips have far smaller bandwidth than that, so speeds won't be incredible across the board like with desktop computers.
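A rough way to see why bandwidth dominates: decoding a dense model reads every weight once per token, so bandwidth divided by model size gives a hard ceiling on tok/s. The model size below is an assumed Q4 figure, not an exact one:

```shell
# Theoretical decode ceiling for a dense model: tok/s <= bandwidth / bytes per token.
bw_gbs=896         # mobile RTX 5090 memory bandwidth, GB/s (from the comment)
model_gb=17        # ~27B dense at Q4 incl. overhead (assumption)
echo "ceiling: ~$(( bw_gbs / model_gb )) tok/s"
```

Real throughput lands well under this ceiling once attention, KV cache reads, and scheduling overhead are included, which is roughly consistent with the 25-45 tok/s numbers reported elsewhere in this thread.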

jadboxabout 3 hours ago
I find A3B-35B an ideal model for small local projects, definitely the best for me so far
mark_l_watsonabout 1 hour ago
I have been running the slightly larger 31B model for local coding:

ollama launch claude --model qwen3.6:35b-a3b-nvfp4

This has been optimized for Apple Silicon and runs well on a 32GB RAM system. Local models are getting better!

yougotwillabout 1 hour ago
Can I ask how much RAM of the 32GB does it use? For example can I run a browser and VS Code at the same time?
vladgurabout 4 hours ago
This is getting very close to fit a single 3090 with 24gb VRAM :)
skiing_crawling13 minutes ago
I used to run qwen3.5 27b Q4_k_M on a single 3090 with these llama-server flags successfully: `-ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0`
originalvichyabout 4 hours ago
Yup! Smaller quants will fit within 24GB but they might sacrifice context length.

I’m excited to try out the MLX version to see if 32GB of memory from a Pro M-series Mac can get some acceptable tok/s with longer context. HuggingFace has uploaded some MLX versions already.

donmcronaldabout 3 hours ago
I have a Mac Mini M4 Pro with 64GB of RAM and 273GB/s memory bandwidth, and it's borderline with 3.5-27B. I assume this one is the same. I don't know a ton, but I think it's the memory bandwidth that limits it. It's similar on a DGX Spark I have access to (almost the same memory bandwidth).

It's been a while since I tried it, but I think I was getting around 12-15 tokens per second, and that feels slow when you're used to the big commercial models. Whenever I actually want to do stuff with the open source models, I always find myself falling back to OpenRouter.

I tried Intel/Qwen3.6-35B-A3B-int4-AutoRound on a DGX Spark a couple days ago and that felt usable speed wise. I don't know about quality, but that's like running a 3B parameter model. 27B is a lot slower.

I'm not sure if I "get" the local AI stuff everyone is selling. I love the idea of it, but what's the point of 128GB of shared memory on a DGX Spark if I can only run a 20-30GB model before the slow speed makes it unusable?

ycui1986about 3 hours ago
32GB of RAM on a Mac also needs to host the OS, software, and other stuff. There may not even be 24GB of VRAM left for the model.
GaggiXabout 3 hours ago
At 4-bit quantization it should already fit quite nicely.
Aurornisabout 3 hours ago
Unfortunately not with a reasonable context length.
kkzz99about 3 hours ago
It really depends on what you think a reasonable context length is, but I can get 50k-60k on a 4090.
GaggiXabout 2 hours ago
The model uses Gated DeltaNet and Gated Attention so the memory usage of the KV cache is very low, even at BF16 precision.
butzabout 2 hours ago
Are there any "optimized" models, that have lesser hardware requirements and are specialised in single programming language, e.g. C# ?
zargonabout 1 hour ago
LLMs need diverse and extensive training data to be good at a specific thing. We don't (yet?) know how to train a small model that is really good at one programming language. Just big models that are good at a variety of languages (plus lots of other things).
Abby_101about 1 hour ago
Sort of - there's Qwen3-Coder and the Codestral family, but those are still multi-language, just code-focused. For truly single-language specialization, the practical path is fine-tuning an existing base model on a narrow distribution rather than training from scratch.

The issue with C# specifically is dataset availability. Open source C# code on GitHub is a fraction of Python/JS, and Microsoft hasn't released a public corpus the way Meta has for their code models. You'd probably get further fine-tuning Qwen3-Coder (or a similar base) on your specific codebase with LoRA than waiting for a dedicated C#-only model to appear.

originalvichyabout 4 hours ago
Good news!

Friendly reminder: wait a couple weeks to judge the ”final” quality of these free models. Many of them suffer from hidden bugs when connected to an inference backend or bad configs that slow them down. The dev community usually takes a week or two to find the most glaring issues. Some of them may require patches to tools like llama.cpp, and some require users to avoid specific default options.

Gemma 4 had some issues that were ironed out within a week or two. This model is likely no different. Take initial impressions with a grain of salt.

jjcmabout 3 hours ago
This is probably less likely with this model, as it’s almost certainly a further RL training continuation of 3.5 27b. The bugs with this architecture were worked out when that dropped.
originalvichyabout 3 hours ago
Valuable note!
Aurornisabout 3 hours ago
Good advice for all new LLM experimenters.

The bugs come from the downstream implementations and quantizations (which inherit bugs in the tools).

Expect to update your tools and redownload the quants multiple times over 2-4 weeks. There is a mad rush to be first to release quants and first to submit PRs to the popular tools, but the output is often not tested much before uploading.

If you experiment with these on launch week, you are the tester. :)

UncleOxidantabout 3 hours ago
I've been waiting for this one. I've been using 3.5-27b with pretty good success for coding in C,C++ and Verilog. It's definitely helped in the light of less Claude availability on the Pro plan now. If their benchmarks are right then the improvement over 3.5 should mean I'm going to be using Claude even less.
amunozoabout 5 hours ago
A bit skeptical about a 27B model comparable to opus...
originalvichyabout 4 hours ago
For at least a year now, it has been clear that data quality and fine-tuning are the main sources of improvement for medium-sized models. Size != quality for specialized, narrow use cases such as coding.

It’s not a surprise that models are leapfrogging each other when the engineers are able to incorporate better code examples and reasoning traces, which in turn bring higher quality outputs.

cbg0about 3 hours ago
If all you're looking at is benchmarks that might be true, but those are way too easy to game. Try using this model alongside Opus for some work in Rust/C++ and it'll be night and day. You really can't compare a model that's got trillions of parameters to a 27B one.
otabdeveloper4about 2 hours ago
> ...and it'll be night and day.

That's just, like, your opinion, man.

> You really can't compare a model that's got trillions of parameters to a 27B one.

Parameter count doesn't matter much when coding. You don't need in-depth general knowledge or multilingual support in a coding model.

rubiquityabout 3 hours ago
You should try it out. I'm incredibly impressed with Qwen 3.5 27B for systems programming work. I use Opus and Sonnet at work and Qwen 3.x at home for fun and barely notice a difference given that systems programming work needs careful guidance for any model currently. I don't try to one shot landing pages or whatever.
bityardabout 2 hours ago
Are you using the same agent/harness/whatever for both Claude and Qwen, or something different for each one?
rubiquityabout 2 hours ago
I use Pi at home and Claude Code at work (no choice). I use bone stock Pi. No extensions.
Aurornisabout 3 hours ago
You should be skeptical. Benchmark racing is the current meta game in open weight LLMs.

Every release is accompanied by claims of being as good as Sonnet or Opus, but when I try them (even hosted full weights) they’re far from it.

Impressive for the size, though!

jjcmabout 3 hours ago
Opus 4.5 mind you, but I’m not too surprised given how good 3.5 was and how good the qwopus fine tune was. The model was shown to benefit heavily from further RL.
esafakabout 4 hours ago
Some of these benchmarks are supposedly easy to game. Which ones should we pay attention to?
NitpickLawyerabout 2 hours ago
SWE-REbench should not be gameable. They collect new issues from live repos, and if you check 1-2 months after a model was released, you can get an idea. But even that would be "benchmaxxable", which is an overloaded term that can mean many things, but the most vanilla interpretation is that with RL you can get a model to follow a certain task pretty well, but it'll get "stuck" on that task type, or "stubborn" when asked similar but sufficiently different tasks. So for SWE-REbench that would be "it fixes bugs in these types of repos, under this harness, but ask it to do something else in a repo and you might not get the same results". In a nutshell.
underlinesabout 4 hours ago
well, your own, unleaked ones, representing your real workloads.

if you can't afford to do that, look at a lot of them, e.g. on artificialanalysis.ai they merge multiple benchmarks across weighted categories and build an Intelligence Score, a Coding Score, and an Agentic Score.

cbg0about 2 hours ago
None. Try them out with your own typical tasks to see the performance.
WarmWashabout 3 hours ago
ARC-AGI 2

GLM 5 scores 5% on the semi-private set, compared to SOTA models which hover around 80%.

cmrdporcupineabout 3 hours ago
A small model can be made to be "comparable to Opus" in some narrow domains, and that's what they've done here.

But when actually employed to write code they will fall over when they leave that specific domain.

Basically they might have skill but lack wisdom. Certainly at this size they will lack anywhere close to the same contextual knowledge.

Still these things could be useful in the context of more specialized tooling, or in a harness that heavily prompts in the right direction, or as a subagent for a "wiser" larger model that directs all the planning and reviews results.

wesammikhailabout 4 hours ago
you'd be surprised how good small models have gotten. Size of the model isnt all that matters.
freedombenabout 4 hours ago
Plus you can control thinking time a lot more, so when Anthropic lobotomizes Opus on you...
verdvermabout 4 hours ago
My experience with qwen-3.6:35B-A3B reinforces this, gonna give this a spin when unsloth has quants available

Gemini flash was just as good as pro for most tasks with good prompts, tools, and context. Gemma 4 was nearly as good as flash and Qwen 3.6 appears to be even better.

cassianolealabout 4 hours ago
> when unsloth has quants available

https://huggingface.co/unsloth/Qwen3.6-27B-GGUF

dudefelicianoabout 3 hours ago
> Size of the model isnt all that matters.

What matters is the motion in the tokens

jedisct144 minutes ago
I really like local models for code reviews / security audits.

Even if they don't run super fast, I can let them work overnight and get comprehensive reports in the morning.

I used Qwen3.6-27B on an M5 (oq8, using omlx) and Swival (https://swival.dev) /audit command on small code bases I use for benchmarking models for security audits.

It found 8 out of 10, which is excellent for a local model, produced valid patches, and didn't report any false positives, which is even better.

pamaabout 4 hours ago
Has anyone tested it at home yet and wants to share early impressions?
lreevesabout 4 hours ago
I have been kicking the tires for about 40 minutes since it downloaded and it seems excellent at general tasks, image comprehension and coding/tool-calling (using VLLM to serve it). I think it squeaks past Gemma4 but it's hard to tell yet.
alfonsodevabout 4 hours ago
good to hear! Do you mind sharing your setup and tokens / seconds performance ?
lreevesabout 3 hours ago
I'm running the unquantized base model on 2xA6000s (Ampere gen, 48GB each). Runs at about 25 tokens/second.
LowLevelKernelabout 1 hour ago
How much VRAM is needed?
awestrokeabout 1 hour ago
Roughly 27 (the parameter count in billions) multiplied by the quant's bytes per weight, plus room for context
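Spelled out as a back-of-envelope script (the quant bits and overhead allowance are assumptions, not exact numbers):

```shell
# VRAM estimate: weights ≈ params × bits / 8, plus KV cache + runtime overhead.
params_b=27        # billions of parameters
bits=4             # a Q4-ish quant is ~4 bits per weight (assumption)
overhead_gb=4      # rough allowance for KV cache + runtime; grows with context
weights_gb=$(( params_b * bits / 8 ))
echo "weights: ~${weights_gb} GB, total: ~$(( weights_gb + overhead_gb )) GB"
```

Which is why a Q4 quant of this model is borderline on a 24GB card but comfortable on 32GB, matching the reports elsewhere in the thread.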
genpfaultabout 3 hours ago
Getting ~36-33 tok/s (see the "S_TG t/s" column) on a 24GB Radeon RX 7900 XTX using llama.cpp's Vulkan backend:

    $ llama-server --version
    version: 8851 (e365e658f)

    $ llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    1.529 |   654.11 |    3.470 |    36.89 |    4.999 |   225.67 |
    |  2000 |    128 |    1 |   2128 |    3.064 |   652.75 |    3.498 |    36.59 |    6.562 |   324.30 |
    |  4000 |    128 |    1 |   4128 |    6.180 |   647.29 |    3.535 |    36.21 |    9.715 |   424.92 |
    |  8000 |    128 |    1 |   8128 |   12.477 |   641.16 |    3.582 |    35.73 |   16.059 |   506.12 |
    | 16000 |    128 |    1 |  16128 |   25.849 |   618.98 |    3.667 |    34.91 |   29.516 |   546.42 |
    | 32000 |    128 |    1 |  32128 |   57.201 |   559.43 |    3.825 |    33.47 |   61.026 |   526.47 |
johndoughabout 2 hours ago
Getting ~44-40 tok/s on 24GB RTX 3090 (llama.cpp version 8884, same llama-batched-bench call):

    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    0.684 |  1462.61 |    2.869 |    44.61 |    3.553 |   317.47 |
    |  2000 |    128 |    1 |   2128 |    1.390 |  1438.84 |    2.868 |    44.64 |    4.258 |   499.80 |
    |  4000 |    128 |    1 |   4128 |    2.791 |  1433.18 |    2.886 |    44.35 |    5.677 |   727.11 |
    |  8000 |    128 |    1 |   8128 |    5.646 |  1416.98 |    2.922 |    43.80 |    8.568 |   948.65 |
    | 16000 |    128 |    1 |  16128 |   11.851 |  1350.10 |    3.007 |    42.57 |   14.857 |  1085.51 |
    | 32000 |    128 |    1 |  32128 |   25.855 |  1237.66 |    3.168 |    40.40 |   29.024 |  1106.96 |
Edit: Model gets stuck in infinite loops at this quantization level. I've also tried Q5_K_M quantization (fits up to 51968 context length), which seems more robust.
cpburns2009about 1 hour ago
~25-26 tok/s with ROCm using the same card, llama.cpp b8884:

    $ llama-batched-bench -dev ROCm1 -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    1.034 |   966.90 |    4.851 |    26.39 |    5.885 |   191.67 |
    |  2000 |    128 |    1 |   2128 |    2.104 |   950.38 |    4.853 |    26.38 |    6.957 |   305.86 |
    |  4000 |    128 |    1 |   4128 |    4.269 |   937.00 |    4.876 |    26.25 |    9.145 |   451.40 |
    |  8000 |    128 |    1 |   8128 |    8.962 |   892.69 |    4.912 |    26.06 |   13.873 |   585.88 |
    | 16000 |    128 |    1 |  16128 |   19.673 |   813.31 |    4.996 |    25.62 |   24.669 |   653.78 |
    | 32000 |    128 |    1 |  32128 |   46.304 |   691.09 |    5.122 |    24.99 |   51.426 |   624.75 |
GrinningFoolabout 3 hours ago
128GB (112 GB avail) Strix AI 395+ Radeon 8060x (gfx1151)

llama-* version 8889 w/ rocm support ; nightly rocm

llama.cpp/build/bin/llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000

    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    2.776 |   360.22 |   20.192 |     6.34 |   22.968 |    49.11 |
    |  2000 |    128 |    1 |   2128 |    5.778 |   346.12 |   20.211 |     6.33 |   25.990 |    81.88 |
    |  4000 |    128 |    1 |   4128 |   11.723 |   341.22 |   20.291 |     6.31 |   32.013 |   128.95 |
    |  8000 |    128 |    1 |   8128 |   24.223 |   330.26 |   20.399 |     6.27 |   44.622 |   182.15 |
    | 16000 |    128 |    1 |  16128 |   52.521 |   304.64 |   20.669 |     6.19 |   73.190 |   220.36 |
    | 32000 |    128 |    1 |  32128 |  120.333 |   265.93 |   21.244 |     6.03 |  141.577 |   226.93 |
More directly comparable to the results posted by genpfault (IQ4_XS):

llama.cpp/build/bin/llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000

    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    2.543 |   393.23 |    9.829 |    13.02 |   12.372 |    91.17 |
    |  2000 |    128 |    1 |   2128 |    5.400 |   370.36 |    9.891 |    12.94 |   15.291 |   139.17 |
    |  4000 |    128 |    1 |   4128 |   10.950 |   365.30 |    9.972 |    12.84 |   20.922 |   197.31 |
    |  8000 |    128 |    1 |   8128 |   22.762 |   351.46 |   10.118 |    12.65 |   32.880 |   247.20 |
    | 16000 |    128 |    1 |  16128 |   49.386 |   323.98 |   10.387 |    12.32 |   59.773 |   269.82 |
    | 32000 |    128 |    1 |  32128 |  114.218 |   280.16 |   10.950 |    11.69 |  125.169 |   256.68 |
cpburns200929 minutes ago
Results are nearly identical running on a Strix Halo using Vulkan, llama.cpp b8884:

    $ llama-batched-bench -dev Vulkan2 -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    3.288 |   304.15 |    9.873 |    12.96 |   13.161 |    85.71 |
    |  2000 |    128 |    1 |   2128 |    6.415 |   311.79 |    9.883 |    12.95 |   16.297 |   130.57 |
    |  4000 |    128 |    1 |   4128 |   13.113 |   305.04 |    9.979 |    12.83 |   23.092 |   178.76 |
    |  8000 |    128 |    1 |   8128 |   27.491 |   291.01 |   10.155 |    12.61 |   37.645 |   215.91 |
    | 16000 |    128 |    1 |  16128 |   59.079 |   270.83 |   10.476 |    12.22 |   69.555 |   231.87 |
    | 32000 |    128 |    1 |  32128 |  148.625 |   215.31 |   11.084 |    11.55 |  159.709 |   201.17 |
amstanabout 2 hours ago
you should try vulkan instead of rocm. it goes like 20% faster.
cpburns200928 minutes ago
For this model results are identical. In my experience it can go either way by up to 10%.
MrDrMcCoyabout 1 hour ago
Is that based on recent experience? With "stable" ROCm, or the (IMHO better) releases from TheRock? With older or more recent hardware? The AMD landscape is rather uneven.
endymi0nabout 3 hours ago
at this trajectory, unsloth are going to release the models BEFORE the model drop within the next weeks...
danielhanchenabout 3 hours ago
Haha :)
cpburns200926 minutes ago
Do you get early access so you can prep the quants for release?
Mr_Eri_Atlovabout 3 hours ago
Excited to try this, the Qwen 3.6 MoE they just released a week or so back had a noticeable performance bump from 3.5 in a rather short period of time.

For anyone invested in running LLMs at home or on a much more modest budget rig for corporate purposes, Gemma 4 and Qwen 3.6 are some of the most promising models available.