DE version is available. Content is displayed in original English for accuracy.
Advertisement
Advertisement
⚡ Community Insights
Discussion Sentiment
75% Positive
Analyzed from 4989 words in the discussion.
Trending Topics
#model#models#qwen#more#https#max#using#pro#open#unsloth

Discussion (200 Comments)Read Original on HackerNews
Note that a perfect "non-hallucination rate" is rather meaningless as such tests can contain human hallucinations.
It means the model aligns with the possibly-true, possibly-false beliefs of the group that made the test.
https://artificialanalysis.ai/evaluations/omniscience?models...
(had to add it to the chart, wasn't displayed by default. is it the lowest rate in the datasetor no?)
So I feel like that's exactly the right metric and the way to track it wrt hallucinations.
This is not an open model
I tried the qwen3.6-27b Q6_k GUFF in llama.cpp and LM Studio on my M2 MacBook Pro 32GB machine last week, and I barely get a token a second with either.
What sort of speed should I be expecting?
I tried some of the Llama 3 34b (nous-capybara?) models two years ago with llama.cpp, and I seem to remember getting a few tokens a second then, so not sure if I've got something completely mis-configured, or I just have unreasonable expectations.
Or maybe qwen 3.x is slower for some reason? (Is it mixture of experts?)
I'm not expecting it to be instant, but what I'm currently seeing is not really usable.
- A 27B "dense" model
- A 35B "Mixture of Experts" model, which activates only 3B parameters for each token.
For your hardware, I strongly recommend `unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M`. I have an M1 Max with 32GB VRAM from 2021 that can read at ~300-500 tokens/sec and write at ~30 tokens/sec with llama-cpp's default settings, which is plenty fast. The 27B model can read ~70tok/sec and write ~5tok/sec.
The 35B MoE model technically takes slightly more memory but is much faster because it's doing 1/9th the work. It's not quite as "smart", but it's comparable.
Recommend https://www.reddit.com/r/LocalLLaMA/ as a great source for this type of discussion.
Obviously bigger != better but I don't know what the differences are.
On testing I've done on same-quant apples to apples, with F16/F16 (ie, unquantized) kv cache, 35B-A3B underperforms against 27B on anything even remotely complex. But yes, 35B-A3B can be like 3-4x faster on my hardware.
By Qwen's own admission, on any meaningful benchmark (ie, ones that involve logic, math, or tool calling), 27B performs like 122B-10B and 397B-A17B, but 35B-A3B is somewhere between 27B dense and 9B dense.
Also, MTP recently got merged in, so I'd suggest downloading Qwen 3.6 MTP (I assume you get it from unsloth) and updating your copy of llama.cpp, and adding `--spec-type draft-mtp --spec-draft-n-max 2` to your arguments.
https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/ https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/
Also, I recommend not quantizing kv cache, and if you do, only quantize v. Lowering model quant while also lowering context size to fit F16/F16 or F16/Q8_0 massively improves model performance for thinking models. Also, quantizing cache, either k or v, decreases speed by a lot on some hardware.
I have a 24gb 7900xtx, so I can fit >32k F16/F16 context with Qwen3.6-27B, but use unsloth's Q3_K_XL. This performs better than Q(4,5,6)_K_XL with v quantized.
(not sure of exact diagnosis/fix, but definitely look in that direction if you're still having this issue when you give it another shot)
Also, there are two stages - prompt processing, and token generation. Prompt processing is notoriously slow on Apple Silicon unfortunately. If you have large context (which includes system prompts, lots of tools loaded by a harness like Claude Code, OpenCode, etc) it can take minutes for prompt processing before you see the first output token. On the bright side, the tokens are cached between turns, so subsequent turns won't be so bad.
EDIT: I run with context wired at 64K
For comparison, I just ran a couple of quick benchmarks (default settings) with llama-bench:
Qwen3.6-35B-A3B at Q6_K_XL gave 858 t/s pp512 (prompt processing) and 43 t/s tg128 (token generation).
Qwen3.6-27B at Q4_K_XL gave 103 t/s pp512 and 8 t/s tg128.
That's the dense model, you probably want a mixture-of-experts (MoE) one.
Here's what you probably want instead: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
For agent harnesses, opencode is okay, as is pi or even Zed's built in agent panel. Claude code "works" with ANTHROPIC_BASE_URL=http://localhost:8080/v1, but is very chatty (the default system prompt burns 20k tokens). Crush (from the charm-bracelet folks) is particularly nice when starting out. I've personally converged on pi-agent under an otherwise-mostly-default setup. You can ask qwen to customize pi or write you an extension which helps a little.
You'll need to add `http://localhost:8080/v1` as an OpenAI-compatible model provider in your coding harness with any API key (doesn't matter) and any model identifier (doesn't matter with llama-cpp).
Note that pi doesn't have permissions. Everything is permitted. The hundred hungry ghosts you've trapped in a jar WILL find a way to delete your home folder someday. That's what Man gets for summoning demons without casting a circle of protection first. Flying too close to the sun etc etc etc
Take backups and then go have fun. Hope this helps.
Maybe that's underselling it. It is quite a good model and might end up replacing a lot of the work I was sending to Sonnet 4.6.
Also, Sonnet 4.6 is almost certain a much bigger model so the performance differences aren't unexpected.
Totally understand why it may not be reasonable or in their best interest (and that the US is _absolutely_ not doing the same reflexively). But it would be lovely to be able to try these out on production workloads in earnest.
In an ideal world U.S. residents would use Chinese AI models and Chinese residents would use U.S. AI models.
Governments in both countries are collecting data for nefarious reasons. But the Chinese government has far less influence on a U.S. resident and vice versa.
We are all better off if our data is collected by a government halfway across the world instead of our own governments which hold incredible amounts of power over us.
If you use a service outside your country, I believe you could have all your code stolen and get hacked/exploited in a way that would be totally legal.
On the other hand, there's other models where the source is 100% open, the training data is known, and people have reproduced the same model from scratch, so while those trail behind, there's definitely an effort to make models more open and capable.
Sure, that is until each government's dataset is interesting enough to the other to facilitate a data-sharing agreement.
There's gotta be an internet "law" that says something like "Eventually, the data you volunteer to a benign 3rd party eventually winds up being used against you by someone". This is short-term thinking at it's finest.
China has more integration between intelligence and industry than many western countries, and it does present a higher risk of unwanted “tech transfer” to industry than running on oracle or Google or ms or Amazon does in the US.
DHS has long staffed full time agents in California to deal with foreign IP exfiltration - using qwen is like fast/easy mode for IP exfiltration: why make anyone get a job in your palo alto office when you can just send it to them in Hanzhou?
Upshot - If you have something proprietary you’re working on I would generally advise not to just direct send it to Alibaba.
This made me think of a Seinfeld episode: "I didn't know it was possible not to know that."
Even if they weren’t individually worried about their proprietary data being shared with Chinese domestic competitors or with government… their audit / security programs likely wouldn’t allow it for a _huge_ range of types of data.
That's exactly the fear, and why would it not be logistically feasible? The threat is definitely a bit overhyped, but China has a longstanding track record of aggressive corporate espionage.
(As a reference, DeepSeek v4 is severely throttled on these proxy services.)
I've created a 2.54BPW quant that fit on my hardware with 128k context, 20 tps tg and 200tps pp, while maintaining high scores on many benchmarks: https://huggingface.co/tarruda/Qwen3.5-397B-A17B-GGUF/discus...
Oddly enough, though, Qwen 3.6 35B A3B and Gemma got some really good reviews, despite being way smaller than any of these ones.
Qwen 3.5, 122B A10B: https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF
Qwen Coder Next, 80B A3B: https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF
It's kinda weird that DeepSeek V4 Flash is supposed to be 284B A13B, but shows up as 158B in HuggingFace, probably some weird bug: https://huggingface.co/unsloth/DeepSeek-V4-Flash and that's not even just Unsloth but like the official source too https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash (so also doesn't fit the category unless you get a heavily quantized version to run, but cool regardless)
Mistral Medium 3.5 is interesting because it's 128B but dense, so probably too slow for most folks: https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF
GPT-OSS, 120B A5B: https://huggingface.co/unsloth/gpt-oss-120b-GGUF
I’m on an M1 Max with 32GB VRAM, so I’m looking forward to the 27B or 35B-A3B models. Is dropping $5k for an RTX 6000 or a DGX Spark really the best option?
- Your RTX 6000 is closer to $10k now
- Sparks are creeping into the $4-5k range
- AMD Strix are ~3.5k
- Apple depends on chipset and memory. Sweet spot would be 128gb M3 Ultra, probably $6-8k but admittedly haven't been tracking closely. New M5 might come in the fall. You can get a new 128gb M5 Max laptop for ~5-6k today.
- a 4x3090 rig would take $5-6k
Every platform has tradeoffs, but it's mostly ecosystem, memory bandwidth, and power consumption. They're all slow. The best option is likely to rent hardware on Runpod. The RIO on self-hosting is very low unless you have a specific need or you're ok treating it as a hobby.
In October/2024 I got my Mac studio M1 ultra with 128G, IIRC it was ~$2500. With recent prices explosion, it has certainly gotten more expensive. https://frame.work/ is selling 128G strix halo mainboard for $2700, but you have to add storage and case.
Unfortunately, the prices rose on these a lot, but unevenly. Beelink GTR 9 Pro is $4400, Framework Desktop is ~$3500, for what is basically the exact same mainboard as a Bosgame M5 for $2800.
Apple's M5 Max is another attractive option. Apple silicon traditionally had great MBW and was good at TG, but struggled with PP, but the new neural engines in those GPU cores have made a big difference in a good way here.
Gorgon Halo is rumored for June announcement with Q4'26 release with basically +100 MHz clocks on Strix Halo, LPDDR5X-8533 instead of LPDDR5X-8000, but more importantly, 192 GB max instead of 128 GB.
I'd say it's better to wait for Gorgon Halo than to grab Strix Halo now. However, Medusa Halo, rumored for H2'27, is slated to have up to 26c Zen 6 (heterogeneous cores - kinds funny that AMD is heading towards these as Intel retreats from them), 48 CU of RDNA 5 instead of 40 CU RDNA 3.5, and a 384 bit bus w/ LPDDR6, which should make 256 GB at more like ~490-600 GB/s MBW, which will really make Strix and Gorgon Halo obsolete.
Also worth keeping an eye out for Serpent Lake (intel CPU + nvidia iGPU on a single board with unified memory, rumored for 2028-2029 iirc), and on the 160 GB Crescent Island Intel dGPU.
Even with LLMs, posts like this don't just fall out of a coconut tree. If you have a set of target benchmarks for your own model, then keeping "the set" of side-by-side comparable models is its own maintenance headache.
Realistically I assume they hope readers don’t notice the fine details.
The Qwen models are great for open weights but for every past release they haven’t performed as well as the benchmarks in my experience. They’re optimizing for benchmark numbers because they know it works.
The pool of people reading such articles while ignoring such details can't be big.
On Hacker News I wonder if most people even opened the article at all most times.
if they say it's 4.7 comparable, it anchors that into your head as the model to evaluate against.
Is this normal humans kicking the tires on a new model, or a few whales doing serious benchmarks?
Open-weight: Good enough for the majority of tasks, and I'm willing to spend a bit more time and effort steering towards my desired result.
4.7 is much better. But perception is a funny thing, once you think something is bad you start looking for it everywhere.
The setup I had to do was important and I had to compile koboldcpp with a few special params for my hardware, I mostly just had Claude figure it out. I don't remember everything I did now but it was very slow and would often stop mid task, it seems it was mostly a parsing issue. It made the model seem broken/dumb, but once I had all that settled I actually am able to use this how I use Claude Code. Disclaimer, I am pretty explicit with requirements, I imagine this fails more when you leave it to figure out things on its own but for my flow its pretty rad.
Currently setting it up as an automated agent now to pull Trello cards, create PRs for them, and move the card to be reviewed.
Command I am using to run: python koboldcpp.py \ --port 61514 --quiet --multiuser --gpulayers 999 --contextsize 262144 --quantkv 2 \ --usecublas normal --threads 4 --jinja --jinja_tools --jinja_kwargs '{"enable_thinking":true, "preserve_thinking":false}' \ --skiplauncher --model /data/models/Qwen3.6-27B-Q5_K_M.gguf --smartcache 5
It's very capable on almost any coding task I've thrown at it, and very good for easy-to-medium hard scripts, new code bases.
It struggles on some complex tasks in larger code bases, e.g. using to debug and fix bugs in llama.cpp it gets close to working code but often introduces errors. For such tasks its still very useful as a search/explore tool and drafting fixes.
Good balance of intelligence and speed.
I had a Google Pro account that I inherited from buying a Pixel 9 XL - it's free for a year after a flagship Pixel phone purchase. After a year they started charging for it, and i tolerated it, because Flash was usable in Antigravity for dumb auxiliary tasks that I did not want to waste GPT/Opus on. It had a separate generous quota from Gemini 3.1 Pro. Now with Flash 3.5 they combined the quotas with Pro, such that on a Google pro account you can work 4-5 hours per week in Flash. And by the way, 3.1 Pro is useless for programming, compared to Codex/Opus
I think they envision Pro plan as "just a taste of AI, enough to lure folks into the Ultra plan" but that won't work for me when Codex is half the price and DeepSeek 4 Flash is 1/10 of their price per task.
So I'll downgrade just enough to keep my Google Drive space. And use DeepSeek 4 as workhorse plus Codex or Copilot for advanced stuff.
https://marketplace.visualstudio.com/items?itemName=sst-dev....
It adds a button to VSCode to open a tab with opencode loaded. It's a bit better than just opening the CLI because it has some vscode integration.
With their $10/mo opencode go plan: https://opencode.ai/go
For my use it's about endless use of DS4 Flash on high setting. I find high better than max because it's less chatty.
The best thing is the speed. So many tokens per second.
edit: This is how it looks in action https://i.imgur.com/RNDXr07.png
> Oops! There was an issue connecting to Qwen3.6-Plus.
> Content Security Warning: The input text data may contain inappropriate content.
hey ChatGPT, how many civilians were killed in Gaza in the war since 2023?
> [one page of estimates from local and international sources with links]
https://cortecs.ai/serverlessModels
Europe's sense of superiority and actual global importance/relevance is assbackwards.
Hilarious thing to say when half this comment section is Americans giving so much of a fuck that they consider China-adjacent hosted models unusable due to the supposed risks. If what you were saying was true then those pragmatic Americans would just use whatever is most effective.
Tiananmen Square is the first place to start.
Similarly, try talking to Nemotron about Epstein and see how quickly it shuts down.
What do you mean? This is not self hosted, it's closed source. And any website that targets China or is hosted in China will probably censor Tiananmen Square.