Discussion (28 Comments)
I'm really surprised that wasn't obvious.
Also, instead of limiting the context size to something like 32k, you can offload the MoE layers to the CPU with --cpu-moe, at the cost of roughly halving token generation speed.
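A minimal llama-server invocation along those lines might look like this (the GGUF filename and context size are placeholders, not the exact setup from this thread):

    # keep the MoE expert weights on the CPU, everything else on the GPU
    llama-server -m model-Q8_0.gguf --ctx-size 65536 --cpu-moe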
I tried running the same on an M3 Max with less memory, but couldn't increase the context size enough to be useful with Opencode.
It's also easy to integrate it with Zed via ACP. For now I mostly use it for simple code-review tasks and generating small front-end code snippets.
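If it helps anyone, that hookup goes through Zed's agent_servers setting; the entry name and command path below are placeholders for whatever ACP-speaking agent binary you point it at, not a recommendation:

    // ~/.config/zed/settings.json (sketch; entry name and command path are assumptions)
    {
      "agent_servers": {
        "Local Agent": {
          "command": "/path/to/your-acp-agent",
          "args": []
        }
      }
    }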
Now both codex and opencode seem to work.
FYI the latest iteration is here: https://huggingface.co/Jackrong/Qwopus3.5-27B-v3
Gonna run some more tests later today.
As you have so much RAM I would suggest running Q8_0 directly. It's not slower (except perhaps for the initial model load), and might even be faster, while being almost identical in quality to the original model.
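If you don't already have the file locally, llama.cpp can also pull a specific quant straight from Hugging Face; the repo name here is just a placeholder:

    # download and serve the Q8_0 quant directly (swap in the actual repo you use)
    llama-server -hf some-org/some-model-GGUF:Q8_0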
And just to be sure: you are running the MLX version, right? The mlx-community quantization seemed to be broken when I tried it last week (it spit out garbage), so I downloaded the unsloth version instead. That too was broken in mlx-lm (it crashed), but has since been fixed on the main branch of https://github.com/ml-explore/mlx-lm.
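For anyone hitting the same crash, installing mlx-lm from main is just a pip install from git; the model path in the smoke test is a placeholder, and the exact CLI entry point may vary by version:

    pip install -U git+https://github.com/ml-explore/mlx-lm.git
    # quick check that the quant loads and generates something sensible
    # (newer releases may prefer "mlx_lm generate" instead)
    mlx_lm.generate --model /path/to/your-mlx-quant --prompt "Hello"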
I unfortunately only have 16 GiB of RAM on my M1 MacBook, but I just tried running the Q8_0 GGUF version on a 2023 AMD Framework 13 with 64 GiB of RAM, CPU only, and that works surprisingly well, generating tokens much faster than I can read the output. The prompt cache is also very useful for quickly re-injecting a large system prompt or a file to mine for information, although there are probably better ways to do that than manually through a script.
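If you're on the llama.cpp side, the prompt cache can be persisted to disk so the large prompt only gets processed once; the filenames here are placeholders:

    # first run: process the big prompt and save the KV state to disk
    llama-cli -m model-Q8_0.gguf -f big_prompt.txt --prompt-cache prompt.bin -n 256
    # later runs with the same prompt prefix reuse the cached state instead of re-processing it
    llama-cli -m model-Q8_0.gguf -f big_prompt.txt --prompt-cache prompt.bin -n 256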
On the 48 GB Mac - absolutely. The 24 GB one cannot run Q8, hence the comparison.
> And just to be sure: you are running the MLX version, right?
Nah, not yet. I have only tested it in LM Studio, and they don't have a recommended MLX version yet.
> but has since been fixed on the main branch
That's good to know, I will play around with it.
1) Pin to an earlier version of codex (sorry) - 0.55 is the best experience IME, but YMMV (see https://github.com/openai/codex/issues/11940, https://github.com/openai/codex/issues/8272).
2) Use the older completions endpoint, since llama.cpp's Responses API support is incomplete (https://github.com/ggml-org/llama.cpp/issues/19138) - a config sketch follows below.
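Roughly, that combination looks like the sketch below; treat the provider id, port, and model string as assumptions about a typical local setup rather than exact values:

    # pin codex to the older release (exact package/version format may differ)
    npm install -g @openai/codex@0.55.0

    # ~/.codex/config.toml - point codex at llama.cpp and force the chat completions wire API
    [model_providers.llamacpp]
    name = "llama.cpp"
    base_url = "http://localhost:8080/v1"
    wire_api = "chat"

    [profiles.local]
    model_provider = "llamacpp"
    model = "local-model"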
It's interesting - IMO we'll soon have draft models post-trained specifically for denser, more complicated target models. I wouldn't be surprised if diffusion models made a comeback for this: they can draft many tokens at once, while the learning curves for auto-regressive drafts seem to top out at around a 90+% match rate, so it's quite interesting.
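For context, this is the speculative-decoding setup llama.cpp already supports today, where a small draft model proposes tokens and the big model verifies them in a single pass; both model names are placeholders:

    # pair a large target model with a small draft model for speculative decoding
    llama-server -m big-target-Q8_0.gguf -md small-draft-Q4_K_M.gguf --draft-max 16 --draft-min 1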
So yes, do purchase that new MacBook Pro.
Anyway, is it possible that this is what lies behind Gemma 4's "censoring"? As in, Google made a deliberate choice to focus its training on certain domains, and incorporated the censoring to prevent it from answering about topics it hasn't been trained on?
Or maybe they're just being sensibly cautious: asking even the top models for critical health advice is risky; asking a 32B model is probably orders of magnitude more so.
That isn't the explanation. The censorship here is explicitly imposed by Google. Your explanation would make sense if various other rare domains were also censored, but they aren't, so it doesn't.
> asking even the top models for critical health advice is risky
Not asking, and living in ignorance, is riskier. For high-stakes questions, of course I'd want references that only an online model like ChatGPT or Gemini would be able to find.
That is a bad premise and a false dichotomy, because most medical questions are simple, with well-known standard answers. ChatGPT and Gemini answer such questions correctly, also finding glaring omissions by doctors, even without having to look up information.
As for the medical questions that are not simple, the ones that require looking up information, the model should in principle be able to respond that it does not know the answer when this is truthfully the case, implying that the answer, or a simple extrapolation thereof, was not in its training data.
In fact, I started using it as a coding partner while learning how to use the Godot game engine (plus some custom 'skills' I pulled together from the official docs). I purposely avoided Claude and friends entirely and just used Gemma4 locally this week... and it's really helped me not just with the coding issues I was encountering, but also with sifting through the documentation quite readily. I never felt like I needed to give in and use Claude.
Very, very pleased.