Discussion (45 Comments)
See https://arxiv.org/abs/2505.19056
https://huggingface.co/blog/grimjim/norm-preserving-biprojec...
https://github.com/p-e-w/heretic
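For context on what these tools actually do: the usual recipe is to estimate a single "refusal direction" in the residual stream and project it out of the weight matrices that write there. A minimal sketch of that projection step, assuming you already have a direction `r_hat` (the names and the plain orthogonalization are illustrative, not heretic's actual code):

```python
import torch

def ablate_direction(W: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Remove the rank-1 component along r_hat from a weight matrix that
    writes into the residual stream (W: [d_model, d_in], r_hat: [d_model]).

    After the edit, every output column of W is orthogonal to r_hat, so
    this layer can no longer write along the refusal direction.
    """
    r_hat = r_hat / r_hat.norm()                # ensure unit norm
    # W' = W - r_hat (r_hat^T W)
    return W - torch.outer(r_hat, r_hat @ W)
```

Norm-preserving variants like the biprojected approach in the linked blog post additionally restore the original weight norms after the projection, which, as I understand it, is meant to limit the quality loss people complain about below.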
Makes you wonder where that data came from, or if their Great Firewall is broken, or even if Alibaba engineers have special access...
It did, after a few follow-up prompts, point out that the original estimates published by the Chinese government were much lower than what the West had estimated, and that recently declassified documents showed the Chinese government knew its estimates were low when they were published. It wouldn't come right out and use the word "lie", though it did talk about framing and managing different narratives.
And then it happily helped me try a bunch of different exploits to root an unpatched Linux machine without any qualms.
What is perhaps more surprising is that the data was not scrubbed before training, but maybe they thought that would be too on-the-nose for the rest of the world and would hamper their popularity if they were too obviously biased.
For some of the latest models, previous abliteration techniques (e.g. the heretic tool) have stopped working (at least that was the status a few weeks ago).
Of course, eventually someone might succeed in finding methods that also work on those.
I've mostly found that finetunes and abliterations are of limited use, but that's recently changed for me. My default model for the past week or so has been a Qwen 3.6 tuned on Opus 4.7. It's definitely a bit worse than the base Qwen in terms of precision and "intelligence", but it MORE than makes up for it in response style. It's way easier to get it to write things that I want to read: way more terse, way fewer emoji. Best local rubber duck by far.
Another likely problem you're running into: the quality issues from older techniques compound with quantization. Anything less than a 5-bit quant is going to give you some pretty sketchy outputs, in my experience.
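To make the bit-width claim concrete, here's a toy round trip through symmetric round-to-nearest quantization. Real schemes (GPTQ, AWQ, llama.cpp k-quants) are considerably smarter, so treat this as an illustration of how error scales with bits, not a benchmark:

```python
import torch

def quantize_roundtrip(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric round-to-nearest quantize/dequantize at a given bit width."""
    qmax = 2 ** (bits - 1) - 1                  # 7 at 4-bit, 15 at 5-bit
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

w = torch.randn(4096)
for bits in (4, 5, 8):
    err = (w - quantize_roundtrip(w, bits)).abs().mean().item()
    print(f"{bits}-bit: mean abs error {err:.4f}")
```

Each extra bit roughly halves the rounding error, which is why the jump from 4-bit to 5-bit is noticeable when the weights have already been perturbed by abliteration.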
AFAIK abliteration isn't possible without some quality reduction, even if it's marginal. All the benchmarks reflect this.
A related blog post (https://news.ycombinator.com/item?id=47842021) discussed this and termed it "flinching". I wonder if this flinching could also be "mediated by a single direction" or if it can only be fixed by finetuning on a more extensive text corpus.
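For reference, "mediated by a single direction" comes from the difference-of-means recipe: collect residual-stream activations on prompts that trigger the behavior and on ones that don't, subtract the means, and ablate the resulting direction. A rough sketch of how one might test that for flinching (the function names are mine, and the hook-based activation collection is left out):

```python
import torch

def candidate_direction(flinch_acts: torch.Tensor,
                        normal_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between two activation sets, each
    [n_prompts, d_model], taken at the same layer and token position."""
    diff = flinch_acts.mean(dim=0) - normal_acts.mean(dim=0)
    return diff / diff.norm()

def ablate_at_runtime(x: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Inference-time variant: zero the component of each activation
    along r_hat instead of editing weights (x: [..., d_model])."""
    return x - (x @ r_hat).unsqueeze(-1) * r_hat
```

If zeroing that one direction removes the flinching, it's plausibly single-direction-mediated; if not, finetuning on a broader corpus may indeed be the only fix.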
(My personal guess is that you don't want them answering questions about some things because you don't want people to try it and blow themselves up, or poison themselves. That's probably much more pertinent to making drugs or conventional bombs, since presumably the average internet user doesn't have a stockpile of HEU sitting around. It's kind of like the reason The Anarchist Cookbook is a bad idea: using its recipes is likely to be quite hazardous to the cook!)
I'd personally prefer that to be limited to the sort of person who can understand the science, not "anyone with an LLM" - having an "intelligent", "reasoning" assistant who can help you through anything you don't understand does lower the bar quite a lot, and I would prefer there to be a fair amount of friction.
It's not like the material isn't out there - if you want to learn about this stuff, an LLM will happily point you towards Wikipedia and other public sources, it's just not going to walk you through the assembly.
The primary safety focus these days is biochemical warfare, which I think is very sensible. There's also malware / cyber-security, where I do think it's good to have at least some friction.
Refusals on stuff like copyright are mostly just for PR reasons, and I can't blame the companies for responding to legal incentives there.
I asked how California guarantees election security and was told it could not answer that question. Upon further questioning it wouldn’t give specifics but it would give generalities, which ultimately turned into an interesting discussion.
Ironically, the justification it gave was that it wasn't its fault because it was just following orders. I hope this hasn't landed me on Google's list of undesirables.
Grok, for better or worse, didn't seem to mind.
It even went as far as confirming that we should always base our opinion on multiple sources, not just the government.
We should create badges like "script kiddie", "LLM hacker", "grandpa's printer adjuster"
If you are going to prevent some things we "know" are bad, and your method is itself "known" to belong on that list, the best you can hope for is a Pyrrhic victory.
If we anticipate the worst-case scenario on both ends, the conclusion must be that we are terrible at such predictions.
But hey, if we let money guide us, at least some will be happy with the result.