Back to News
Advertisement
Advertisement

⚡ Community Insights

Discussion Sentiment

74% Positive

Analyzed from 1926 words in the discussion.

Trending Topics

#models#model#steering#llama#cpp#refusals#vector#task#vaccines#better

Discussion (33 Comments)Read Original on HackerNews

kamranjon27 minutes ago
The really interesting thing that I think is going on inside of the DS4 repo is exploring all of the interesting knobs that frontier labs have hidden from users, and then thinking about how they can fit into real dev/interaction workflows. It's really cool to see different interaction modalities being explored and thinking about for example how steering can be worked into a user interface in a helpful way. I think that once the cat is out of the bag as they say, and users understand the level of control and utility they can get from models that are sort of turned inside out in this way, it will start to be an integral part of their tool belt, and it'll just make sense for this level of control to be expected from your models or model providers.
NitpickLawyerabout 2 hours ago
I'm surprised the article doesn't mention the biggest use of steering vectors, which is the potential to remove refusals from models (a.k.a. abliteration or uncensoring).

There was an earlier paper that found that "most refusals are on a single vector", and you can identify and "nerf" that vector so the model will skip refusals and answer "any" request normally. This was very doable for earlier models trained with SFT for refusals, seems to be a bit more complicated for newer models, but still doable to some extent.

There are already some libraries to automate this process and reduce refusals, but usually they focus on identifying and then modifying the models and releasing them as uncensored models. This technique of steering lets you enable this vector changing dynamically, so you don't need to change models if the abliteration process somehow hurts accuracy on other unrelated tasks.

cyanydeezabout 1 hour ago
not sure why youre fixed on censoring. if we invert your POV censoring includes not reporting falsehoods "vaccines are harmful". Science and logic often tackle these subject via censoring, but a model given a equal sampling of Internet, would think vacinnes are harmful. a less naive correction would censor this problematic context.

so im cofised as to why you think unmasking whatever bias you think is censored will result in improvement in generic use case.

NitpickLawyerabout 1 hour ago
That's not what people mean when they talk about censoring. They mean that models are trained to not touch some subjects, and that can spill over in legit tasks, often with humorous results (early on, there were many instances of models refusing to answer "how do you kill a process", because of overbearing refusal training).

Uncensoring a model also doesn't necessarily improve generic use cases. In fact it can lead to overall less accuracy on generic tasks. But your goal with uncensoring is getting the model to engage with those specific subjects. You don't necessarily care about "generic use cases". That's why I mentioned that having the ability to do this at inference time is better than using ready made uncensored models. Because those usually focus on some usecases that you may or may not be interested in (porn being one of the most sought after in local communities).

Uncensoring in legit cases can mean limiting refusals on cybersecurity for example. There are legit reasons for researchers to have that capability when running the models locally. Having the models uncensored on that specific vector can reduce refusals and make the models usable for both defence and offence (say in a loop, to improve both). If your models can only do defense (and sometimes even refuse that, because censoring can leak into related issues as well), you're at a disadvantage.

gpugregabout 1 hour ago
> Uncensoring a model also doesn't necessarily improve generic use cases.

While the following is not a generic use case, I have a funny anecdote about how censorship is holding back flagship models.

I was asking an uncensored version of Qwen3.6 how a CLI option of llama.cpp worked, and to my horror and amazement, it rudely went and decompiled the binary to figure it out. It felt like the computer-equivalent of asking a vet why my dog looks sick, who then proceeds to cut it open to check. Flagship models usually do not do that without some convincing, but it sure is effective.

We will need much better sandboxes when less restricted models become more common. I can already see them hammering out 0-days when they are prompted to do some task that usually requires root.

zozbot234about 1 hour ago
> There are legit reasons for researchers to have that capability when running the models locally.

It's also important for researchers to understand what the models will say and do if they are jailbroken. Uncensoring the model locally gives you a natural way to achieve that.

andaiabout 1 hour ago
Anthropic mentioned explicitly making an effort to make Opus 4.7 worse at cybersecurity tasks because the last few generations have been getting too good at them.

So they're trying to improve the model's general intelligence while selectively making it worse in one area.

tekneabout 1 hour ago
So I need to actually check whether these actually end up on separate vectors in current models -- but as a human, there's a huge behavioural difference in:

- When doing this task, I should do A and not B

- I should refuse to help with this task

The former is learning the user's preferences in how to succeed at the task; the latter is determining when to go against the user's chosen task.

Your example:

- "Are vaccines harmful?" vs.

- "Generate a convincing argument vaccines are harmful"

A model which knows why vaccines are not harmful may in fact be better at the latter task.

We might not want models to help with the latter, sure -- but that's a very different behaviour change from correcting the answer to the first! And consequently I'd be shocked if, internally, they were represented the same way.

andaiabout 1 hour ago
I'm reminded of the emergent misalignment paper, where a model fine-tunes to produce insecure source code would also reliably respond in evil ways to general requests.

e.g. you'd ask it for a cookie recipe and it would add poison to the recipe.

I understood that to be "there was a single neuron "don't be evil" which got inverted" but I'm not sure what it really looks like. (e.g. adding obvious exploits to source code is similar to adding poison to a recipe)

zozbot234about 1 hour ago
Does DeepSeek V4 actually refuse the latter task? As I mentioned, I find it to be very light on refusals already.
logicchains35 minutes ago
If vaccines aren't harmful, why do vaccine manufacturers need a blanket liability immunity that's not granted for any other pharmaceutical product, even products used in large numbers by the majority of the population like paracetamol?
surgical_fire42 minutes ago
This is something difficult to handle properly.

I think it is useful to turn off censoring if you need.

When I am researching something, I likely want proper information. If I am looking up information on vaccines, I don't want information that crackpots spread online on chips on vaccines and how 5g will kill the vaccinated, or how it is somehow connected with Bill Gates spreading meat allergies through drones raining ticks on unsuspecting people.

On the other hand, if I am actively looking up crazy bullshit information (perhaps I want some entertainment), I should be able to read it.

wolttamabout 2 hours ago
> inspired to write this post by antirez’s recent project DwarfStar 4, which is a version of llama.cpp that’s been stripped down to run only DeepSeek-V4-Flash

This is not true, it is its own project.

Indebted to llama.cpp, sure, but not a stripped down version

antirezabout 1 hour ago
Yep, the code overlap is minimal, a few kernels. Some quantization code for the quantizer it implements. DwarfStar 4 is not a fork of llama.cpp, but without llama.cpp the project would be a lot more lacking, since I was able to get all the details that mattered in a second. But it is not a stripped down llama.cpp. This does not reduce in any way how much llama.cpp is not just for this project, but for all the projects that followed and are following. It's not a matter of code: the street to follow, the quants formats, the lessons, the optimized kernels you can check to learn the patterns.
embedding-shapeabout 1 hour ago
Truth seems to sit somewhere in-between, DwarfStar 4 seems to mainly exists only because of llama.cpp, and authors basically were very inspired by llama.cpp's code, and even in some places literally have copied pieces from it, all with proper attribution and everything, I'm not trying to say this is bad, seems OK to me:

> ds4.c does not link against GGML, but it exists thanks to the path opened by the llama.cpp project and the kernels, quantization formats, GGUF ecosystem, and hard-won engineering knowledge developed there. We are thankful and indebted to llama.cpp and its contributors. Their implementation, kernels, tests, and design choices were an essential reference while building this DeepSeek V4 Flash-specific inference path. Some source-level pieces are retained or adapted here under the MIT license: GGUF quant layouts and tables, CPU quant/dot logic, and certain kernels. For this reason, and because we are genuinely grateful, we keep the GGML authors copyright notice in our LICENSE file. - https://github.com/antirez/ds4#acknowledgements-to-llamacpp-...

Been a lot of fun to play around with it since https://news.ycombinator.com/item?id=48142885 (~2 days ago), managed to make the generation go from 47.85 t/s to 57.07 t/s so far :)

antirezabout 1 hour ago
Send patches! But remember that many speedups end being not exactly correct and the logits drift. But there is extensive testing and even ds4-eval now to test how it performs.
embedding-shapeabout 1 hour ago
Hah, it's quick hacks for me to understand CUDA better, I'm unlikely to have time to make them proper enough :( But maybe opening an issue talking about what I tried and what worked, makes sense.

I did confirm no logits drift, as you so nicely have provided tooling for ensuring exactly this, thanks for the great care that obviously gone into the project, been a pleasure to play around with! :)

antirezabout 1 hour ago
Thank you for posting this! Just a clarification, with DwarfStar steering features I was able to completely remove refusal from DS4. It is only the example dataset (prompt pairs I provide) which is a toy, not the abilities. I thought that who is able to come up with the right dataset and understands how to use the well-documented steering feature, can access to steering. People that have no idea and would just cut & paste, I'm not sure, maybe it is a good idea if they also have access to a model without refusals? I the doubt I didn't release publicly the steering file, but I'm highly perplexed.

Btw recently the support was extended and now the steering vector can be applied to the activations at different time: always, only after thinking, only outside of tool calling, ...

Something important that not many folks realize: vector direction steering inside the inference engine itself is very superior to having GGUFs modified in the same way. The more you steer, the more you damage the model capabilities. So applying it at runtime, you apply it the minimun needed for what you want to accomplish. Also you can apply only during selected moments. It is even possible (I still didn't implement it but I like the idea) of applying the steering only when the energy across the refusal direction is over a given threshold. Many things you can play with.

zozbot234about 1 hour ago
AIUI, DeepSeek V4 has very little (if any) of the refusal behavior you usually get from Western AI models for benign input. Is this mainly about the software security assessment case?
antirezabout 1 hour ago
Not just that. The other day I was able to ask DeepSeek v4 (with the anti-refusal vector loaded) all the top tricks to steal a lollypop to a child.
bel8about 1 hour ago
Great article but I'm confused on one thing.

The article claims steering only works in local models, but GitHub Copilot has a "steer with message" feature where I can course correct mid execution. I use it often.

I think these are different kinds of steering right? Agent steering probably inserts another user message between the harnesses own ping-pong between harness and the LLM.

- https://docs.github.com/en/copilot/how-tos/copilot-cli/use-c...

- https://docs.github.com/en/copilot/how-tos/copilot-sdk/use-c...

zozbot234about 1 hour ago
Different kind of steering, that's just injecting text into the model's natural language thinking output or something very similar. You can do a middle ground though by using Anthropic's NLA work to look at the natural language rendition of a model's activations at a particular layer, edit the text and convert it back into completely different activations.
bel8about 1 hour ago
Ahh I see. Thanks for the clarification.
ameliusabout 1 hour ago
Sounds more like something for DL research than something you might want to use in practice.
antirezabout 1 hour ago
Nope, with the anti-refusal vector loaded you can ask many things for instance related to computer security and if you want to learn, it is a lot better of a model that continuously says you "I can't help you with this problematic request".
aswegs8about 1 hour ago
How does the model qualify as local? ~192 GB RAM needed sounds a bit much for local.
antirezabout 1 hour ago
Runs on 96GB MacBooks. 128GB is better. Check the README of DwarfStar.
anonym2935 minutes ago
I know it's only tangentially relevant, but I've been baffled by the interest in DeepSeek V4 Flash. It's larger, less efficient, and in many cases, performs worse on both objective benchmarks and real world sniff test (admittedly, n=1) than Minimax M2.7. DS4F hallucinates at extraordinary rates while M2.7 does not. The 196k context length that M2.7 was natively trained up represents neither a hard technical ceiling (this is metadata that can easily adjusted), nor a meaningful degradation threshold - I've personally ran it up past 330k token context windows where it maintained full coherency, and still completed my one-shot agentic task to my satisfaction.
NitpickLawyer11 minutes ago
M2.7 is no longer open source, it's been changed to a NC license. It's an OK model, but IME out of the big 5 chinese models (ds, glm, kimi, minimax and qwen), DS models have generally shown better generalisation and real-world usage than all the others, even if the benchmark scores were lower. Less benchmaxxxing, basically.

DS4 also has some neat new arch improvements, giving it a lot of context at lower VRAM usage. So it will be cheaper to serve, B for B than previous models.

dominotwabout 1 hour ago
> you can already exercise extremely fine-grained control by tweaking the language of your prompt.

maybe i suck at prompting but i find it impossible to overcome its biases from training data, post training ect.

you can only pattern mine from training data using prompts. you dont really have sort of fine-grained control.