
Discussion (26 Comments) · Read Original on HackerNews

thepasch•about 3 hours ago
If I’m reading this right, this is pretty wild. They turned a Qwen autoregressor into a diffuser by using a bunch of really clever techniques, and they vastly outperform any ā€œnative diffuser,ā€ actually being competitive with the base model they were trained from. The obvious upside here is the massive speedup in generation.

And then through a LoRA adapter, you can ground the diffuser on the base model’s distribution (essentially have it ā€œcompareā€ its proposals against what the base model would’ve generated), which effectively means: exact same byte-for-byte output for the same seed, just roughly twice as fast (which should improve even more for batched tasks).

I’m not an expert, more of a ā€œpracticing enthusiast,ā€ so I might be missing something, but at first glance, this reads super exciting to me.

awestroke•about 2 hours ago
I don't understand how you can compare against the base model output without generating with the base model, in which case what's the point?
qeternity•about 2 hours ago
I haven't read TFA yet but a common technique is speculative decoding where a fast draft model will generate X tokens, which are then verified by the larger target model. The target model may accept some Y <= X tokens but the speedup comes from the fact that this can be done in parallel as a prefill operation due to the nature of transformers.

So let's say a draft model generates 5 tokens; all 5 of these can be verified in parallel with a single forward pass of the target model. The target model may only accept the first 4 tokens (or whatever), but as long as the 5 forward passes of the draft model + 1 prefill of the target model are faster than 4 forward passes of the target, you will have a speedup while maintaining the exact output distribution of the target.
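A minimal sketch of that draft/verify loop, using toy deterministic functions in place of real draft and target models (everything here is illustrative, not the paper's code; with greedy decoding the guarantee is that the output matches the target model exactly):

```python
def target_next(ctx):
    # stand-in for one forward pass of the large target model
    return (sum(ctx) * 31 + 7) % 50

def draft_next(ctx):
    # cheap approximation: agrees with the target most of the time,
    # with an injected occasional mismatch for demonstration
    t = target_next(ctx)
    return t if len(ctx) % 4 != 3 else (t + 1) % 50

def speculative_step(ctx, k=5):
    """Draft k tokens, then verify them with what amounts to a single
    batched prefill of the target; accept the longest matching prefix
    plus one token from the target itself."""
    draft, c = [], list(ctx)
    for _ in range(k):                  # k cheap draft passes
        t = draft_next(c)
        draft.append(t)
        c.append(t)

    # verification: the target's prediction at every draft position can
    # be computed in one parallel pass; we loop here only for clarity
    accepted, c = [], list(ctx)
    for t in draft:
        expect = target_next(c)
        if t != expect:
            accepted.append(expect)     # target's correction token
            return accepted
        accepted.append(t)
        c.append(t)
    accepted.append(target_next(c))     # bonus token after a full accept
    return accepted

def generate(ctx, n, k=5):
    out = list(ctx)
    while len(out) < len(ctx) + n:
        out += speculative_step(out, k)
    return out[:len(ctx) + n]
```

Because every accepted token is either confirmed or corrected by the target, the result is byte-for-byte what greedy decoding of the target alone would produce; the win is that most target "passes" are verifications batched into one prefill.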

Balinares•33 minutes ago
Isn't that exactly how draft models speed up inference, though? Validating a batch of tokens is significantly faster than generating them.
a1j9o94•about 2 hours ago
You would only use the base model during training. This is a distillation technique.
anentropic•about 1 hour ago
presumably that happens at training time?

then once successfully trained you get faster inference from just the diffusion model

andsoitis•about 5 hours ago
Is anyone here experimenting seriously with Diffusion for text generation? I’d love to learn about your experiences!
recsv-heredoc•about 5 hours ago
https://www.inceptionlabs.ai/

This startup seems to have been at it a while.

From our look into it - amazing speed, but challenges remain around time-to-first-token user experience and overall answer quality.

Can absolutely see this working if we can get the speed and accuracy up to that “good enough” position for cheaper models - or non-user facing async work.

One other question I’ve had is whether it’s possible to set a huge amount of text to diffuse as the output - using a larger body to mechanically force greater levels of reasoning. I’m sure there’s some incredibly interesting research taking place in the big labs on this.

IanCal•about 4 hours ago
The overall speed rather than TTFT might start to be more relevant as the caller moves from being a human to another model.

However quality is really important. I tried that site and clicked one of their examples, "create a javascript animation". Fast response, but while it starts like this

``` Below is a self‑contained HTML + CSS + JavaScript example that creates a simple, smooth animation: a colorful ball bounces around the browser window while leaving a fading trail behind it.

<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <title>JavaScript Bounce Animation</title> <style> body, html { margin: 0; padding: 0;

```

the answer then degrades to

``` radius: BALL_RADIUS, color: BALL_COLOR, traivD O] // array of previous {x,y} positions }; ```

Then more things start creeping in

``` // 3⃣ Bounce off walls if (ball.G 0 ball.radius < 0 || ball.x + ball.radius > _7{nas.width) { ball.vx *= -1; ibSl.x = Math.max(ball.radius, Math.min(ball.x, canvbbF4idth - ball.radius)); } if

```

and the more it goes on the worse it gets

``` Ho7 J3 Works 0 Atep | Description | ```

and

``` • prwrZ8}E6on 5 jdF wVuJg Ar touc> 2ysteners ,2 Ppawn \?) balls w>SFu the 8b$] cliM#]9 ```

This is for the demo on the front page, so I expect this is a pretty good outcome compared to what else you might ask.

cataflutter•about 4 hours ago
Weird; I clicked through out of curiosity and didn't get any corruption of the sort in the end result.

I also asked it some technical details about how diffusion LLMs could work and it provided grammatically-correct plausible answers in a very short time (I don't know the tech to say if it's correct or not).

girvo•about 4 hours ago
It's being explored right now for speculative decoding in the local-LLM space, which I think is quite interesting as a use-case

https://www.emergentmind.com/topics/dflash-block-diffusion-f...

roger_•about 2 hours ago
DFlash immediately came to my mind.

There are several Mac implementations of it that show > 2x faster Qwen3.5 already.

moostee•about 5 hours ago
I have. It requires a distinct intuition compared to a normal language model. Very well suited to certain problems.
andsoitis•about 4 hours ago
Can you tell us more?
ramon156•about 3 hours ago
> 2025-04-12: Initial code release with training and inference support.

> 2025-04-12: Released I-DLM-8B, I-DLM-32B, and I-DLM-8B-LoRA on HuggingFace.

Is this old already? Not saying that's a bad thing, since it seems very sophisticated. Just curious if there's an update

oersted•about 3 hours ago
It's clearly a typo in the year: April 12 was two days ago, and a quick check on HuggingFace shows that they were uploaded 5 days ago.
scotty79•about 1 hour ago
So can you just use this and have a faster Qwen32b?

https://huggingface.co/yifanyu/I-DLM-32B/tree/main

simianwords•about 3 hours ago
Can diffusion models have reasoning steps where they generate a block, introspect and then generate another until the output is satisfactory?
moeadham•about 3 hours ago
Well, you can take the output of a first pass and pass it back through the model, like AR “reasoning” models do at inference time.
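That generate-then-refeed loop could be sketched like this (purely illustrative; `diffuse_block` and `score` are hypothetical stand-ins for a diffusion generation call and a self-assessment step, not any real API):

```python
def refine(prompt, diffuse_block, score, max_rounds=4, threshold=0.9):
    """Hypothetical refinement loop: diffuse a block, self-assess it,
    and re-diffuse conditioned on the previous draft until it passes
    or the round budget runs out."""
    draft = diffuse_block(prompt)
    for _ in range(max_rounds - 1):
        if score(prompt, draft) >= threshold:
            break
        # feed the draft back in as context, analogous to an AR model
        # re-reading its own chain of thought before answering again
        draft = diffuse_block(prompt + "\n\nPrevious draft:\n" + draft)
    return draft
```

The interesting open question is whether the "introspect" step can itself be part of the diffusion process rather than a separate outer loop.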
simianwords•about 2 hours ago
Yes and has this been tried?
Topfi•about 1 hour ago
Yes, Mercury 2 is a reasoning model [0].

[0] https://docs.inceptionlabs.ai/get-started/models#mercury-2