Discussion (5 Comments) | Read Original on HackerNews
Playing with agents can get expensive quickly, please be careful.
The honest headline results: 14% infill recovery where autoregressive models score ~0 (they can't condition on text after the blank), 7.5% repetition-loop rate vs 37.5% for the teacher, and a genuinely negative result I think is the most useful part: six different self-correction methods all failed at this scale, while a 300k-param external critic head detects errors far above chance. Small models don't doubt; they rationalize.
Weights are open: https://huggingface.co/devnull37/hr-diffuse-1-nano. Happy to answer anything about the architecture or the failed runs.
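For readers unfamiliar with why an autoregressive model "can't condition on text after the blank", here is a minimal sketch of the masking argument. It is not taken from the linked model: the function names, the `[BLANK]` placeholder, and the toy sentence are all hypothetical, and the full-attention mask merely stands in for any model that can see both sides of the gap.

```python
# Illustrative sketch only: names and the toy sentence are assumptions,
# not part of the hr-diffuse-1-nano code.

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def visible_context(tokens, mask, position):
    """Tokens that `position` may attend to, excluding the position itself."""
    return [tok for j, tok in enumerate(tokens)
            if mask[position][j] and j != position]


tokens = ["the", "cat", "[BLANK]", "on", "the", "mat"]
blank = tokens.index("[BLANK]")

causal_view = visible_context(tokens, causal_mask(len(tokens)), blank)
full_view = visible_context(tokens, full_mask(len(tokens)), blank)

print(causal_view)  # ['the', 'cat']
print(full_view)    # ['the', 'cat', 'on', 'the', 'mat']
```

Under the causal mask the blank sees only the prefix, so "on the mat" cannot inform the fill; with full attention both sides are available.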
I'm pretty busy, so I only skimmed the article, but it's actually really interesting, and also informative as I'm not familiar with diffusion models. Maybe I'll ask some questions/write more later. I do want to encourage you, but honestly the websites are a bit over the top and there's no way to know how much human input actually went into them.
Experimental science is very messy, as you've learnt. Agreed with the other commenter: there's value (for others and especially yourself) in writing down what went wrong, and the things in the "Small models cannot judge themselves" section are so reminiscent of failure modes I've experienced myself. There are usually awful or subtle bugs, training just doesn't work, and even if the results are "interesting" rather than "bad", it can still be incredibly difficult to decide what to conclude from them. To distill knowledge from observations/experiments is the problem of science. You read papers where the experiments seem neat and the results profound, but the truth is they're probably a mess too and the evidence for the conclusions is probably a lot weaker than it looks; ML experiments can be unreproducible too.
I suggest that you were running experiments at too large a scale given your resources: you should try to sort out these critical issues on a smaller scale. Yes, the painful problem with ML is that things change qualitatively with scale; you just don't know if a larger scale will fix your issues. But most of these bugs didn't need scale to discover. Think about how you could have more easily discovered them.
Sorry to tell you that your comment was dead (silently blocked, invisible to most users) until I vouched for it. Don't be discouraged from posting on HN. Clearly you're a real person and you also wrote this with an LLM (quite understandably), but people are really put off by text that smells LLM-generated, and it's really easy to tell. HN is flooded with LLM comments lately, and they go dead. You can use an LLM to help write, but don't let it determine the content, be genuine, and make sure the text doesn't read like LLM output. They can write in any style.