Dispersion loss counteracts embedding condensation in small language models

EE-Reverance • about 5 hours ago • 5 comments • Article on chenliu-1996.github.io

Discussion (5 Comments), originally on HackerNews

aetherspawn•about 4 hours ago

It makes sense to me that distributing information across more parameters results in models that can be quantized more heavily (information theory: more bits available).

I wonder if anyone has figured out how the information is compressed and calculated the amount of information an LLM can hold depending on its size

woadwarrior01•about 3 hours ago

> I wonder if anyone has figured out how the information is compressed and calculated the amount of information an LLM can hold depending on its size

You might want to look at Physics of Language Models[1]. IIRC, the authors estimate it to be ~2 bits of factual knowledge per parameter.

[1]: https://physics.allen-zhu.com/
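
To make the ~2 bits per parameter estimate concrete, here is a rough back-of-the-envelope sketch. The function names and the example model sizes are hypothetical, and treating the figure as a constant across model sizes is an assumption for illustration, not a claim from the cited work.

```python
# Back-of-the-envelope sketch: the ~2 bits/parameter figure is the estimate
# quoted in the comment above. Treating it as constant across model sizes
# is an assumption made here for illustration only.
BITS_PER_PARAM = 2.0


def knowledge_capacity_bits(num_params: float,
                            bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated factual-knowledge capacity in bits for a given parameter count."""
    return num_params * bits_per_param


def bits_to_gigabytes(bits: float) -> float:
    """Convert bits to decimal gigabytes (1 GB = 1e9 bytes)."""
    return bits / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only to show the scaling.
    for n in (1e9, 7e9, 70e9):
        bits = knowledge_capacity_bits(n)
        print(f"{n / 1e9:.0f}B params: {bits:.1e} bits "
              f"(about {bits_to_gigabytes(bits):.2f} GB)")
```

Under that assumption the estimate scales linearly with size, so a 1B-parameter model would hold on the order of 2e9 bits, or roughly 0.25 GB of factual knowledge.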

lwansbrough•about 4 hours ago

Anyone with a billion dollars want to try this and report back?

nullc•about 3 hours ago

From the paper it appears that it's probably more useful on small-ish models.

lwansbrough•about 2 hours ago

What does it cost to train a model like 1-bit Bonsai? Anyone know?