The sample efficiency black hole

That's an apples-to-oranges comparison. The direct equivalent would be the sample efficiency of in-context learning, which is entirely comparable: it can work off just a few samples, or even just one. It's a very crude mechanism, but it actually teaches new programming languages to the model just fine! Of course, it is subject to all the limitations of attention, context rot, and other quirks.

Instead of chasing some kind of holy grail that would alchemically make the model learn complex abstractions from insufficient data, I suggest finding ways to turn the stepped ladder (bulk data, then pretraining, then post-training, then in-context learning) into a smooth gradient, and to make it a lot deeper.

>Many billions of years of evolution is our pre-training

Not just that, but also the compression of knowledge that happens in society, which accounts for the majority of "our" intelligence. Let's also not forget that we are fairly specialized, and that you're actually comparing against the entire society, a huge generational system, not a single individual. The pure biological capabilities of a single individual are pretty mediocre: look at a caveman and you'll find his sample efficiency a lot less impressive, despite the same biological capacity. For example, your ability to learn new math depends entirely on your prior math knowledge, which is the result of a humongous amount of gradient descent performed by generations of mathematicians, at an ever-increasing rate and level of abstraction. A caveman wouldn't be able to learn it that easily.

>Our genome is 3GB, about 1-2% protein coding. That is just not enough space to store the model parameters that are supposedly pretrained

1. A lot more information is encoded implicitly: in our entire environment, in our parents (a genome can't reproduce itself!), and only then in the genome.

2. Evolution had a lot more "compute" to spend on cramming the same stuff into a lot fewer dimensions.

>These comparisons are not including the multimodal data we see in our lifetimes. If you include all this sensory information, we’re probably in the 10s to 100s of billions of tokens range from birth to adulthood

What is this number, and where does it come from? Tokenization is an artifact of current architectures, a step on that discrete compression ladder. The latent throughput of the brain is in the range of double-digit bits per second, and that is already pushing it to the max; the actual amount of information is a lot less.
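To put a rough scale on that, here is a back-of-the-envelope sketch of what a double-digit bits-per-second rate adds up to by adulthood. The 18-year span, the round-the-clock intake, and the `lifetime_bits` helper are my own assumptions for illustration, not measured values:

```python
# Back-of-the-envelope only: every constant here is an assumption.
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # average year, in seconds


def lifetime_bits(bits_per_second: float,
                  years: float = 18.0,
                  waking_fraction: float = 1.0) -> float:
    """Total bits taken in at a constant rate over `years`.

    `waking_fraction` scales the total down for time spent asleep;
    1.0 (never asleep) gives a deliberate upper bound.
    """
    return bits_per_second * years * SECONDS_PER_YEAR * waking_fraction


# "Double-digit bits per second" read as the range 10..99 bits/s.
low = lifetime_bits(10)
high = lifetime_bits(99)

print(f"lifetime intake: {low:.2e} to {high:.2e} bits")
print(f"in gigabytes:    {low / 8 / 1e9:.2f} to {high / 8 / 1e9:.2f} GB")
```

Even with zero sleep this comes out to single-digit gigabytes at most. Converting that into "tokens" would need yet another assumption about bits per token, which is exactly the part that is architecture-specific.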

