All content is based on Andrej Karpathy's "Intro to Large Language Models" lecture (youtube.com/watch?v=7xTGNNLPyMI). I downloaded the transcript and used Claude Code to generate the entire interactive site from it, as a single HTML file. I find it useful to revisit this content from time to time.

Discussion (35 Comments) · Read Original on HackerNews
> you end up with about 44 terabytes — roughly what fits on a single hard drive
No normal person would think that 44 TB is a usual hard drive size (I don't think it even exists? 32 TB seems to be the max at my retailer of choice). I don't think it's wrong per se to use an LLM to produce a cool visualization, but this lack of proofreading doesn't inspire confidence (especially since the 44 TB figure is displayed prominently in a different color).
For SSDs the record seems to be 245 TB - [2]
[1] - https://www.seagate.com/stories/articles/seagate-delivers-in...
"Seagate’s Mozaic™ 4+ hard drives supporting capacities up to 44TB are now shipping in volume to two leading hyperscale cloud providers."
[2] - https://fudzilla.com/kioxia-showcases-245-76tb-lc9-enterpris...
From around the 2:20 mark he says:
“[…] actually ends up being only about 44 TB of disk space. You can get a USB stick for like a TB very easily, or I think this could fit on a single hard drive almost today”
So it's just slightly altered from what was said in the original video. The LLM-rewritten version also says "roughly" where he said "almost", and I guess 44 TB is pretty "roughly" or pretty "almost" 32 TB. Although I'd still personally probably put it as "can fit on a pair of decently sized hard drives today" (for example, across two 24 TB drives).
Regardless, it’s close enough to what was said in the source video that it’s not something the LLM just made up out of nowhere.
Still obviously crazy to consider that any kind of "average" or common size, but certainly not outrageous, especially for someone working in that field.
What does the input side of the neural network look like? Is it enough bits to represent N tokens, where N is the context size? How does it handle inputs that are shorter than the context size?
I think embedding is one of the more interesting concepts behind LLMs but most pages treat it as a side note. How does embedding treat tokens that can have vastly different meanings in different contexts - if the word "bank" were a single token, for example, how does embedding account for the fact that it can mean river bank or money bank? Do the elements of the vector point in both directions? And how exactly does embedding interact with the training and inference processes - does inference generate updated embeddings at any point or are they fixed at training time?
(Training vs. inference time is another thing explanations are usually frustratingly vague on.)
When it comes down to the meaning of the "bank" embedding: it cannot be interpreted directly, although you can run statistical analysis on embeddings (like PCA). Loosely speaking, the embedding for "bank" contains all possible meanings of the word; the particular one is inferred not by the embedding layer but via later attention operations that associate this token with the other tokens in the sequence (e.g. self-attention).
The sequence is of variable length. This was one of the "early" problems in sequence modelling: how to deal with inputs of varying length in neural networks. There is a lot of literature about it.
This is the source of plenty of silent problems of various kinds:
- out-of-distribution data (short sequences vs. long sequences may not have the same performance)
- quadratic behavior due to data copy
- normalization issues
- memory fragmentation
- bad alignment
One way of dealing with it is to treat a variable-length sequence as a fixed-size sequence, filling the empty elements with zeros and using "masks" to specify which elements should be ignored during the operations.
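A minimal sketch of that padding-and-masking idea (toy numpy code, not from the lecture; the function names are made up):

```python
import numpy as np

def pad_and_mask(sequences, max_len):
    """Pad variable-length token-id lists to a fixed length with zeros,
    and build a boolean mask marking the real (non-padding) positions."""
    batch = np.zeros((len(sequences), max_len), dtype=np.int64)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, seq in enumerate(sequences):
        batch[i, : len(seq)] = seq
        mask[i, : len(seq)] = True
    return batch, mask

def masked_softmax(scores, mask):
    """Softmax that gives padded positions exactly zero weight,
    by setting their scores to -inf before normalizing."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

batch, mask = pad_and_mask([[5, 9, 2], [7, 1]], max_len=4)
weights = masked_softmax(np.random.randn(2, 4), mask)
# Each row of `weights` sums to 1, and padded slots contribute nothing.
```

This is what attention masks do in practice: the padded zeros still flow through the matrix multiplies, but the mask guarantees they never influence the result.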
----
Concerning an embedding having multiple semantic meanings: it is best effort, and all combinations of behavior can occur. The embedding layer is typically the first layer; it converts the integer token id into a vector of floating-point numbers of the embedding dimension. It tries its best to separate the meanings to make the task of the subsequent layers of the neural network easier. It's shovelling the shit it can't handle down the road for the next layers to deal with.
For experiments, you can try merging two tokens into one, or into an <unknown> token, in order to free up a token for special use without having to increase the size of the vocabulary.
Embeddings can sometimes be the average of the disambiguated embeddings; sometimes they are their own thing.
In addition to embeddings, you can often look at the inner representation at a specific depth of the neural network. After a few layers, the representation has usually been disambiguated based on context.
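Concretely, the embedding layer is just a learned lookup table; here is a toy sketch (sizes and token ids are made up for illustration):

```python
import numpy as np

# The embedding layer is a matrix of shape (vocab_size, embedding_dim);
# "looking up" a token is just integer row indexing.
vocab_size, embedding_dim = 50_000, 768
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim)).astype(np.float32)

# Hypothetical ids, with id 1042 ("bank", say) appearing twice.
token_ids = np.array([1042, 7, 1042])
vectors = embedding_table[token_ids]  # shape (3, 768)

# At this stage the mapping is context-free: the same id always yields
# the exact same vector, whatever the surrounding tokens are.
```

In a real model the table is learned by gradient descent rather than drawn at random, but the lookup mechanics are the same.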
The last layer is also especially interesting because it is the one used to project back to the original token space. Sometimes we force its weights to be shared with the embedding layer. This projection layer usually can't use context, so it must contain within itself all the information necessary to map straightforwardly back to token space. This last representation is often used as a full-sequence representation vector for subsequent, more specialized training tasks.
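The weight-sharing trick ("weight tying") can be sketched in a few lines of numpy; this is a hypothetical illustration with random vectors, not any particular model:

```python
import numpy as np

# Weight tying: the output projection reuses the embedding table
# (transposed) to map a hidden state back to vocabulary logits.
rng = np.random.default_rng(1)
vocab_size, d = 1000, 128
E = rng.normal(size=(vocab_size, d))  # shared embedding table

hidden = E[42].copy()   # pretend the network output token 42's embedding
logits = hidden @ E.T   # tied projection: (d,) @ (d, vocab) -> (vocab,)

# Random high-dimensional embeddings are roughly orthogonal, so a
# vector's strongest match is its own row: argmax recovers the token.
predicted = int(logits.argmax())
```

Tying halves the parameter count of these two layers and encourages the hidden states near the output to live in the same space as the input embeddings.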
Embedding weights are fixed after training, but in-context learning occurs during inference. The early tokens of the prompt help disambiguate the later tokens. For example, <paragraph about money> bank vs. <paragraph about landscape> bank vs. bank alone will have the same input embedding for the "bank" token, but one or two layers down the line the associated representations will be very different, and close to the appropriate meaning.
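This disambiguation effect can be demonstrated with a toy single-head self-attention step (made-up random embeddings, no learned weight matrices, so only a sketch of the mechanism):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
# Fixed, context-free input embeddings for three made-up tokens.
emb = {w: rng.normal(size=d) for w in ["river", "money", "bank"]}

def self_attend(vectors):
    """One simplified dot-product self-attention step: each position's
    output is a softmax-weighted mix of every position's vector."""
    X = np.stack(vectors)                       # (seq, d)
    scores = X @ X.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X                                # contextual representations

# "bank" enters both sequences with the identical input embedding...
river_bank = self_attend([emb["river"], emb["bank"]])[1]
money_bank = self_attend([emb["money"], emb["bank"]])[1]
# ...but one attention layer later its representation differs by context.
```

Real models add learned query/key/value projections and many layers, but this is the core reason the same input embedding ends up near different meanings.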
Genuine piece of feedback: as soon as I see those gradients + quirks, my perception immediately becomes "you put no effort into finding your own style, therefore you will not have put effort into creating this website."
Right on both counts
So plagiarism is even explicit now. A stolen database relying on cosine similarity to parse the prompts.
Why doesn't The Pirate Bay have a $1 trillion valuation?
Hard pass on AI slop. First, as a matter of principle: it brings no real value; anyone can iterate over some prompts to generate a version of this. Second, more specifically: don't you know that LLMs are particularly prone to mistakes in summarising, where they make subtle changes in wording that have much wider impact on the meaning?
If you insist on being the human part of a centaur, then at least do your human slave part: inspect the excreted "content", fix inconsistencies, etc.
@dang, when is the 'flag as slop' button coming?
Same everywhere: avalanches of AI garbage and intellectual dishonesty. People claiming "I wrote this", then a look at the code shows massive slop and an author with no clue about the topic.
More worryingly, this trend is creeping into all domains: "Nearly 75,000 tracks uploaded to Deezer are fully created using AI. That's 44% of daily uploads, and more than 2 million per month. Back in June, the daily number was around 20,000."
https://www.vice.com/en/article/how-deezer-is-fighting-fraud...