
Discussion (54 Comments)
I sort of get the "problem", but the fact that this is even needed is stupid.
I feel like people just jam poorly specified input into LLMs and hope for the best. Then pile more tools on top when they don’t get what they want.
People call this exact process "vibe coding".
For this use case, our solution was just to use a slug for the filename, but we can control the uniqueness constraint on our backend.
It feels much like the random number generators in your operating system. The OS is responsible for providing applications with a source of entropy. In the same line of thinking maybe IDEs, agent frameworks, whatever you want to call it, should be responsible for providing some base functionality.
But this seems orthogonal to token usage, and if I was designing an "LLM-friendly UUID" it would have some additional checksum data, to detect transcription errors.
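To make the checksum idea concrete, here is a minimal sketch in Python. It is not anything id-agent does: the function names and the weighted mod-17 scheme are assumptions picked for illustration. Because 17 is prime and every weight lies between 1 and 16, changing any single hex character always changes the check value.

```python
import uuid

HEX_DIGITS = "0123456789abcdef"


def check_digits(hex_body: str) -> str:
    """Weighted sum of the hex digits modulo 17, written as two hex characters.

    Hypothetical scheme for illustration only. Weights cycle through 1..16,
    so a single-character substitution can never leave the sum unchanged
    modulo the prime 17.
    """
    total = sum(((i % 16) + 1) * int(ch, 16) for i, ch in enumerate(hex_body))
    return f"{total % 17:02x}"


def with_checksum(u: uuid.UUID) -> str:
    """Return the 32 hex digits of the UUID followed by 2 check characters."""
    body = u.hex
    return body + check_digits(body)


def is_valid(candidate: str) -> bool:
    """True only if the candidate is well formed and its check characters match."""
    candidate = candidate.strip().lower()
    if len(candidate) != 34 or any(c not in HEX_DIGITS for c in candidate):
        return False
    body, check = candidate[:32], candidate[32:]
    return check_digits(body) == check
```

A tool layer could call is_valid on every ID the model emits and reject the call on a mismatch. The cost is two extra characters per ID, and this sketch only guarantees detection of single-character substitutions.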
I can see this being useful when feeding raw table-dump CSVs into models: the isomorphism means it's a simple pre- and post-processing step, which could give you a cheap decrease in tokens and an increase in accuracy.
I guess you’re another bot
> Where UUIDs cost ~23 tokens and get hallucinated by LLMs, id-agent produces memorable word-based IDs at ~14 tokens with equivalent collision resistance.
I would be surprised if this actually helped with hallucinations. Happy to be proven wrong though, and this seems like an easy experiment to run: just take a tiny model (below 1B) and have it transcribe a couple thousand ids in both formats, then check where it made more mistakes
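A rough harness for that experiment might look like the sketch below. The transcribe argument is a stand-in for whatever model call you use, and the eight-word list is a hypothetical placeholder, not id-agent's dictionary. Nothing here measures real hallucination rates until an actual model is plugged in.

```python
import random
import uuid
from typing import Callable, Iterable, List

# Hypothetical tiny word list, for illustration only.
WORDS = ["amber", "birch", "cedar", "delta", "ember", "fjord", "grove", "harbor"]


def make_uuid_ids(n: int, rng: random.Random) -> List[str]:
    """Generate n random version-4 UUID strings from a seeded RNG."""
    return [str(uuid.UUID(int=rng.getrandbits(128), version=4)) for _ in range(n)]


def make_word_ids(n: int, rng: random.Random, words_per_id: int = 4) -> List[str]:
    """Generate n dash-separated word IDs from the placeholder word list."""
    return ["-".join(rng.choice(WORDS) for _ in range(words_per_id)) for _ in range(n)]


def error_rate(ids: Iterable[str], transcribe: Callable[[str], str]) -> float:
    """Fraction of IDs that do not come back exactly as they went in."""
    ids = list(ids)
    wrong = sum(1 for original in ids if transcribe(original).strip() != original)
    return wrong / len(ids)
```

With a real model, transcribe would wrap a prompt such as "repeat this ID exactly" and return the completion. Comparing error_rate for the two formats, ideally across several model sizes, gives the numbers this thread is asking for.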
If you're dealing with uid in -> uid out, where you're hoping to get the same uid out, intuitively the entropy would be greatly reduced anyways. Then the question becomes, are words conducive to keeping input->output consistent, given the way LLMs work (e.g. attention mechanism)? I could see it go either way, that's why I'm supporting the idea of running your experiment.
Your test with small models makes tons of sense. It would be interesting to graph the two approaches against model size and recency.
Why should an LLM even have these types of IDs anywhere in the prediction pipeline?
A random string of "-"-separated words will fail the validation check.
Furthermore, this could be compressed even further with a dynamic legend of every UUID in the context. So UUID@Bravo and UUID@Delta would be the actual symbols in the context but dynamically replaced when calling tools.
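The legend idea can be sketched in a few lines of Python. The Legend class, the alias format, and the short name list are all assumptions for illustration. Unknown aliases raise an error instead of passing through, so an invented ID is caught before any tool is called.

```python
import re

UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
ALIAS_RE = re.compile(r"UUID@[A-Za-z0-9]+")
# Hypothetical alias names; falls back to N<index> once these run out.
NAMES = ["Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf", "Hotel"]


class Legend:
    """Per-context mapping between full UUIDs and short aliases."""

    def __init__(self) -> None:
        self.to_alias = {}
        self.to_uuid = {}

    def shorten(self, text: str) -> str:
        """Replace every UUID in text with a stable alias before prompting."""
        def repl(match):
            u = match.group(0).lower()
            if u not in self.to_alias:
                n = len(self.to_alias)
                name = NAMES[n] if n < len(NAMES) else f"N{n}"
                alias = f"UUID@{name}"
                self.to_alias[u] = alias
                self.to_uuid[alias] = u
            return self.to_alias[u]
        return UUID_RE.sub(repl, text)

    def expand(self, text: str) -> str:
        """Restore real UUIDs in model output; raises KeyError on an unknown alias."""
        return ALIAS_RE.sub(lambda m: self.to_uuid[m.group(0)], text)
```

The agent framework would call shorten on everything entering the context and expand on tool-call arguments. The legend is state that has to live as long as the conversation does.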
1. LLMs might lack intrinsic entropy and reuse some UUIDs much more often.
2. Referential integrity is as important as collision resistance. An LLM must be able to reuse the correct id in the correct place.
On the other hand, using a dictionary for the ids helps with readability, but depending on the model's strengths, it might also add a confounder. After all, tokens that represent real words will probably influence the attention in a different way than random numbers.
> Where UUIDs cost ~23 tokens and get hallucinated by LLMs
How does this solve the hallucination problem?
Just removing the - from the example UUID takes it from 26 tokens to 18
You can use the .from method (https://github.com/vostride/id-agent/#idagentfrominput-opts) to convert a UUID or any text to an id-agent based ID, then do the LLM inference and convert it back to a UUID.
If true then that indeed seems like an improvement; I think I just need measurements of actual hallucinations. Calling hex random but a selection of words not random seems humanly biased? If anything, being random is good because it means there's no semantic influence. I'd think that words are more likely to be hallucinated, as certain words only follow certain contexts, which is less true for numbers.
And according to the table below, an id-agent with 120 bits of entropy (still 2 bits less than UUID) uses 17 tokens on average. So unless you purposefully want to reduce the entropy, this whole scheme is just as good as just removing the dashes from UUIDs. But that wouldn't make for a resume-worthy project (sorry, got a bit cynical there)
That said, for `createAliasMap`, don't you think you could create a deterministic mapping from and to UUIDs <-> word chains? That way, no additional state would be needed. [Might require fairly long word chains...]
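A stateless mapping of that kind is easy to sketch. This is not id-agent's createAliasMap, and the 256 pseudo-words below are a made-up list built from syllables so the example stays self-contained. Each word encodes one byte, so a UUID becomes a chain of 16 words; a 2048-word dictionary (11 bits per word) would still need 12 words for 128 bits.

```python
import uuid

# Hypothetical word list: 16 onsets x 16 codas = 256 distinct pseudo-words,
# so each word carries exactly one byte.
ONSETS = ["ba", "de", "fi", "go", "hu", "ka", "le", "mi",
          "no", "pu", "ra", "se", "ti", "vo", "wa", "zu"]
CODAS = ["b", "d", "f", "g", "k", "l", "m", "n",
         "p", "r", "s", "t", "v", "x", "z", "sh"]
WORDS = [onset + coda for onset in ONSETS for coda in CODAS]
INDEX = {word: i for i, word in enumerate(WORDS)}


def uuid_to_words(u: uuid.UUID) -> str:
    """Encode the 16 bytes of a UUID as 16 dash-separated words."""
    return "-".join(WORDS[b] for b in u.bytes)


def words_to_uuid(chain: str) -> uuid.UUID:
    """Inverse of uuid_to_words; raises KeyError on an unknown word."""
    return uuid.UUID(bytes=bytes(INDEX[word] for word in chain.split("-")))
```

No alias table is needed because the mapping is a pure function in both directions, which is the trade the comment describes: zero state in exchange for long chains.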
not that anyone should ever care; typos in random-looking ids are very real but already covered by human readable ids
besides, this is for a specific tokenizer