Discussion (24 Comments)
Interesting. Unfortunately Anthropic doesn't actually share their tokenizer, but my educated guess is that they might have made the tokenizer more semantically aware to make the model perform better. What do I mean by that? Let me give you an example. (This isn't necessarily what they did exactly; just illustrating the idea.)
Let's take the gpt-oss-120b tokenizer as an example. Here's how a few pieces of text tokenize (I use "|" here to separate tokens):
You have three different tokens encoding the same word (Kill, kill, <space>kill) depending on its capitalization and whether there's a space before it, you have separate tokens for the past tense, etc.

This is not necessarily an ideal way of encoding text, because the model must learn by brute force that these tokens are, indeed, related. Now, imagine if you'd encode these like this:
Notice that this makes much more sense now - the model only has to learn what "<capitalize>" is, what "kill" is, what "<space>" is, and what "ed" (the past-tense suffix) is, and it can compose those together. The downside is that it increases token usage.

So I wouldn't be surprised if this is what they did. Or, guess number two: they removed the tokenizer altogether, replaced it with a small trained model (something like the Byte Latent Transformer), and simply "emulate" the token counts.
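The normalization idea above can be sketched in a few lines. This is purely illustrative - the marker names (`<capitalize>`, `<space>`) and the toy suffix list are my own assumptions, not anything Anthropic has confirmed:

```python
# Hypothetical sketch of a morphology-aware pre-tokenizer: instead of
# distinct opaque tokens for "Kill", "kill", " kill", "killed", emit
# composable marker tokens plus a shared base form.
SUFFIXES = ("ed", "ing", "s")  # toy suffix list, not a real morphology model

def normalize(word: str) -> list[str]:
    tokens = []
    if word.startswith(" "):          # leading space becomes its own marker
        tokens.append("<space>")
        word = word[1:]
    if word[:1].isupper():            # capitalization becomes its own marker
        tokens.append("<capitalize>")
        word = word.lower()
    for suf in SUFFIXES:              # peel off a known suffix, if any
        if word.endswith(suf) and len(word) > len(suf) + 1:
            tokens.extend([word[: -len(suf)], suf])
            return tokens
    tokens.append(word)
    return tokens
```

With this scheme "Kill", "kill", " kill", and " killed" all share the base token "kill", so the relatedness is explicit rather than something the model has to infer.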
[1] https://arxiv.org/pdf/2507.06378
[2] https://pieter.ai/bpe-knockout/
Also, think about how an LLM would handle different languages.
See embedding models.
> they removed the tokenizer altogether
This is an active research topic, no real solution in sight yet.
Case-sensitive language models have been a thing since way before neural language models. I was using them with boosted-tree models at least ten years ago, and even my Java NLP tool did this twenty years ago (damn!). There is no novelty there, of course - I based that on PG's "A Plan for Spam".
See for example CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.fe...
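To illustrate the case-sensitivity point without pulling in scikit-learn, here is a minimal pure-Python analogue of CountVectorizer's `lowercase` flag (the real class does far more than this):

```python
# Minimal analogue of CountVectorizer(lowercase=...): count word features
# with or without case folding.
import re
from collections import Counter

def count_tokens(text: str, lowercase: bool = True) -> Counter:
    if lowercase:
        text = text.lower()
    return Counter(re.findall(r"\b\w+\b", text))

# "Kill" and "kill" collapse into one feature only when lowercase=True;
# with lowercase=False they remain separate features.
```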
The bitter lesson says that you are much better off just adding more data and learning the tokenizer and it will be better.
It's not impossible that the new Opus tokenizer is based on something learnt during Mythos pre-training (maybe it *is* the learned Mythos tokenizer?), and it seems likely that the Mythos pre-training run is the largest ever in terms of data.
Putting an inductive bias in your tokenizer just seems like a terrible idea.
I just ran an experiment: I took a word and asked models (ChatGPT, Gemini, and Claude) to explode it into parts. The caveat is that it could be either root + suffix + ending or root + ending. None of them realized this duality; each took one possible interpretation.
Any such approach to tokenizing assumes a context-free(-ish) grammar, which is just not the case with natural languages. "I saw her duck" (and other famous examples) is not uniquely tokenizable without broader context, so either the tokenizer has to be a model itself or the model has to collapse the meaning space.
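The point can be made concrete with a toy greedy longest-match tokenizer over a made-up vocabulary: because it sees only the surface string, "duck" gets the same token id whether it means the bird or the action, so the disambiguation burden falls entirely on the model:

```python
# Toy greedy longest-match tokenizer over a fixed (made-up) vocabulary.
# It is context-free by construction: the same string always yields the
# same token ids, regardless of the intended meaning.
VOCAB = {"i": 0, "saw": 1, "her": 2, "duck": 3, " ": 4}

def tokenize(text: str) -> list[int]:
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            piece = text[i:j].lower()
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i:]!r}")
    return ids
```

Whether the speaker meant the pet or the dodge, `tokenize("I saw her duck")` produces identical output, which is exactly the collapse being described.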
Let's say you spend $100 on tokens, how can you be sure it's "too much" compared to the value you're getting?
I would expect that - once you're out of the experimentation phase - you should be able to say whether or not spending $100 returns at least $100 in business value.
Right?
I'm still with them because the model is good, but yes, I'm noticing my limits burning up somewhat faster on the 100 USD tier; I bet the 20 USD tier is even more useless.
I wouldn't call it a rugpull, since it seems like there might be good technical reasons for the change, but at the same time we won't know for sure unless they COMMUNICATE that to us. I feel like what's missing is a technical blog post that tells us more about the change and the tokenizer, although I fear this won't happen due to wanting to keep "trade secrets" or whatever (the unfortunate consequence of which is making the community feel like they're being rugpulled).
See for example the price difference between taking a taxi and taking the bus, or between hiring a real lawyer vs. your friend at the bar who will give his uninformed opinion for a beer.
You'll be better off using Qwen 3.6 Plus through the Alibaba coding plan.
Is there a quality increase from this change, or is it a money grab?
Comparisons are still ongoing but I have already seen some that suggest that Opus 4.7 might on average arrive at the answer with fewer tokens spent, even with the additional tokenizer overhead.
So, no, not a money grab.
Note that they're the only provider which doesn't make their tokenizer available offline as a library (i.e. the only provider whose tokenizer is secret).