Discussion (29 Comments), originally on HackerNews
Because the cost to OpenAI to make an architectural shift is far greater than the cost to a new lab to try something different, providing details is usually a net benefit for recruiting, building trust, getting acquired, etc. The lack of details is a poor business decision because it makes them seem untrustworthy.
I'm not advocating that they should open source their model, but there is already so much noise in the space and many bad papers that being cagey is a poor strategy for winning over talent, developers, etc.
- Guided window attention. Predict where to attend, but within a fixed window. If you do this to just the token/vocab you can keep effectively unlimited context and perfect recall. (Yes, I can do that. There is a trick to teaching it how to predict position. This also immediately opens up other crazy things like NN memory.)
- Efficient fixed-state-size models. Not a recurrent mechanism, because that breaks training; parallelizable like transformers, but with a fixed-size state instead of unbounded attention. Pick a reasonable amount of state and it is amazingly good, since it doesn't need to keep separating wheat from chaff in context. (Yes, it is possible to build this; I have. It works. This also opens up real streamed models. I have a true infinite-context streamed model I toy with locally that I am getting to take audio/text in and produce audio/text out in real time.)
Put those together and you have O(1) token generation, infinite context and perfect recall. It is a whole new world of models. You can interact with a model until you have it in the state you want, then save its state and use that as if it were your system prompt. Batches pack perfectly, so inference is massively more efficient. Training is massively more efficient. Transformers and unlimited-attention models are a dead end. But how do you make money on this as an independent researcher? If I release the Two Weird Tricks this is all based on, I get zip and the big players get even more tech for free. If I keep it all secret, I get zip and eventually the tricks will be figured out. (Yes, a little frustration here.) If anyone wants the model architecture of the future, make me an offer :)
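For readers who want something concrete, here is a minimal sketch of what "fixed-size state, O(1) per token, save the state and reuse it like a system prompt" could look like. It is not the commenter's undisclosed method. It swaps in a generic linear-attention-style update (a decayed sum of key/value outer products), and every name in it (`FixedStateModel`, `_embed`, `decay`) is hypothetical. Because old information decays, it does not demonstrate the perfect-recall claim, only the constant-size state and the snapshot idea.

```python
import json
import math


class FixedStateModel:
    """Toy sequence model whose memory is a fixed dim x dim matrix (illustration only)."""

    def __init__(self, dim=4, decay=0.9):
        self.dim = dim
        self.decay = decay
        self.state = [[0.0] * dim for _ in range(dim)]

    def _embed(self, token, offset):
        # Hypothetical stand-in for learned projections: a fixed sinusoidal map.
        return [math.sin((token + offset + 1) * (i + 1)) for i in range(self.dim)]

    def step(self, token):
        """Consume one token id; the cost depends on dim, not on tokens seen so far."""
        key = self._embed(token, 0)
        value = self._embed(token, 7)
        for i in range(self.dim):
            for j in range(self.dim):
                # Decay the old state, then add the outer product of key and value.
                self.state[i][j] = self.decay * self.state[i][j] + key[i] * value[j]
        # Read out with the key reused as the query (a simplification).
        return [sum(key[i] * self.state[i][j] for i in range(self.dim))
                for j in range(self.dim)]

    def save(self):
        """Serialize the whole memory; its size never grows with context length."""
        return json.dumps(self.state)

    def load(self, blob):
        self.state = json.loads(blob)


model = FixedStateModel()
for token in range(1000):
    model.step(token)

snapshot = model.save()      # reusable like a saved "system prompt"
restored = FixedStateModel()
restored.load(snapshot)
same_output = restored.step(42) == model.step(42)
```

After 1000 tokens the state is still a 4 x 4 matrix, and a fresh model loaded from the snapshot continues exactly where the original left off.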
My guess is that they're angling for an acquisition.
This is what I've thought was going to happen ever since they publicized their efforts. They probably don't have the money to train large models themselves, might as well get a nice chunk of change by being acquired by someone who already has said large models running.
Local inference is insanely fast on my M4 Pro MBP though, so I can understand where you're coming from, but I don't need it too much faster. I still need time to review, test, review and provide feedback to the model. Fast is okay I guess for true vibe coding.
> At 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2.
Awesome stuff. Solving context at the model architecture layer rather than trying to bolt on extra memory is the right direction IMO.
If the results persist from 1M to 12M, why not 24M or 48M? Sounds almost too good to be true.
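To put the context lengths above in perspective, here is a rough sketch of how the dense-attention baseline grows. It only counts the (query, key) pairs a dense causal attention layer has to score, and it ignores everything else in the model, so treat it as an order-of-magnitude illustration and not as a benchmark. The function name `causal_pairs` is made up for this example.

```python
def causal_pairs(n_tokens):
    """Number of (query, key) pairs scored by dense causal attention over n_tokens."""
    return n_tokens * (n_tokens + 1) // 2


base = causal_pairs(1_000_000)
for n in (1_000_000, 12_000_000, 24_000_000, 48_000_000):
    pairs = causal_pairs(n)
    # Ratio relative to the 1M-token case shows the quadratic growth.
    print(f"{n:>11,} tokens: {pairs:.3e} pairs, about {pairs / base:,.0f}x the 1M cost")
```

Going from 1M to 12M tokens multiplies the dense pair count by roughly 144, and each further doubling roughly quadruples it, which is why any sub-quadratic scheme looks better and better as the context grows.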
With back-of-the-napkin math from inside my head, that'd be something like 0.5 to 1 million LOC, depending on language and code density. You could just fold the entire codebase into one prompt if it's a small one, that'd be neat :)
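The napkin math can be reproduced in a few lines. This sketch assumes the commenter had the 12M-token figure from the thread in mind, and it assumes 12 to 24 tokens per line of code. Both numbers are assumptions chosen to match the 0.5 to 1 million LOC estimate, not measurements of any tokenizer.

```python
def lines_of_code(context_tokens, tokens_per_line):
    """Rough number of source lines that fit in a context window."""
    return context_tokens // tokens_per_line


CONTEXT_TOKENS = 12_000_000  # assumption: the 12M context length mentioned in the thread

for tokens_per_line in (12, 24):  # assumption: varies with language and code density
    loc = lines_of_code(CONTEXT_TOKENS, tokens_per_line)
    print(f"{tokens_per_line} tokens/line: about {loc:,} LOC")
```

Denser code (fewer tokens per line) lands at the upper end of the estimate, and verbose code at the lower end.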
FlashAttention-2 hasn't been used for at least two years.
This architecture would have been a massive improvement 3 years ago, but it is a ~solved~ problem IMO.
I get why they aren't disclosing all the details, but it seems more hype-train-esque to me for this moment. I don't disagree that this could be big.