Discussion (4 Comments) | Read Original on HackerNews

zozbot234 about 1 hour ago
> Barely amortising at the bottom. At small batch each new token added to the batch tends to activate fresh experts

Whether this is true depends on what you mean by small. In general, AIUI you don't need more than a handful of experts to get a meaningful probability of overlap. DeepSeek V4 Pro is an exceptionally sparse model, and even there you start to get meaningful overlap at a batch size of 5 or more. Moreover, in general you can think of the average number of activated experts for a batch of size b as n(1 - (1 - k/n)^b), where k is the number of active experts and n the total number of experts. For DeepSeek V4, k=6, and n is 256 for Flash and 384 for Pro. (The sampling is repeated per layer, not just per token.)
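For anyone who wants to plug numbers into the formula from the comment above, here is a minimal Python sketch. It assumes routing is uniform and independent across tokens, which real routers are not, so treat the results as a rough estimate for a single MoE layer. The function names and the batch sizes in the loop are illustrative choices, not anything from the article or the model code; k and n are the values quoted in the comment.

```python
def expected_active_experts(n: int, k: int, b: int) -> float:
    """Expected number of distinct experts hit in one layer by a batch of b tokens.

    Assumption: each token independently picks k of n experts uniformly at
    random, so a given expert is missed by all b tokens with probability
    (1 - k/n)**b.
    """
    return n * (1.0 - (1.0 - k / n) ** b)


def expected_overlap(n: int, k: int, b: int) -> float:
    """Expert activations saved by sharing: b*k requested minus distinct experts hit."""
    return b * k - expected_active_experts(n, k, b)


if __name__ == "__main__":
    # k and n as quoted in the comment; batch sizes are arbitrary examples.
    configs = {"Flash": 256, "Pro": 384}
    k = 6
    for name, n in configs.items():
        for b in (1, 2, 5, 16, 64):
            active = expected_active_experts(n, k, b)
            overlap = expected_overlap(n, k, b)
            print(f"{name:5s} b={b:3d}  active={active:7.2f}  overlap={overlap:7.2f}")
```

Since the sampling is repeated per layer, the per-layer figure applies to every MoE layer separately; the shared expert mentioned downthread is not modelled here.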

somnial 11 minutes ago
https://fergusfinn.com/blog/economics-of-speculative-decodin...

Good point though. Plus, for DeepSeek, the shared expert increases the overlap slightly.

yorwba 41 minutes ago
The article includes that formula too and takes the overlap into account in its calculations.

maherbeg 35 minutes ago
I wonder if new models will be trained with speculative decoding as a core feature, so that fewer experts are needed per pass.