Discussion (8 Comments)
This is very interesting:
"Empirical Validation. While we cannot verify these theoretically, we evaluate each empirically. We use the Qwen-2.5-7B-Instruct model (Hui et al., 2024) as the base policy and the ToolAlpaca dataset (Tang et al., 2023). In this benchmark, the model receives a tool-API specification and a user request, and must identify the correct tool call. Without demonstrations, the base model solves only 42% of examples. When provided with the appropriate demonstration c for each prompt x , the teacher achieves a 100% success rate. To further test reward proximity, we manually inspected 50 teacher reasoning traces. In all cases, not only were the final tool calls correct, but the intermediate chain-of-thought was valid and semantically grounded. This suggests that the teacher is reconstructing a correct reasoning process rather than merely copying the expert output. These observations provide evidence for the first requirement, that the demonstration-conditioned model behaves as an optimal policy."
I find the choice of the words "enable" in the title and "establishing" at the end of the abstract to be particularly jarring.
Gemini tells me it's the probability of the next token for an LLM. Okay then.
This framing has been around for several years; it’s the framing that enables RLHF and RLVR. RLHF itself is quite old, it goes back to at least InstructGPT, before the original ChatGPT.
“Policy” here refers to a probability distribution, i.e. a function that, given some context, assigns probabilities to possible next tokens. It's what a model’s behavior looks like when viewed through an RL lens.
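To make that concrete, here is a minimal sketch of reading off the policy's next-token distribution (my illustration, not anything from the paper; it assumes a Hugging Face causal LM, and the model name is just the one the paper happens to use):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

inputs = tok("User: what's the weather in Paris?\nAssistant:", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# The "policy": a probability distribution over the next token given the context.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
print([(tok.decode(i), p.item()) for i, p in zip(top.indices, top.values)])
```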
The paper discusses “on-policy” and “off-policy” training, a distinction that is central to its idea.
Off-policy training is what happens in standard supervised fine-tuning (SFT): the model is trained on examples that were produced independently of the model. This means that the examples have a different distribution than what the model produces. This can have a negative effect on previously learned capabilities.
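Roughly, in code (again my illustration, not the paper's setup), off-policy SFT is just teacher-forced cross-entropy against an external demonstration, so the gradient pulls the model toward data it would not have sampled itself. The demonstration text and model name below are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Demonstration written by someone (or something) other than the model.
demo = tok("User: what's the weather in Paris?\nAssistant: get_weather(city='Paris')",
           return_tensors="pt")

out = model(**demo, labels=demo["input_ids"])  # teacher-forced cross-entropy
out.loss.backward()                            # gradient pulls model toward the external data
optimizer.step()
```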
On-policy training (in this context) uses data generated by the model itself. It samples the model’s own outputs, scores them with whatever reward signal is being trained for, and updates the model based on those scores. This reinforces aspects of the model's own pretrained behavior, so it is a "gentler" way to change the model's behavior. The authors claim that this reduces "catastrophic forgetting" and other negative consequences of SFT.
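A toy REINFORCE-style sketch of that loop (my illustration, not the paper's algorithm): sample from the current policy, score the sample with some verifier, and push up the log-probability of rewarded samples. The substring check here is a stand-in for whatever reward function is actually used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

prompt = tok("User: what's the weather in Paris?\nAssistant:", return_tensors="pt")
prompt_len = prompt["input_ids"].shape[1]

# 1) Sample from the model's own policy.
sample = model.generate(**prompt, max_new_tokens=20, do_sample=True)
completion = sample[:, prompt_len:]

# 2) Score the sample. Hypothetical verifier: did it produce the right tool call?
reward = 1.0 if "get_weather" in tok.decode(completion[0]) else 0.0

# 3) Reinforce: log-prob of the model's own sample under the current policy.
logits = model(sample).logits[:, prompt_len - 1:-1]          # logits that predict the completion
logp = torch.log_softmax(logits, dim=-1).gather(-1, completion.unsqueeze(-1)).sum()
(-reward * logp).backward()   # increase probability of rewarded samples
optimizer.step()
```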
That would be incorrect. My other reply attempts to address this.