The PPO formula in the paper is wrong #37

Open
itmorn opened this issue Feb 14, 2025 · 0 comments
itmorn commented Feb 14, 2025

Image
It should be changed to:
Image

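Reason: the expectation is taken over π_θ_old, so π_θ cannot be the numerator when computing the KL divergence. Also, the ref model should be a concept that exists only in GRPO (it constrains how far the current model moves from the model at the start of the iteration); PPO should have only the old model, so π_θ_old should be in the numerator.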
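The direction argument can be checked numerically. The sketch below is only an illustration, not the paper's training code: `pi_old` and `pi_new` are hypothetical toy policies over three actions, and `kl` and `mc_log_ratio` are helper names made up for this example. It shows that averaging a log-ratio over samples drawn from π_θ_old estimates the KL divergence that has π_θ_old in the numerator.

```python
import math
import random

def kl(p, q):
    """Exact KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mc_log_ratio(p, q, n=100_000, seed=0):
    """Monte Carlo estimate of E_{x~p}[log p(x) - log q(x)]."""
    rng = random.Random(seed)
    xs = rng.choices(range(len(p)), weights=p, k=n)  # samples come from p
    return sum(math.log(p[x] / q[x]) for x in xs) / n

# Hypothetical toy policies (illustrative numbers, not from the paper).
pi_old = [0.9, 0.05, 0.05]
pi_new = [0.5, 0.3, 0.2]

forward = kl(pi_old, pi_new)   # KL(pi_old || pi_new): pi_old in the numerator
reverse = kl(pi_new, pi_old)   # KL(pi_new || pi_old): pi_new in the numerator
estimate = mc_log_ratio(pi_old, pi_new)  # expectation taken over pi_old

print(f"KL(old||new)={forward:.4f}  KL(new||old)={reverse:.4f}  MC={estimate:.4f}")
```

With samples drawn from `pi_old`, the estimate tracks `forward` rather than `reverse`, which is the point made above about which policy belongs in the numerator.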

See InstructGPT for reference:
Image
