Motivation
Environment is currently very general and undifferentiated, which makes it less than optimal as a base for concrete environments.
Solution
One reasonable categorization for environments is SingleStep and MultiStep. SingleStep environments sample a question from a dataset and ask the LLM to answer it. Then, depending on the answer, they supply a reward. These are the environments we usually use for LLM reasoning, such as math and coding. They are special in that step is called only once before the episode ends.
MultiStep environments, on the other hand, do not end after one step. Round-based games like TicTacToe and Chess are examples of this kind of environment.
I propose implementing two abstract classes, SingleStepEnv and MultiStepEnv, from which single-step and multi-step environments can inherit. They would bring some structure to the environment creation process.
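To make the proposal more concrete, here is a minimal sketch of how the two base classes could look. Only the names SingleStepEnv and MultiStepEnv come from this issue. The method names sample and reward, the StepResult alias, the (observation, reward, done) return shape, and the two toy subclasses ExactMatchEnv and CountdownEnv are assumptions made up for illustration, not an existing API.

```python
import random
from abc import ABC, abstractmethod
from typing import Any, List, Optional, Tuple

# Assumed return shape for step(): (observation, reward, done).
StepResult = Tuple[Any, float, bool]


class SingleStepEnv(ABC):
    """Base class for environments whose episode ends after one step()."""

    def __init__(self) -> None:
        self._sample: Optional[Tuple[str, str]] = None

    @abstractmethod
    def sample(self) -> Tuple[str, str]:
        """Return a (question, reference_answer) pair from the dataset."""

    @abstractmethod
    def reward(self, reference: str, answer: str) -> float:
        """Score the LLM's answer against the reference."""

    def reset(self) -> str:
        self._sample = self.sample()
        return self._sample[0]

    def step(self, answer: str) -> StepResult:
        if self._sample is None:
            raise RuntimeError("Episode is over; call reset() first.")
        _, reference = self._sample
        self._sample = None  # a second step() without reset() is an error
        return None, self.reward(reference, answer), True


class MultiStepEnv(ABC):
    """Base class for environments that run until step() reports done."""

    @abstractmethod
    def reset(self) -> Any:
        """Start a new episode and return the first observation."""

    @abstractmethod
    def step(self, action: Any) -> StepResult:
        """Apply one action and return (observation, reward, done)."""


class ExactMatchEnv(SingleStepEnv):
    """Hypothetical single-step example: exact-match question answering."""

    def __init__(self, dataset: List[Tuple[str, str]]) -> None:
        super().__init__()
        self.dataset = dataset

    def sample(self) -> Tuple[str, str]:
        return random.choice(self.dataset)

    def reward(self, reference: str, answer: str) -> float:
        return 1.0 if answer.strip() == reference else 0.0


class CountdownEnv(MultiStepEnv):
    """Hypothetical multi-step example: the episode lasts `start` steps."""

    def __init__(self, start: int) -> None:
        self.start = start
        self.remaining = start

    def reset(self) -> int:
        self.remaining = self.start
        return self.remaining

    def step(self, action: Any) -> StepResult:
        self.remaining -= 1
        done = self.remaining <= 0
        return self.remaining, (1.0 if done else 0.0), done
```

In this sketch a single-step subclass only has to say how to sample a question and how to score an answer, while the base class enforces that step is called once per episode. A multi-step subclass keeps full control over reset and step, since game-like environments differ too much to share more than the interface.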
Alternatives
No response
Additional context
No response