Vector API - design considerations #32
An additional change would be when …

Hi, would it make sense for …
Thanks for your thoughts on this @MarcCote! Sven here from the RLlib team :)

On the autoreset issue you mentioned: I would prefer autoreset (similar to how DeepMind has always done it), but then the per-episode seeding (and option'ing) would not be accessible anymore. If per-episode seeding and option'ing are really essential for users, then we won't be able to avoid having an additional …
Thanks for your thoughts @sven1977. I'm unfamiliar with how DeepMind has achieved environment vectorisation; could you provide a short summary of the difference between DeepMind's approach and OpenAI's approach (or EnvPool)?

For being able to set the autoreset seed, this is an interesting idea that I hadn't heard of before. I don't think we will include this in the … The current …
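To make the trade-off concrete, here is a minimal sketch of per-episode seeding with the existing synchronous vector API (assuming `gym.vector.SyncVectorEnv` and list-based seeds): an explicit `reset` call can take one seed per sub-environment, but once sub-environments autoreset inside `step` there is no obvious place to supply a seed or options for the new episode.

```python
import gymnasium as gym

# Build a simple synchronous vector env with 4 CartPole instances.
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(4)]
)

# Per-sub-environment seeding works on an explicit reset call...
obs, infos = envs.reset(seed=[0, 1, 2, 3])

# ...but if a sub-environment terminates and autoresets inside `step`,
# the new episode starts without any user-provided seed or options.
actions = envs.action_space.sample()
obs, rewards, terminations, truncations, infos = envs.step(actions)

envs.close()
```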
Discussion on the autoreset behaviour

@RedTachyon @vwxyzjn @araffin Would be interested in your thoughts.

Considering a very simple training loop:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()
for training_step in range(1_000):
    action = policy(obs)
    next_obs, reward, terminated, truncated, next_info = env.step(action)
    replay_buffer.append((obs, info, action, next_obs, reward, terminated, truncated))

    if terminated or truncated:
        obs, info = env.reset()
    else:
        obs, info = next_obs, next_info
```

The current `AutoResetWrapper` changes the loop to:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
env = gym.wrappers.AutoResetWrapper(env)
obs, info = env.reset()
for training_step in range(1_000):
    action = policy(obs)
    next_obs, reward, terminated, truncated, next_info = env.step(action)

    if terminated or truncated:
        replay_buffer.append((obs, info, action, next_info["final_observation"], reward, terminated, truncated))
    else:
        replay_buffer.append((obs, info, action, next_obs, reward, terminated, truncated))

    obs, info = next_obs, next_info
```

I propose the following change to the autoreset behaviour:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
env = gym.experimental.wrappers.AutoReset(env)
obs, info = env.reset()
for training_step in range(1_000):
    action = policy(obs)
    next_obs, reward, terminated, truncated, next_info = env.step(action)
    replay_buffer.append((obs, info, action, next_obs, reward, terminated, truncated))

    obs, info = next_obs, next_info
```

Advantages

Disadvantage
I started trying to implement a new, more sane vector API. And then I quickly realized that it is, indeed, as messy as I could have expected, so the code will have to wait for some time.
Here I want to dump my thoughts about how this whole thing should/could look so that we have a discussion going.
Main desired outcome: we can use the vector API to easily create envs vectorized through either simple vectorization, or jax vmapping (or any other fast mechanism). This can give us huge performance improvements for some envs without relying on additional external libraries. For other envs, we default to Sync/Async/EnvPool?
Current situation: vectorization is only possible via Sync/Async, which is slow af, but very general. EnvPool (not officially supported) only works with some envs, but is faster. Other existing options are generally similar to Sync/Async, with their own quirks (e.g. ray in rllib, or the custom implementation in SB3)
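To make the "fast mechanism" case concrete, here is a minimal sketch (the functional env is hypothetical, not an existing Gymnasium environment) of how a purely functional, JAX-based env gets a batched step essentially for free via `jax.vmap`, which is the kind of native vectorization the new API should be able to expose:

```python
import jax
import jax.numpy as jnp

# Hypothetical functional single-env step: the state is one float and the
# episode terminates once it drifts outside [-1, 1].
def step(state, action):
    next_state = state + 0.1 * action
    reward = -jnp.abs(next_state)
    terminated = jnp.abs(next_state) > 1.0
    return next_state, reward, terminated

# A batched step over N independent sub-environments, obtained by vmapping
# the single-env function instead of spawning N processes.
batched_step = jax.vmap(step)

states = jnp.zeros(8)   # 8 sub-environment states
actions = jnp.ones(8)   # one action per sub-environment
states, rewards, terminateds = batched_step(states, actions)
```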
The main complication is wrappers. If an environment provides its own optimized vectorized version, then we can't apply single-env wrappers to it. A nice solution would be an automatic conversion from a `Wrapper` to a `VectorWrapper`, but that seems either very tricky or impossible to do in the general case. Fortunately, many actual wrappers don't need that "general case" treatment.

The hope I see for this is switching to lambda wrappers, at least for some of the existing wrappers. ActionWrappers, ObservationWrappers and RewardWrappers can in principle be stateful, which requires some workarounds to map them over vectorized envs. With lambda wrappers, we can literally just do a map.
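A sketch of why pure lambda-style wrappers compose so easily with vectorization (the function here is illustrative, not part of the proposed API): a stateless observation transform written for a single env can be applied to a whole batch either by broadcasting or by a plain map.

```python
import numpy as np

# Single-env observation transform: a pure function with no wrapper state.
def normalize_observation(obs, low=-1.0, high=1.0):
    return (obs - low) / (high - low)

batched_obs = np.random.uniform(-1.0, 1.0, size=(8, 4))  # 8 sub-envs, 4-dim obs

# Option 1: rely on NumPy broadcasting over the leading batch axis.
normalized = normalize_observation(batched_obs)

# Option 2: an explicit map, for transforms that don't broadcast naturally.
normalized_mapped = np.array([normalize_observation(obs) for obs in batched_obs])
```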
An element that I think will be crucial is different levels of optimization - existing third-party environments and wrappers should work exactly the same way, with the clunky subprocess vecenv approach, unless they do a few extra things to opt in to the improvements.
Another rough edge might be autoreset. Currently this concept is barely present in gym: it's an optional wrapper for single envs, and in that scope it works fine. In the vectorized case, it's more important and a bit more complicated. If we don't have some sort of autoreset by default in vector envs, that makes them borderline useless for many envs (consider CartPole, where the first env instance happens to take 10 steps and the second takes 100 steps - if we only reset after both are terminated, we just lost 45% of the data).
While a vectorized autoreset is trivial with a subprocess-like vector env, that's not the case with e.g. numpy/jax acceleration. While I can see some hacks that maybe would kinda work to add it in some of these cases via wrapper, we might just have to add a requirement that the environment handles autoreset itself. Note that this wouldn't be a breaking change in env design - envs that don't have built-in autoreset can still use the standard vectorization. But if you want to use vectorized wrappers and the more efficient vectorization paradigm, you need to add it.
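A rough sketch of what "the environment handles autoreset itself" could mean for a NumPy-backed batched env (the class is a toy example, not a Gymnasium interface): terminated sub-environments are re-initialized inside `step`, so no sub-environment ever sits idle waiting for the others.

```python
import numpy as np

class BatchedLineWorld:
    """Toy batched env: N agents move on a line, and an episode terminates
    when its agent leaves [-1, 1]. Autoreset is built into step()."""

    def __init__(self, num_envs: int):
        self.num_envs = num_envs
        self.state = np.zeros(num_envs)

    def reset(self):
        self.state = np.zeros(self.num_envs)
        return self.state.copy()

    def step(self, actions):
        self.state += 0.1 * actions
        rewards = -np.abs(self.state)
        terminated = np.abs(self.state) > 1.0
        # Built-in autoreset: re-initialize only the terminated sub-envs.
        self.state[terminated] = 0.0
        return self.state.copy(), rewards, terminated
```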
Finally, a question is - how much can we break? I'm not aware of any significant usage of `gym.vector`, though I know it is used at least sometimes. Ideally I'd like to keep the outside API as similar as possible, perhaps even exactly the same (with additional capabilities). But can we change some of the internal semantics that are in principle exposed to the public, but are also just one of the few remaining relics of the past? As I recall, we want to do the vector revamp before 1.0, which is good, because after 1.0 we have to be very careful about breaking stuff.

Below I'm including a braindump of my semi-structured thoughts on this, just to have it recorded here with some additional details (most of this was mentioned above):
- If lambda wrappers are defined for a plain `gym.Env`, we can actually apply them to the underlying `gym.Env`, to which individual wrappers are applied
- The `observation`, `reward` and `action` functions of lambda wrappers can be mapped over a batch, e.g. via `vmap`, `np.array(map)` or `np.array([... for o in obs])`, depending on the optimization level
- Environments could declare what they support, e.g. `self.optimization: Literal["numpy"] | Literal["jax"] | None` (see the sketch below)
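A minimal sketch of how such an optimization flag might be consumed when vectorizing a lambda observation function (the dispatch helper and its name are hypothetical, not an agreed design):

```python
from typing import Callable, Literal, Optional

import numpy as np

def vectorize_observation_fn(
    func: Callable,
    optimization: Optional[Literal["numpy", "jax"]],
) -> Callable:
    """Return a batched version of a single-env observation function,
    chosen according to the env's declared optimization level."""
    if optimization == "jax":
        import jax
        return jax.vmap(func)   # batched via vmap
    if optimization == "numpy":
        return func             # assume the function broadcasts over the batch axis
    # Fallback for unoptimized envs: a generic per-sub-env loop.
    return lambda batch: np.array([func(obs) for obs in batch])
```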
Issues in the meantime:
Questions: