Uses PPO to learn Snake. PPO builds on TRPO (which I haven't implemented), but replaces TRPO's Kullback-Leibler divergence constraint with a clipped surrogate objective, which makes PPO more compute efficient. This implementation has a better memory buffer than the A2C implementation (see the `PPOMemory` class, versus a simple `deque`), which makes it possible to implement Generalised Advantage Estimation (GAE). This matters because PPO, being on-policy, copes less well with sparse rewards than off-policy methods. The actor and critic are split into separate networks, which lets us apply clipping to the actor while leaving the critic unconstrained. We also run multiple epochs of updates on the same rollout data for better sample efficiency.
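A minimal sketch of such a rollout buffer with GAE is shown below. It is illustrative only: the field names, `store`/`generate_batches` methods, and default `gamma`/`lam` values are assumptions, not necessarily those used by the `PPOMemory` class in this repo.

```python
import numpy as np

class PPOMemory:
    """Illustrative rollout buffer for PPO with Generalised Advantage Estimation."""

    def __init__(self, batch_size):
        self.states, self.actions, self.log_probs = [], [], []
        self.values, self.rewards, self.dones = [], [], []
        self.batch_size = batch_size

    def store(self, state, action, log_prob, value, reward, done):
        # One transition per environment step.
        self.states.append(state)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        self.values.append(value)
        self.rewards.append(reward)
        self.dones.append(done)

    def compute_gae(self, last_value=0.0, gamma=0.99, lam=0.95):
        # GAE: exponentially weighted sum of TD errors, swept backwards in time.
        values = np.append(np.array(self.values, dtype=np.float32), last_value)
        advantages = np.zeros(len(self.rewards), dtype=np.float32)
        gae = 0.0
        for t in reversed(range(len(self.rewards))):
            mask = 1.0 - float(self.dones[t])          # zero out across episode ends
            delta = self.rewards[t] + gamma * values[t + 1] * mask - values[t]
            gae = delta + gamma * lam * mask * gae
            advantages[t] = gae
        return advantages

    def generate_batches(self):
        # Shuffled minibatch indices, reused across the multiple update epochs.
        idx = np.random.permutation(len(self.states))
        for start in range(0, len(idx), self.batch_size):
            yield idx[start:start + self.batch_size]

    def clear(self):
        # PPO is on-policy, so the buffer is emptied after every update.
        self.__init__(self.batch_size)
```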
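The update step itself might look roughly like the following, assuming PyTorch, a separate actor and critic each with its own optimiser, and an actor that returns a `torch.distributions.Categorical`. The function name, hyperparameter values, and object names here are placeholders rather than the exact ones used in this repo.

```python
import numpy as np
import torch

def learn(actor, critic, memory, actor_opt, critic_opt, n_epochs=4, clip_eps=0.2):
    advantages = torch.tensor(memory.compute_gae())
    states = torch.tensor(np.array(memory.states), dtype=torch.float32)
    actions = torch.tensor(memory.actions)
    old_log_probs = torch.tensor(memory.log_probs)
    values = torch.tensor(memory.values)
    returns = advantages + values                      # regression targets for the critic

    for _ in range(n_epochs):                          # reuse the same rollout several times
        for batch in memory.generate_batches():
            batch = torch.tensor(batch, dtype=torch.long)

            dist = actor(states[batch])                # assumed to return a Categorical
            new_log_probs = dist.log_prob(actions[batch])
            ratio = (new_log_probs - old_log_probs[batch]).exp()

            # Clipped surrogate objective: the clip replaces TRPO's KL constraint.
            surr1 = ratio * advantages[batch]
            surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages[batch]
            actor_loss = -torch.min(surr1, surr2).mean()

            # The critic stays unconstrained: a plain MSE fit to the returns.
            critic_loss = ((critic(states[batch]).squeeze(-1) - returns[batch]) ** 2).mean()

            actor_opt.zero_grad()
            actor_loss.backward()
            actor_opt.step()

            critic_opt.zero_grad()
            critic_loss.backward()
            critic_opt.step()

    memory.clear()
```

Splitting the networks like this means the clipping only ever touches the policy gradient, while the critic is free to move as far as its regression loss demands.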