Simulation result for a simple DQN (without n-step returns, experience replay, etc.) in the Webots virtual environment, using the Q-network that produced the best reward during training:
The observations for the E-puck robot were three continuous readings from ground sensors mounted on the robot, which gave higher values when the ground beneath them was white. The action space was discrete: turn clockwise, turn counter-clockwise, or keep straight and move forward. Reward history plot:
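With no replay buffer or n-step returns, the training loop reduces to one online TD update per observed transition, essentially online Q-learning with a neural network (no target network is shown here either). Below is a minimal sketch of that update, assuming a small PyTorch MLP with the 3 sensor readings as input and the 3 discrete actions as output; the layer sizes, learning rate, and discount are illustrative assumptions, not the exact network used in this experiment.

```python
import torch
import torch.nn as nn

# Illustrative Q-network: 3 ground-sensor readings in, 3 discrete actions out
# (turn CW, turn CCW, go straight). Hidden sizes are assumptions, not the trained net.
q_net = nn.Sequential(
    nn.Linear(3, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 3),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def online_dqn_update(obs, action, reward, next_obs, done, gamma=0.99):
    """One Q-learning update on a single transition (no replay buffer, no n-step)."""
    obs = torch.as_tensor(obs, dtype=torch.float32)
    next_obs = torch.as_tensor(next_obs, dtype=torch.float32)
    q_pred = q_net(obs)[action]
    with torch.no_grad():
        target = reward + gamma * (1.0 - float(done)) * q_net(next_obs).max()
    loss = nn.functional.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At evaluation time (the simulation shown above), the saved best-reward network
# simply picks the greedy action:
#   int(q_net(torch.as_tensor(obs, dtype=torch.float32)).argmax())
```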
I created another line-following task using the PyBullet package, to make everything Python based. The robot is a custom 4-wheeled robot. The observation space consists of 3 discrete (binary) readings indicating whether the ground at the left, center, and right of the robot's front is black or white. The robot moves at a constant speed, and the action space is a single continuous value from -1 to 1 radians controlling the steering. The TRPO algorithm is used:
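For reference, here is a rough sketch of how this task's interface could be wired up as a Gymnasium environment and trained with TRPO, e.g. via sb3-contrib (the post does not say which TRPO implementation was used, and the PyBullet stepping, robot model, and reward are only placeholders here):

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces
from sb3_contrib import TRPO  # assumption: sb3-contrib's TRPO; any TRPO implementation would do

class LineFollowEnv(gym.Env):
    """Sketch of the PyBullet line-following interface described above.
    The actual robot URDF, track, and reward shaping are not shown in this post,
    so those parts are left as placeholders."""

    def __init__(self):
        super().__init__()
        # 3 binary readings: left / center / right patch under the front is black (0) or white (1)
        self.observation_space = spaces.MultiBinary(3)
        # Single continuous steering command in radians, bounded to [-1, 1]
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        obs = np.zeros(3, dtype=np.int8)  # placeholder: reset PyBullet sim, then read the sensors
        return obs, {}

    def step(self, action):
        steering = float(np.clip(action[0], -1.0, 1.0))
        # placeholder: apply `steering` at constant forward speed, step PyBullet, read sensors
        obs = np.zeros(3, dtype=np.int8)
        reward = 0.0                      # placeholder: e.g. reward while the center sensor sees the line
        terminated, truncated = False, False
        return obs, reward, terminated, truncated, {}

env = LineFollowEnv()
model = TRPO("MlpPolicy", env, verbose=1)  # hyperparameters left at library defaults
model.learn(total_timesteps=100_000)
```

TRPO is a reasonable fit here because the steering action is a bounded continuous value, which on-policy policy-gradient methods handle naturally with a Gaussian policy.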