A (not so) stupid example of DQN playing filetto (or tris or tic-tac-toe)

Preparation

To run filettoQ, you need to install numpy and tensorflow. If you want an in-depth analysis of your network's training, you will also need wandb and a valid API key.

Execution

Run python filettoQ.py from the command line, from Visual Studio, or from any other development environment. The application has a switch for initialising epsilon, which can be useful if you load the weights of a previously trained network and want to start with less exploration.

Overall notes

Due to inexperience, writing the code was much more complex than expected. I made many mistakes, possibly all of those in the "DQN beginner's manual":

  • errors in the formula for calculating the return
  • errors in the simulator
  • network too large
  • gamma decay too fast

I also implemented some improvements over time that helped DQN converge:

  • minibuffers, which proved to be quite a game changer
  • reward normalisation: rather than using large values such as ±100 for a win or a loss, I set -1 for a forbidden move, 0.5 for victory, -0.5 for loss, and 0.0 otherwise. Rewards that are too high were possibly introducing instability.
  • an experience buffer storing 10000 samples
  • weight updates every 32 training cycles rather than at every cycle

I didn't use a soft (smooth) weight update between Q and Q*, but the network converges anyway. A rough sketch of these training mechanics is shown below.
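The sketch below shows, under several assumptions, how these pieces could fit together: a 10000-sample experience buffer, minibatch sampling, and a hard copy of Q into Q* every 32 cycles. The stand-in network, the discount factor and the target formula are illustrative guesses, not the actual code of filettoQ.py.

```python
import random
from collections import deque
import numpy as np
import tensorflow as tf

# Stand-in networks just to make the snippet self-contained; the real
# architecture is described in the "Definition of Q, Q*" section below.
def make_net():
    net = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(18,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(9, activation='linear'),
    ])
    net.compile(optimizer='adam', loss='mse')
    return net

q_net = make_net()                     # Q, trained at every cycle
target_net = make_net()                # Q*, used to compute targets
target_net.set_weights(q_net.get_weights())

replay_buffer = deque(maxlen=10000)    # experience buffer of 10000 samples
BATCH_SIZE = 32
UPDATE_EVERY = 32                      # hard copy of Q into Q* every 32 cycles
GAMMA = 0.9                            # discount factor (assumed value)

def train_step(cycle):
    if len(replay_buffer) < BATCH_SIZE:
        return
    batch = random.sample(replay_buffer, BATCH_SIZE)        # minibatch sampling
    states = np.array([s for s, a, r, s2, done in batch])
    next_states = np.array([s2 for s, a, r, s2, done in batch])
    targets = q_net.predict(states, verbose=0)
    next_q = target_net.predict(next_states, verbose=0)     # targets from frozen Q*
    for i, (s, a, r, s2, done) in enumerate(batch):
        # standard DQN target; the exact return formula in filettoQ.py may differ
        targets[i, a] = r if done else r + GAMMA * np.max(next_q[i])
    q_net.fit(states, targets, epochs=1, verbose=0)
    if cycle % UPDATE_EVERY == 0:
        target_net.set_weights(q_net.get_weights())          # hard, not soft, update
```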

Problem definition

  • two players (white and black)
  • a board with nine positions
  • each position can be empty or contain a piece of either colour

Let's define the board's positions as follows:

0   1   2
3   4   5
6   7   8

State and Actions

The board is completely defined by 18 elements, using one-hot encoding. If state[0:18] is the vector describing the board state, and c is a cell index, then:

state[c] = 1 if cell c contains a white piece, 0 otherwise
state[c+9] = 1 if cell c contains a black piece, 0 otherwise
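A minimal sketch of this encoding, assuming the board is stored as a list of nine cells containing 'W', 'B' or None (the real representation in filettoQ.py may differ):

```python
import numpy as np

# One-hot encoding of the board as described above.
def encode_state(board):
    state = np.zeros(18, dtype=np.float32)
    for c in range(9):
        if board[c] == 'W':        # white piece in cell c
            state[c] = 1.0
        elif board[c] == 'B':      # black piece in cell c
            state[c + 9] = 1.0
    return state

# Example: black in cell 0 (top-left corner), white in cell 4 (centre)
print(encode_state(['B', None, None, None, 'W', None, None, None, None]))
```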

Definition of Q, Q*

Q and Q* have an input size of 18 and an output size of 9. This configuration comes from a suggestion by Andrew Ng (see the Coursera Machine Learning specialisation) and makes Q output the return value for all nine moves at once. It is also possible to increase the input to 19 elements, passing the action a as well, and getting only return(s,a). However, the chosen configuration turns out to be simpler to handle in code.
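A minimal sketch of such a network in Keras; the hidden layer sizes and activations are assumptions, only the 18-input / 9-output shape comes from the text above:

```python
import tensorflow as tf

# Q network: 18 inputs (one-hot board state), 9 outputs (one return per move).
def build_q_network():
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(18,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(9, activation='linear'),   # return for each of the 9 moves
    ])

Q = build_q_network()          # trained network
Q_star = build_q_network()     # target network, synchronised periodically
Q_star.set_weights(Q.get_weights())
```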

Rewards

We then assign:

  • -1 for forbidden moves (not actually used, as illegal moves are blocked before execution)
  • 0.5 for victory (of the human player)
  • -0.5 for loss
  • 0 in all other cases

Simulator

The simulator receives the state S and an action A. It returns the new state S', the reward R, and a boolean that is True if the game is over. If executing A would lead to a forbidden configuration, then S' = S.

filetto(S,A) -> S', R, gameover
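A minimal sketch of what such a simulator could look like, assuming the 18-element state described earlier and an explicit player_offset argument (0 for white, 9 for black); the real simulator in filettoQ.py may be structured differently:

```python
import numpy as np

# filetto(S, A) -> S', R, gameover, using the reward scheme listed above.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def filetto(state, action, player_offset=0):
    if state[action] == 1 or state[action + 9] == 1:
        return state, -1.0, False               # forbidden move: S' = S
    new_state = state.copy()
    new_state[action + player_offset] = 1
    for a, b, c in WIN_LINES:
        if all(new_state[i + player_offset] == 1 for i in (a, b, c)):
            # 0.5 if the human (white) just won, -0.5 if the computer did
            return new_state, 0.5 if player_offset == 0 else -0.5, True
    if new_state.sum() == 9:                    # board full: draw
        return new_state, 0.0, True
    return new_state, 0.0, False
```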

Playing filetto

During the game, the computer uses Q.predict() to calculate the best move from its point of view, that is, the move with the lowest return (returns are expressed from the human player's perspective, so the computer minimises them).
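A minimal sketch of this selection step, assuming a Keras model with 18 inputs and 9 outputs and the one-hot state vector described earlier:

```python
import numpy as np

# Pick the legal move with the lowest predicted return.
def computer_move(model, state):
    q_values = model.predict(state.reshape(1, 18), verbose=0)[0]
    legal_moves = [c for c in range(9) if state[c] == 0 and state[c + 9] == 0]
    # the lowest predicted return is best for the computer, since returns are
    # expressed from the human player's point of view
    return min(legal_moves, key=lambda c: q_values[c])
```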
