Commit

first commit.
KaixiangLin committed Jun 25, 2019
0 parents commit f216fdf
Showing 61 changed files with 9,623 additions and 0 deletions.
175 changes: 175 additions & 0 deletions .gitignore
## Core latex/pdflatex auxiliary files:
*.aux
*.lof
*.log
*.lot
*.fls
*.out
*.toc
*.fmt
.DS_Store
*/temp/*
*.pyc
*/.idea/*
.idea/*
*.DS_Store*
*.ipynb_checkpoints/*
notebooks/.ipynb_checkpoints/*
*.dropbox*
*Icon*
*/__pycache__/*
*/.ipynb_checkpoints/*
## Intermediate documents:
*.dvi
*-converted-to.*
# these rules might exclude image files for figures etc.
# *.ps
# *.eps
# *.pdf

## Bibliography auxiliary files (bibtex/biblatex/biber):
*.bbl
*.bcf
*.blg
*-blx.aux
*-blx.bib
*.brf
*.run.xml

## Build tool auxiliary files:
*.fdb_latexmk
*.synctex
*.synctex.gz
*.synctex.gz(busy)
*.pdfsync

## Auxiliary and intermediate files from other packages:
# algorithms
*.alg
*.loa

# achemso
acs-*.bib

# amsthm
*.thm

# beamer
*.nav
*.snm
*.vrb

# cprotect
*.cpt

#(e)ledmac/(e)ledpar
*.end
*.[1-9]
*.[1-9][0-9]
*.[1-9][0-9][0-9]
*.[1-9]R
*.[1-9][0-9]R
*.[1-9][0-9][0-9]R
*.eledsec[1-9]
*.eledsec[1-9]R
*.eledsec[1-9][0-9]
*.eledsec[1-9][0-9]R
*.eledsec[1-9][0-9][0-9]
*.eledsec[1-9][0-9][0-9]R

# glossaries
*.acn
*.acr
*.glg
*.glo
*.gls

# gnuplottex
*-gnuplottex-*

# hyperref
*.brf

# knitr
*-concordance.tex
*.tikz
*-tikzDictionary

# listings
*.lol

# makeidx
*.idx
*.ilg
*.ind
*.ist

# minitoc
*.maf
*.mtc
*.mtc[0-9]
*.mtc[1-9][0-9]

# minted
_minted*
*.pyg
*.pyc
# morewrites
*.mw

# mylatexformat
*.fmt

# nomencl
*.nlo

# sagetex
*.sagetex.sage
*.sagetex.py
*.sagetex.scmd

# sympy
*.sout
*.sympy
sympy-plots-for-*.tex/

# pdfcomment
*.upa
*.upb

#pythontex
*.pytxcode
pythontex-files-*/

# Texpad
.texpadtmp

# TikZ & PGF
*.dpth
*.md5
*.auxlock

# todonotes
*.tdo

# xindy
*.xdy

# xypic precompiled matrices
*.xyc

# WinEdt
*.bak
*.sav

# endfloat
*.ttt
*.fff

# Latexian
TSWLatexianTemp*

main.pdf

*.dropbox*

48 changes: 48 additions & 0 deletions README.md
# Ranking Policy Gradient
Ranking Policy Gradient (RPG) is a sample-efficient policy gradient method
that learns the optimal ranking of actions with respect to the long-term reward.
This codebase contains an implementation of RPG using the
[dopamine](https://github.com/google/dopamine) framework.


## Instructions


### Install via source
#### Step 1.
Follow the installation [instructions](https://github.com/KaixiangLin/dopamine/blob/master/README.md#install-via-source) for the
dopamine framework on [Ubuntu](https://github.com/KaixiangLin/dopamine/blob/master/README.md#ubuntu)
or [Mac OS X](https://github.com/KaixiangLin/dopamine/blob/master/README.md#mac-os-x).

#### Step 2.
Download the RPG source:

```
git clone git@github.com:illidanlab/rpg.git
```


## Running the tests

```
cd ./rpg/dopamine
python -um dopamine.atari.train \
--agent_name=rpg \
--base_dir=/tmp/dopamine \
--random_seed 1 \
--game_name=Pong \
--gin_files='dopamine/agents/rpg/configs/rpg.gin'
```
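
The other agents described in [code.md](code.md) (`lpg`, `epg`, `repg`, and `implicit_quantilerpg`) can presumably be run in the same way by changing `--agent_name` and pointing `--gin_files` at the corresponding config under `dopamine/agents/<agent>/configs/`.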

## Reproduce
To reproduce the results in the paper, please refer to the instructions in [code.md](code.md).

### Reference

If you use this RPG implementation in your work, please consider citing the following papers:
```
TODO(RPG):
```

## Acknowledgments
TODO(dopamine framework, fundings).
65 changes: 65 additions & 0 deletions code.md
# Overview

This document explains the structure of this codebase and the hyperparameters of the experiments.


## File organization

### Step 1.
Please refer to the description of the dopamine file organization [here](https://github.com/KaixiangLin/dopamine/blob/master/docs/README.md#file-organization).

### Step 2.
We add variants of RPG agents in [this folder](dopamine/dopamine/agents); each agent is explained below:


| Folder | Exploration | Supervision |
|---|---|---|
| rpg | epsilon-greedy | RPG (hinge loss) |
| lpg | epsilon-greedy | LPG (cross-entropy) |
| epg | EPG | LPG (cross-entropy) |
| repg | EPG | RPG (hinge loss) |
| implicit_quantilerpg | implicit_quantile | RPG (hinge loss) |


* EPG: the stochastic listwise policy gradient (i.e., the vanilla policy gradient) trained
with off-policy supervised learning. The exploration and supervision agents are parameterized
by the same neural network. The supervision agent minimizes the cross-entropy loss
over the near-optimal trajectories collected in an online fashion.

* LPG: the deterministic listwise policy gradient with off-policy supervised learning.
During evaluation, the action is chosen greedily based on the logits; during training, the agent
explores the environment stochastically, in the same way as EPG.

* RPG: RPG explores the environment using a separate agent: epsilon-greedy or EPG in Pong, and
Implicit Quantile in the other games. It then performs off-policy supervised
learning by minimizing the hinge loss (both supervision losses are sketched right after this list).

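The snippet below is a minimal sketch, not part of the released code, of the two supervision losses named above, assuming a small discrete action space and per-state action logits; the function names and example values are hypothetical.

```python
import numpy as np

def cross_entropy_loss(logits, action):
    """LPG/EPG-style supervision (sketch): maximize the log-probability of the
    action taken in a near-optimal trajectory."""
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return -np.log(probs[action])

def pairwise_hinge_loss(logits, action, margin=1.0):
    """RPG-style supervision (sketch): rank the near-optimal action above every
    other action by at least `margin`."""
    others = np.delete(logits, action)
    return np.maximum(0.0, margin - (logits[action] - others)).sum()

# Example with 6 hypothetical Atari actions; action 2 was taken in a
# near-optimal trajectory.
logits = np.array([0.3, -0.1, 1.2, 0.5, 0.0, -0.4])
print(cross_entropy_loss(logits, 2))   # cross-entropy supervision (LPG/EPG)
print(pairwise_hinge_loss(logits, 2))  # pairwise hinge supervision (RPG)
```

In the actual agents these objectives are presumably computed with TensorFlow over batches of transitions from the stored near-optimal trajectories; the sketch only illustrates the per-step losses.
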
In this codebase, the folder [rpg](dopamine/dopamine/agents/rpg)
contains the code for RPG with epsilon-greedy exploration; similarly, [repg](dopamine/dopamine/agents/repg) uses EPG exploration,
and [implicit_quantilerpg](dopamine/dopamine/agents/implicit_quantilerpg)
uses implicit quantile network exploration.

The agents with relatively simple exploration strategies (rpg, lpg, epg, repg) perform well on Pong
compared to the state of the art, since there is a higher chance of hitting good trajectories in Pong.
For more complicated games, we adopt the implicit quantile network as the exploration agent.

## Hyperparameters
The hyperparameters of the networks, optimizers, etc., are the same as in the [baselines](https://github.com/KaixiangLin/dopamine/tree/master/baselines) of dopamine.
The trajectory reward threshold c (see Def. 5 in the paper (TODO)) for each game is given below; a sketch of how such a threshold might be used follows the table:

| Game | c |
|---|---|
| Boxing | 100 |
| Breakout | 400 |
| Bowling | 80 |
| BankHeist | 1100 |
| DoubleDunk | 18 |
| Pitfall | 0 |
| Pong | 1 |
| Robotank | 65 |

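A minimal sketch of how such a threshold might be used, under the assumption that a trajectory counts as near-optimal once its episode return reaches c (an illustration, not the code path of this repository; all names are hypothetical):

```python
from typing import List, Tuple

# Hypothetical trajectory representation: (list of (state, action) pairs, episode return).
Trajectory = Tuple[list, float]

# Thresholds c from the table above.
REWARD_THRESHOLD_C = {
    "Boxing": 100, "Breakout": 400, "Bowling": 80, "BankHeist": 1100,
    "DoubleDunk": 18, "Pitfall": 0, "Pong": 1, "Robotank": 65,
}

def filter_near_optimal(trajectories: List[Trajectory], game: str) -> List[Trajectory]:
    """Keep the trajectories whose episode return reaches the threshold c, so that
    they can be used for the off-policy supervised-learning stage."""
    c = REWARD_THRESHOLD_C[game]
    return [traj for traj in trajectories if traj[1] >= c]

# Example: only the second Pong trajectory (return 3 >= c = 1) is kept.
kept = filter_near_optimal([([], -21.0), ([], 3.0)], "Pong")
print(len(kept))  # 1
```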




15 changes: 15 additions & 0 deletions dopamine/dopamine/__init__.py
# coding=utf-8
# Copyright 2018 The Dopamine Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name = 'dopamine'
15 changes: 15 additions & 0 deletions dopamine/dopamine/agents/__init__.py
# coding=utf-8
# Copyright 2018 The Dopamine Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
