Policy Gradient Implementation Using Gym and Tensorflow

Policy gradient network is implemented using popular atari game, Pong Game. "Policy gradients method involves running a policy for a while, seeing what actions lead to high rewards, increasing their probability through backpropagating gradients".

If there is a large scale problems which is aimed to solve, a type of function approximator should be used. In this problem, a neural network is used as function approximator. There are too many states and/or actions to store in memory, so look up table can not be used.

Andrej Karpathy (Deep Reinforcement Learning: Pong from Pixels): http://karpathy.github.io/2016/05/31/rl/

Policy Gradient Neural Network, based on Andrej’s solution, will do:

take in images from the game and "preprocess" them (remove color, background, etc).
use the TF NN to compute a probability of moving up or down.
sample from that probability distribution and tell the agent to move up or down.
if the round is over, find whether you won or lost.
when the episode has finished, pass the result through the backpropagation algorithm to compute the gradient for weights.
after each episodes have finished, sum up the gradient and move the weights in the direction of the gradient.
repeat this process until weights are tuned to the point.

PongGame Experiment Results

After a period time, scores are getting better.

After 2 days running, system is learned and starting to beat opponent. Last saved checkpoint which is learned after 2days is committed in the checkpoint folder. When starting code in your environment, if there is a checkpoint point folder, it will be loaded..