Notes on policy gradients in auto-diff frameworks

29 May 2018

In my first blog post I’m sharing some notes that give a mathematical perspective on policy gradient methods (just vanilla policy gradients here, but the same core steps extend to more sophisticated approaches) in the context of automatic differentiation (AD) frameworks such as Tensorflow. This post assumes familiarity with reinforcement learning and policy gradients in general (see Karpathy’s blog for a good and intuitive overview of policy gradients), as well as an understanding of an AD framework such as Tensorflow in the context of supervised learning: setting up a computational graph with an associated loss function, then running a train operation on this system to update the weights of the neural network.
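
For concreteness, the kind of supervised-learning workflow assumed here looks roughly like the following. This is only a minimal sketch in Tensorflow 1.x with made-up network sizes and random data, not code from the notes:

```python
import numpy as np
import tensorflow as tf

# Placeholders for a batch of inputs and regression targets (shapes are arbitrary here).
x = tf.placeholder(tf.float32, shape=[None, 4])
y = tf.placeholder(tf.float32, shape=[None, 1])

# A small fully connected network.
hidden = tf.layers.dense(x, 16, activation=tf.nn.relu)
prediction = tf.layers.dense(hidden, 1)

# A loss attached to the graph; the optimizer's train op updates the weights
# by backpropagating through the computational graph.
loss = tf.reduce_mean(tf.square(prediction - y))
train_op = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch_x = np.random.randn(32, 4).astype(np.float32)
    batch_y = np.random.randn(32, 1).astype(np.float32)
    _, batch_loss = sess.run([train_op, loss], feed_dict={x: batch_x, y: batch_y})
```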

The purpose of my notes is not to provide intuition; there are plenty of other resources for that, including the aforementioned blog post by Karpathy. Instead, my aim is to bridge the gap between theory and implementation by showing the steps involved in transforming the policy gradient theorem into formulations applicable to automatic differentiation. Hopefully this is helpful when trying to implement a policy gradient method yourself; this blog provides an example implementation in Tensorflow. My colleague Erik Gärtner has a great blog post about how to implement policy gradients in Caffe, which, as we experienced, is a bit trickier than doing so in Tensorflow.
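
To give a flavor of what that transformation looks like: the policy gradient theorem expresses the gradient of the expected return as an expectation of grad log pi(a|s) weighted by the return, but an AD framework only knows how to differentiate a scalar loss. The usual trick is therefore to hand it a surrogate loss whose gradient coincides with the (negative of the) sampled policy gradient. Below is a minimal sketch of such a surrogate in Tensorflow 1.x, with hypothetical placeholder shapes and a discrete action space; the notes and the linked implementations spell out the details:

```python
import tensorflow as tf

# Hypothetical placeholders: observed states, actions taken, and their returns.
states = tf.placeholder(tf.float32, shape=[None, 4])
actions = tf.placeholder(tf.int32, shape=[None])
returns = tf.placeholder(tf.float32, shape=[None])  # fed as data, treated as constants

# A small policy network producing action logits.
hidden = tf.layers.dense(states, 16, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 2)

# -log pi(a_t | s_t) for the actions that were actually taken.
neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=actions, logits=logits)

# Surrogate loss: minimizing it performs gradient ascent on expected return,
# since its gradient matches the (negative) sampled policy gradient.
loss = tf.reduce_mean(neg_log_prob * returns)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```

The key point is that the returns enter the loss as constants, so differentiating the surrogate only propagates gradients through log pi(a|s), which is exactly what the policy gradient theorem prescribes.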

Click here for my notes on policy gradients in auto-diff frameworks.