RL 1


We are trying to learn a good policy directly, without having to estimate all the parameters of a model of the environment. We want to maximize L, the expected total reward over state-action paths of finite horizon T.

The gradient of L is hard to calculate directly, but the log-derivative trick lets us express it as an expectation over sampled state-action paths of length T. We can then estimate this expectation empirically by averaging over sampled paths.
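
To make the trick concrete, here is a sketch of the derivation. The notation is my own assumption and not fixed by the notes above: \tau is a state-action path of length T, p_\theta(\tau) is its probability under the policy \pi_\theta, R(\tau) is its total reward, and N is the number of sampled paths. The second line uses \nabla p = p \nabla \log p, and the fourth uses the fact that the initial state distribution and the dynamics do not depend on \theta.

```latex
\begin{align*}
\nabla_\theta L(\theta)
  &= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau \\
  &= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau \\
  &= \mathbb{E}_{\tau \sim p_\theta}\big[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \big] \\
  &= \mathbb{E}_{\tau \sim p_\theta}\Big[ \Big( \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big) R(\tau) \Big] \\
  &\approx \frac{1}{N} \sum_{i=1}^{N} \Big( \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \Big) R(\tau^{(i)})
\end{align*}
```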


But the estimate will have variance, and we want to reduce it. For each component \theta_i of the gradient, we can shift the reward by a constant (a baseline). This doesn't change the expectation, since the extra term has expectation 0, but it can reduce the variance, which is a good thing.
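
A small numerical check of the baseline idea, as a sketch under assumptions of my own: a one-step task with two actions, a softmax policy, made-up rewards, and the expected reward under the policy as the constant shift. None of these choices come from the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: a one-step, two-action task with a softmax policy.
theta = np.zeros(2)                      # policy logits
arm_rewards = np.array([10.0, 11.0])     # made-up reward for each action

# Softmax policy probabilities.
pi = np.exp(theta - theta.max())
pi /= pi.sum()

# Sample actions from the policy and look up their rewards.
n = 10_000
actions = rng.choice(2, size=n, p=pi)
rewards = arm_rewards[actions]

# For a softmax policy, grad_theta log pi(a) = onehot(a) - pi.
score = np.eye(2)[actions] - pi          # shape (n, 2)

# Assumed baseline: the expected reward under the current policy.
baseline = pi @ arm_rewards

# Per-sample gradient estimates, without and with the baseline.
grads_plain = rewards[:, None] * score
grads_base = (rewards - baseline)[:, None] * score

print("mean, no baseline:  ", grads_plain.mean(axis=0))
print("mean, with baseline:", grads_base.mean(axis=0))
print("total variance, no baseline:  ", grads_plain.var(axis=0).sum())
print("total variance, with baseline:", grads_base.var(axis=0).sum())
```

Both estimators should agree on the mean up to sampling noise, while the per-sample variance is much smaller with the baseline. This toy case is unusually favorable; in general the baseline reduces the variance rather than removing it.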


A further reduction in variance can be achieved. The policy gradient can be written as a double sum over the reward step k and the action time t. We can drop the terms where k < t: intuitively, that reward was already received before the action at time t was taken, so the action cannot have affected it. These terms have zero expectation and only add variance.
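
Dropping the k < t terms means the action at time t is weighted only by the rewards from step t onward (often called the reward-to-go). A minimal sketch, where the function name `reward_to_go` and the example rewards are hypothetical:

```python
import numpy as np

def reward_to_go(rewards):
    """Return G where G[t] is the sum of rewards[k] over all k >= t."""
    rewards = np.asarray(rewards, dtype=float)
    # Reverse, take the running sum, and reverse back.
    return np.cumsum(rewards[::-1])[::-1]

# Made-up rewards for a path of length T = 4.
rewards = np.array([1.0, 0.0, 2.0, 3.0])
T = len(rewards)

# Double-sum view: mask[t, k] = 1 when k >= t, so terms with k < t are dropped.
mask = np.triu(np.ones((T, T)))
weights_from_mask = mask @ rewards

print(reward_to_go(rewards))   # [6. 5. 5. 3.]
print(weights_from_mask)       # same values, computed from the masked double sum
```

Without the mask, every action in the path would be weighted by the full total of 6; with it, later actions are no longer credited for rewards that came before them.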
