We are trying to learn a good policy directly, without estimating all the parameters of a model of the environment. We want to maximize L, the expected total reward over a finite-horizon state-action path of length T.
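In symbols, the objective can be sketched as follows (the notation is assumed, since the text introduces none: τ is a state-action path sampled by following a policy π with parameters θ, and r is the per-step reward):

```latex
L(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T-1} r(s_t, a_t) \right]
```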
The gradient of L is hard to compute directly, but the log-derivative trick lets us express it as an expectation over sampled state-action paths of length T. We can then estimate this expectation empirically from samples.
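A minimal sketch of the log-derivative (score-function) estimator, on an assumed toy problem: a one-step, two-armed bandit where the policy is Bernoulli with P(a=1) = sigmoid(θ), so the analytic gradient is available to check against. The key identity is ∇ E[r(a)] = E[r(a) ∇ log π(a)].

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.3
p = 1.0 / (1.0 + np.exp(-theta))   # P(a=1) under the policy
rewards = np.array([1.0, 2.0])     # r(a=0), r(a=1)

# Analytic gradient for comparison:
# E[r] = (1-p)*r0 + p*r1, so dE/dtheta = (r1 - r0) * p * (1-p)
true_grad = (rewards[1] - rewards[0]) * p * (1 - p)

# Score-function estimate: average r(a) * d/dtheta log pi(a) over samples.
# For the sigmoid-Bernoulli policy, d/dtheta log pi(a) = a - p.
n = 200_000
a = (rng.random(n) < p).astype(int)
grad_est = np.mean(rewards[a] * (a - p))

print(true_grad, grad_est)  # the two values should be close
```

The same recipe extends to length-T paths: the score of a path is the sum of per-step terms ∇ log π(a_t | s_t), weighted by the path's reward.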
But the empirical estimate has variance, and we want to reduce it. We can subtract a constant baseline from the reward in each component of the gradient. This does not change the expectation, since the baseline term's own expectation is 0, but it can reduce the variance, which is a good thing.
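Continuing the same assumed two-armed bandit sketch: subtracting a constant baseline b from the reward leaves the mean of the estimator unchanged (because E[b · ∇ log π(a)] = b · E[a − p] = 0) while shrinking the per-sample variance. Here b is taken as the average reward, one common heuristic choice.

```python
import numpy as np

rng = np.random.default_rng(1)

theta = 0.3
p = 1.0 / (1.0 + np.exp(-theta))
rewards = np.array([1.0, 2.0])

n = 200_000
a = (rng.random(n) < p).astype(int)
score = a - p                          # d/dtheta log pi(a)

plain = rewards[a] * score             # vanilla estimator terms
b = rewards.mean()                     # constant baseline (assumed choice)
baselined = (rewards[a] - b) * score   # baseline-shifted terms

# Same mean (the baseline term has expectation 0) ...
print(plain.mean(), baselined.mean())
# ... but much lower variance:
print(plain.var(), baselined.var())
```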
Another reduction in variance can be achieved. The policy gradient can be written as a double sum over the reward step k and the action time t. We can drop the terms where k < t: the reward at step k was already received before the action at time t was taken, so that action cannot have influenced it. Each action's score is then weighted only by the rewards that come after it, the "reward-to-go".
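A minimal sketch of the reward-to-go weights (the function name and example rewards are illustrative): dropping the k < t terms means each score term ∇ log π(a_t | s_t) is multiplied by Q_t = Σ_{k ≥ t} r_k instead of the full return, which a reversed cumulative sum computes in one pass.

```python
import numpy as np

def reward_to_go(r):
    # Q_t = sum_{k >= t} r_k: reverse, cumulative-sum, reverse back.
    return np.cumsum(r[::-1])[::-1]

r = np.array([1.0, 0.0, 2.0, 1.0])
print(reward_to_go(r))  # [4. 3. 3. 1.]
```

Note the first entry is still the full return; only later actions shed the rewards they could not have caused.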