# RL 1

We are trying to learn a good policy directly, without estimating all the parameters of the model. We want to maximize $L$, the expected total reward over a finite-horizon state-action path of length $T$.
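To make the objective concrete (the notation here is assumed, not given in the original note): writing $\tau = (s_0, a_0, \dots, s_{T-1}, a_{T-1})$ for a state-action path sampled from the policy $\pi_\theta$,

$$
L(\theta) = \mathbb{E}_{\tau \sim p_\theta}\big[ R(\tau) \big], \qquad R(\tau) = \sum_{t=0}^{T-1} r(s_t, a_t).
$$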

The gradient of $L$ is hard to compute directly, but the log-derivative trick lets us express it as an expectation over sampled state-action paths of length $T$. We can then estimate this expectation empirically.
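Sketch of the log-derivative trick under the notation above (a standard derivation, added for completeness):

$$
\nabla_\theta L(\theta)
= \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau
= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
= \mathbb{E}_{\tau \sim p_\theta}\big[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \big].
$$

Because the dynamics do not depend on $\theta$, $\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, so the bracketed quantity can be computed on sampled paths and averaged.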

# REINFORCE

But this estimator has variance, and we want to reduce it. We can subtract a constant baseline $b_i$ from the reward in each component $i$ of the gradient. Since the score term $\partial_{\theta_i} \log p_\theta(\tau)$ has expectation 0, this shift does not change the expectation of the gradient, but a well-chosen baseline can reduce its variance, which is a good thing.
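A minimal sketch of this estimator in NumPy, assuming a toy 2-state / 2-action MDP with a per-state softmax policy (the environment, reward, and all names here are illustrative, not from the note; a single scalar baseline is used for simplicity rather than one per component):

```python
# Sketch: REINFORCE gradient estimate with a constant baseline on a toy MDP.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, T = 2, 2, 10


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def sample_trajectory(theta):
    """Roll out one length-T state-action path under the softmax policy pi_theta."""
    states, actions, rewards = [], [], []
    s = 0
    for _ in range(T):
        probs = softmax(theta[s])            # action probabilities in state s
        a = rng.choice(N_ACTIONS, p=probs)
        r = 1.0 if a == s else 0.0           # toy reward: action matches state
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = rng.integers(N_STATES)           # toy dynamics: uniform next state
    return states, actions, rewards


def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) for a per-state softmax policy."""
    g = np.zeros_like(theta)
    probs = softmax(theta[s])
    g[s] = -probs
    g[s, a] += 1.0
    return g


def reinforce_gradient(theta, n_samples=100, baseline=0.0):
    """Monte Carlo estimate of grad L(theta) with a constant baseline b.
    Subtracting b leaves the expectation unchanged but can reduce variance."""
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        states, actions, rewards = sample_trajectory(theta)
        total_reward = sum(rewards)
        score = sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions))
        grad += score * (total_reward - baseline)
    return grad / n_samples


theta = np.zeros((N_STATES, N_ACTIONS))
print(reinforce_gradient(theta, baseline=T / 2))
```

With `baseline=0.0` this reduces to plain REINFORCE; even a rough choice such as the average return typically lowers the variance of the estimate.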

# GPOMDP

A further reduction in variance can be achieved. The policy gradient can be written as a double sum over the reward step $k$ and the action time $t$. We can drop the terms where $k < t$: intuitively, the reward at step $k$ has already been received before the action at time $t$ is taken, so that action cannot have influenced it, and the corresponding terms have expectation zero.
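Written out under the same notation (the standard form of this estimator):

$$
\nabla_\theta L
= \mathbb{E}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{k=0}^{T-1} r_k \right]
= \mathbb{E}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{k=t}^{T-1} r_k \right],
$$

since every term with $k < t$ has expectation zero: $r_k$ is already fixed before $a_t$ is drawn, and the score $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ has expectation zero. Each action's score is therefore weighted only by the reward to go from time $t$ onward.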