Banach fixed point theorem
An operator T is a contraction map if there is a constant gamma < 1 such that for every x and y, the distance between T(x) and T(y) is at most gamma times the distance between x and y; it does not just shrink distances, it shrinks them by a uniform factor.
The theorem states that such a T has a unique fixed point x*, and that repeatedly applying T from any starting point converges to x*.
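A minimal numerical sketch of this convergence; the particular map T(x) = 0.5x + 1 and the starting point are arbitrary choices for illustration:

```python
# Iterating a contraction map T(x) = 0.5*x + 1 (contraction factor 0.5 < 1).
# By the Banach fixed point theorem the iterates converge to the unique
# fixed point x* satisfying x* = 0.5*x* + 1, i.e. x* = 2.

def T(x):
    return 0.5 * x + 1

x = 10.0  # arbitrary starting point
for _ in range(30):
    x = T(x)

print(x)  # ~2.0, the unique fixed point
```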
Value-based methods
We fix a policy pi and define the value function V^pi: V^pi(s) is the expected discounted reward when starting from state s and following pi. V^pi satisfies a recursive (Bellman) formula.
Q^pi is the state-action value function: Q^pi(s,a) is the expected discounted reward when starting in s, taking action a, and then following pi.
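In symbols (a standard formulation of the definitions above; gamma is the discount factor and P the transition kernel, both left implicit in the notes):

```latex
V^{\pi}(s) = \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t}\, r(s_t,a_t) \,\Big|\, s_0 = s,\; a_t \sim \pi(\cdot \mid s_t)\Big],
\qquad
Q^{\pi}(s,a) = \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t}\, r(s_t,a_t) \,\Big|\, s_0 = s,\; a_0 = a\Big].

% Recursive (Bellman) form of V^pi:
V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot\mid s)}\big[\, r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot\mid s,a)}\, V^{\pi}(s') \,\big].
```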
We can compute V^pi by repeatedly applying the recursive formula to the previous estimate of V (iterative policy evaluation). This update is a gamma-contraction, so by the fixed point theorem the iterates converge to the unique V that satisfies the recursive formula, which is exactly V^pi.
The same iteration works for Q^pi.
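A minimal sketch of this iteration for a small tabular MDP, assuming dense arrays P, R, and pi with the shapes described in the docstring (the names and shapes are illustrative, not from the notes):

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9, iters=1000):
    """Iterate the Bellman expectation operator until V approximates V^pi.

    P:  transition probabilities p(s' | s, a), shape (S, A, S)
    R:  rewards r(s, a),                       shape (S, A)
    pi: policy probabilities pi(a | s),        shape (S, A)
    """
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        # Q^pi estimate from the current V: r(s,a) + gamma * E_{s'}[V(s')]
        Q = R + gamma * P @ V            # shape (S, A)
        # New V^pi estimate: average Q over the policy's action distribution
        V = (pi * Q).sum(axis=1)
    return V, Q
```

Each loop iteration is one application of the recursive formula; running it to (approximate) convergence yields both V^pi and Q^pi.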
Optimal V and Q
V* and Q* are related. Since V* is the maximum over all policies, we can decompose any policy into the action a it takes at its first step followed by an arbitrary continuation policy; maximizing over both parts shows that V*(s) = max_a Q*(s,a) (2.3).
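Spelling out that decomposition (a standard one-line argument; pi' denotes the continuation policy followed after the first action):

```latex
V^{*}(s) \;=\; \max_{\pi} V^{\pi}(s)
         \;=\; \max_{a}\, \max_{\pi'} Q^{\pi'}(s,a)
         \;=\; \max_{a} Q^{*}(s,a). \tag{2.3}
```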
Bellman optimality: V* and Q* satisfy the following equations.
Q*(s,a) = r(s,a) + gamma * E_{s' ~ P(.|s,a)}[V*(s')] (2.5): the optimal value of taking action a in state s is the immediate reward plus the discounted expectation of V* over the next state s'.
Together, (2.3) and (2.5) give us that V*(s) = max_a [ r(s,a) + gamma * E_{s' ~ P(.|s,a)}[V*(s')] ], the Bellman optimality equation for V*.
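That equation suggests the same kind of fixed-point iteration as before, now with the Bellman optimality operator (value iteration). A sketch on the same tabular setup as above; the tolerance and array names are illustrative:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality operator
    V(s) <- max_a [ r(s,a) + gamma * E_{s' ~ P(.|s,a)} V(s') ]
    until convergence; the operator is a gamma-contraction, so the
    iterates converge to V*.
    """
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * P @ V        # Q* estimate, shape (S, A)
        V_new = Q.max(axis=1)        # Bellman optimality backup, using (2.3)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new
```

Calling `V_star, Q_star = value_iteration(P, R)` returns estimates of both V* and Q*.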
(2.4) The optimal policy acts greedily with respect to Q*: pi*(s) = argmax_a Q*(s,a). To see this, note that the value of this greedy policy satisfies the same recursive formula as V* (its Bellman equation coincides with the Bellman optimality equation), and since the fixed point of that formula is unique, the greedy policy's value equals V*, so it is optimal.
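Given a Q* estimate (for example the one returned by the value-iteration sketch above), the greedy policy of (2.4) is just a per-state argmax; the function name is illustrative:

```python
import numpy as np

def greedy_policy(Q_star):
    """pi*(s) = argmax_a Q*(s, a): act greedily with respect to Q*."""
    return np.argmax(Q_star, axis=1)   # one action index per state
```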