RL 2

Bannach fixed point

operator T is a contraction map if when we apply it to x and y then the distance between T(x) and T(y) is smaller than the original distance between x and y.

The theorem states that when we repeatedly apply such an operator T we converge to a fixed point x*.

Value based methods

we examine a policy pi, and we define the value of each state V function: expected reward of starting from that state and following pi. V satisfies a recursive formula.

Q is the state-action function: the expected reward of starting in s, taking action a and then following pi.

We can calculate V by continuously applying the recursive V formula on the previous estimate of V. The formula is contracting, so we will finally get to a V that satisfies the recursive formula, so it must be V*.

The same can be done for Q

Optimal V and Q

V* and Q* are related. as V* is the maximum of all policies we can express each policy as a policy that takes a at its first step and then follow an arbitrary policy, which shows that V* is max(a, Q*) (2.3)

Bellman optimality: V* and Q* satisfy the following equations –

Q* of starting from s and following any action a is r(s,a) + expectation of discounted V* by probability of the state s’ we’ll get to. (2.5)

Together, (2.3) and (2.5) gives us that V*(s) is the max action of reward + expected V*(s’)

(2.4) the optimal policy is following the maximal Q*. This is shown as for solving the recursive formula of every V, V* with its companion Q* must be from a policy that selects the maximal Q*…





Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s