Yeah, it's not clear to me why temporal-difference learning suddenly works so well here. Is it just that nobody had tried it for learning a Go policy with a strong NN architecture? In the Methods they mention TD learning for value functions, but I don't see anything about policies.
edit: OK, they're calling it policy iteration as opposed to TD learning. I guess I don't get the difference.
TD learning is, in some sense, one component of policy iteration: it's a way to learn the value function of a fixed policy. Policy iteration alternates between that evaluation step and an improvement step, where you update the policy greedily with respect to the value function you just estimated, then iterate between the "learn value" and "update policy" steps.
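
To make the relationship concrete, here's a minimal tabular sketch (a hypothetical toy chain MDP, nothing to do with AlphaGo's actual setup) where TD(0) is the evaluation step inside a policy-iteration loop:

    import random

    # Hypothetical 4-state chain: actions move left/right, reward 1 for landing on the last state.
    N_STATES, ACTIONS, GAMMA, ALPHA = 4, [-1, +1], 0.9, 0.1

    def step(s, a):
        s2 = max(0, min(N_STATES - 1, s + a))
        return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

    policy = {s: random.choice(ACTIONS) for s in range(N_STATES)}

    for _ in range(20):  # outer loop = policy iteration
        # 1. Policy evaluation via TD(0): estimate V for the *current* policy
        V = [0.0] * N_STATES
        for _ in range(500):
            s = random.randrange(N_STATES)
            s2, r = step(s, policy[s])
            V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])  # TD-error update
        # 2. Policy improvement: act greedily w.r.t. the V just learned
        def one_step_value(s, a):
            s2, r = step(s, a)
            return r + GAMMA * V[s2]
        policy = {s: max(ACTIONS, key=lambda a, s=s: one_step_value(s, a))
                  for s in range(N_STATES)}

    print(policy)  # converges to always moving right: {0: 1, 1: 1, 2: 1, 3: 1}

The TD update only ever evaluates the policy it's given; the greedy step in part 2 is what actually changes the policy, which is the distinction the paper is drawing.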