Yeah, it's not clear to me why temporal-difference learning suddenly works so well here. Is it just that nobody had tried it for learning a Go policy with a strong NN architecture? In the Methods they mention TD learning for value functions, but I don't see anything about policies.
edit: OK, they're calling it policy iteration as opposed to TD learning. I guess I don't get the difference.
TD learning is, in some sense, one component of policy iteration: it's a way to learn the value function of a fixed policy. Policy iteration alternates between that evaluation step and an improvement step, where you update the policy greedily with respect to the value function you just estimated, then iterate between the "learn value" and "update policy" steps.
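
To make the relationship concrete, here's a minimal tabular sketch (a hypothetical toy chain MDP, nothing to do with AlphaGo's actual setup) where TD(0) is the evaluation step inside a policy-iteration loop:

    import random

    # Hypothetical 4-state chain: actions move left/right, reward 1 for landing on the last state.
    N_STATES, ACTIONS, GAMMA, ALPHA = 4, [-1, +1], 0.9, 0.1

    def step(s, a):
        s2 = max(0, min(N_STATES - 1, s + a))
        return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

    policy = {s: random.choice(ACTIONS) for s in range(N_STATES)}

    for _ in range(20):  # outer loop = policy iteration
        # 1. Policy evaluation via TD(0): estimate V for the *current* policy
        V = [0.0] * N_STATES
        for _ in range(500):
            s = random.randrange(N_STATES)
            s2, r = step(s, policy[s])
            V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])  # TD-error update
        # 2. Policy improvement: act greedily w.r.t. the V just learned
        def one_step_value(s, a):
            s2, r = step(s, a)
            return r + GAMMA * V[s2]
        policy = {s: max(ACTIONS, key=lambda a, s=s: one_step_value(s, a))
                  for s in range(N_STATES)}

    print(policy)  # converges to always moving right: {0: 1, 1: 1, 2: 1, 3: 1}

The TD update only ever evaluates the policy it's given; the greedy step in part 2 is what actually changes the policy, which is the distinction the paper is drawing.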