
Yeah, it's not clear to me why temporal difference learning suddenly works so well here. Had nobody really tried it for learning a Go policy with a strong NN architecture? In the Methods they mention TD learning for value functions, but I don't see anything about policies.

edit: OK, they're calling it policy iteration as opposed to TD learning. I guess I don't get the difference.



TD learning is, in some sense, a component of policy iteration. TD learning estimates the value function of a given, fixed policy. Policy iteration then uses that value function to improve the same policy (typically by acting greedily with respect to it), and alternates between the "learn value" and "update policy" steps until the policy stops changing.

https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelb...
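To make the split concrete, here is a minimal sketch of that loop on a toy problem. Everything in it is an assumption made up for illustration: a 5-state chain where only reaching the right end pays a reward, a tabular value function, and a known deterministic model for the greedy step. It is not the setup from the paper, just the textbook pattern with TD(0) doing the "learn value" step.

```python
N_STATES = 5          # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA = 0.9           # discount factor


def step(state, action):
    """Deterministic toy dynamics: returns (next_state, reward)."""
    nxt = max(state - 1, 0) if action == 0 else state + 1
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)


def td0_evaluate(policy, sweeps=500, alpha=0.1, max_steps=20):
    """'Learn value' step: TD(0) estimate of V for a fixed policy."""
    V = [0.0] * N_STATES
    for _ in range(sweeps):
        for start in range(N_STATES - 1):      # start once from each state
            s = start
            for _ in range(max_steps):         # cap episodes that never end
                s2, r = step(s, policy[s])
                # TD(0) update: move V[s] toward the bootstrapped target
                V[s] += alpha * (r + GAMMA * V[s2] - V[s])
                s = s2
                if s == N_STATES - 1:
                    break
    return V


def improve(V):
    """'Update policy' step: act greedily w.r.t. V using the known model."""
    def q(s, a):
        s2, r = step(s, a)
        return r + GAMMA * V[s2]
    return [max(ACTIONS, key=lambda a: q(s, a)) for s in range(N_STATES - 1)]


def policy_iteration():
    policy = [0] * (N_STATES - 1)              # start with "always left"
    while True:
        V = td0_evaluate(policy)               # evaluate the current policy
        new_policy = improve(V)                # then improve it
        if new_policy == policy:               # stop once it is stable
            return policy, V
        policy = new_policy


if __name__ == "__main__":
    policy, V = policy_iteration()
    print(policy, [round(v, 3) for v in V])
```

Note that `td0_evaluate` never changes the policy and `improve` never learns anything; the policy only gets better because the outer loop alternates between the two. Starting from "always left", each round flips one more state to "right" until the whole chain heads for the goal.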



