Yes, the "label bias" is more of a structured learning / joint learning term that is present in natural language processing. But reinforcement learning suffers only if you do the learning to minimize local loss of the decision (label) - if you try to build a classifier that minimizes its loss on local decisions, instead on sequence of decisions.
Their value and policy networks aren't trained jointly and can compound errors. There are deep-neural-network approaches without joint training that work pretty well; the reason is that the networks have very good memory/representation capacity, and that lets them avoid much of the problem. But for huge games like Go it is quite possible that many more games need to be played for these non-structured models to work well.
Again, I don't think we're talking about the same concept. I also fail to see how training over an entire trajectory is going to help you with trajectories you've never seen. Also, these nets are definitely trained with discounted long-term rewards.
They train on trajectories, but they train the network to predict each step of the trajectory locally, not globally. Discounted long-term rewards are just a hack; they aren't joint learning.
The concept of label bias, or decision bias, is a joint/structured learning concept. It is a machine learning concept; it has nothing to do with the application. There are training regimes with a mathematical guarantee that the local decisions will minimize the future regret.
Joint learning is done not over the whole permutation of decisions but over the Markov chain of decisions, which is sometimes a good-enough assumption. For example, the value/policy network of AlphaGo is precisely Markov: given a state, tell me which next state has the highest probability of victory. The search then tries to find the sequence of moves that maximizes that probability, and then it makes the best local decision (one move). It works like depth-limited minimax or beam search. They do rollouts (play the whole game) to train the value network, but the question is whether they train it to minimize the local loss of the decisions that were made, or to minimize the future regret of each local decision. As I've stated before, whether you minimize the joint loss over the sequence or the local loss of each individual decision is exactly what determines whether there will be bias.
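To make that concrete, here's a minimal Python sketch of the decision rule I mean: score leaf states with a learned value function, search a few plies ahead, then commit to a single local move. The `value_net`, `legal_moves`, and `play` functions are hypothetical placeholders, not AlphaGo's actual interfaces.

```
# Depth-limited negamax lookahead over a learned value function.
# Assumes value_net(state) returns P(victory) from the perspective
# of the player to move (an assumption, not AlphaGo's real API).

def lookahead_value(state, depth, value_net, legal_moves, play):
    """Estimate a state's value by searching `depth` plies ahead."""
    if depth <= 0 or not legal_moves(state):
        return value_net(state)  # leaf: trust the network's estimate
    # After a move it is the opponent's turn, so negate their value.
    return max(
        -lookahead_value(play(state, m), depth - 1, value_net, legal_moves, play)
        for m in legal_moves(state)
    )

def best_local_move(state, depth, value_net, legal_moves, play):
    """Make the single best local decision (one move) after the search."""
    return max(
        legal_moves(state),
        key=lambda m: -lookahead_value(play(state, m), depth - 1,
                                       value_net, legal_moves, play),
    )
```

Note that the search loop is the same either way; the bias question is entirely about how `value_net` was trained.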
The whole point of reinforcement learning is to create a dataset huge enough to overcome the unseen-trajectories problem. Training models to play Go is an entirely different kind of problem.
Now that they have hundreds of millions of meaningful games, they can skip the reinforcement learning and just learn from the games.
An illustration of the "label bias" problem is available in one of the sources I referenced; terms like compounding errors and unseen states appear there. "Label bias" is present only in discriminative models, not generative ones. Which means that AlphaGo, being a discriminative model, can suffer from "label bias" if it wasn't trained to avoid it.
Yes, I'm pretty sure we're not talking about the same thing. I'm talking precisely about the unseen-trajectories problem. Nothing is going to save you from the fact that the net has never seen a certain state before.
That's not really a problem. Given a large enough dataset, you want to generalize from it - there are always states not present in the dataset - so the whole point is to extract features from your dataset that allow generalization to unseen states. Seeing all possible Go games isn't feasible.
The compounding-errors problem that stems from decision bias isn't because you haven't seen the trajectory; it's because the model isn't trained jointly.
We're talking about the same thing. You just aren't familiar with the difference between joint-learning discriminative models and local-decision classifiers (maximum entropy Markov models vs. conditional random fields - or recursive CNNs trained on a joint loss over the sequence vs. recursive CNNs trained to minimize the loss of each local decision).
In the case of Go, you would either minimize the loss over the whole game of Go or over the local decisions made during the game. The latter results in decision bias, which leads to compounding errors. Joint learning comes with a guarantee that the compounding error has a globally sound bound (the proofs are information-theoretic and put mathematical guarantees on discriminative models applied to sequence labelling, or sequence decision making).
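Here's a minimal numpy sketch of the two objectives on a toy linear-chain model, with per-step label scores `emit[t, y]` and transition scores `trans[i, j]` (illustrative names, assuming log-space scores). The local version normalizes each decision on its own, MEMM-style; the joint version computes one partition function over all label sequences, CRF-style.

```
import numpy as np

def local_loss(emit, trans, gold):
    """Locally normalized (MEMM-style): each decision gets its own
    softmax, conditioned on the previous gold label. This per-step
    normalization is what opens the door to label bias."""
    loss, prev = 0.0, None
    for t, y in enumerate(gold):
        s = emit[t] + (trans[prev] if prev is not None else 0.0)
        loss += np.log(np.exp(s).sum()) - s[y]  # per-step partition
        prev = y
    return loss

def joint_loss(emit, trans, gold):
    """Globally normalized (CRF-style): one partition function over
    all label sequences, computed with the forward algorithm, so
    probability mass is traded off across the whole sequence at once."""
    T = emit.shape[0]
    gold_score = emit[0, gold[0]] + sum(
        trans[gold[t - 1], gold[t]] + emit[t, gold[t]] for t in range(1, T))
    alpha = emit[0].copy()
    for t in range(1, T):
        # log-sum-exp over the previous label for every current label
        alpha = emit[t] + np.log(np.exp(alpha[:, None] + trans).sum(axis=0))
    return np.log(np.exp(alpha).sum()) - gold_score
```

Minimizing `local_loss` trains exactly the kind of local-decision classifier I'm describing; minimizing `joint_loss` is the joint training that carries the bound.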
edit:
Check out the lecture below; around the 16-minute mark it has a Super Mario example and describes exactly the problem you mentioned. The presenter is one of the leading figures in joint learning.
It is a completely supervised learning problem. But look at reinforcement learning as a process that has to include a step of generating meaningful games from which a model can learn. After you have generated a bazillion meaningful games, you can discard the reinforcement part and just learn (see the sketch below). You then try to get as close to the globally "optimal" policy as you can, instead of trying to go from an idiot player to a master.
Of course, the data will have flaws if your intermediate model plays with a decision bias. So, instead of training the intermediate model to have a bias, train it without one :D
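In pseudocode terms, the pipeline I mean looks roughly like this (a sketch only; `selfplay_game` and `fit_supervised` are hypothetical placeholders for whatever game generator and trainer you use):

```
# Rough sketch: use self-play purely as a data generator,
# then fall back to ordinary supervised learning.

def generate_then_learn(initial_policy, n_games):
    # Phase 1: reinforcement/self-play exists only to produce
    # meaningful games, e.g. (state, move, outcome) records.
    dataset = [selfplay_game(initial_policy) for _ in range(n_games)]
    # Phase 2: discard the reinforcement loop entirely and fit a
    # model to the accumulated games as a plain supervised problem.
    return fit_supervised(dataset)
```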
Yes, Hal Daume is referring to the issue I brought up. I'm not interpreting his comments as referring to issues with training the model jointly - he's referring to exactly what I'm describing: never having even seen the expert make a mistake. The only solution is to generate trajectories more intelligently (which is in line with Daume's comments).
Yes, it is true. In the case of Super Mario he does the learning by simulating level-K BFS from positions that resulted in errors (unseen states), and thus minimizes the regret over the next K moves.
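That procedure is essentially iterative data aggregation. A rough Python sketch of the idea (DAgger-style; `env`, `expert`, and `train` are hypothetical placeholders, not his actual setup):

```
# DAgger-style loop: roll out the learned policy, then ask the expert
# what it would have done in the states the learner actually visited.
# This deliberately collects labels for the "unseen states" that a
# pure expert dataset never contains.

def dagger(env, expert, train, n_iters=10):
    dataset, policy = [], expert  # the first rollout follows the expert
    for _ in range(n_iters):
        for state in rollout(env, policy):          # learner's own states
            dataset.append((state, expert(state)))  # expert's correction
        policy = train(dataset)                     # retrain on aggregate
    return policy

def rollout(env, policy, max_steps=1000):
    state, steps = env.reset(), []
    for _ in range(max_steps):
        steps.append(state)
        state, done = env.step(policy(state))
        if done:
            break
    return steps
```

The point is that the learner's own rollouts decide which states get expert labels, which is exactly how the unseen-state problem gets addressed.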
Although, if you check out his papers, the problem I've been talking about shows up too: even when you have more than enough data and you know you should be able to generalize well, you can still get subpar performance if you don't optimize jointly. The AlphaGo model isn't optimized jointly; its power mostly lies in the extreme representational ability of deep neural networks.