2. Better than training on tokens is training on probability distributions (distillation) and on trees of probability distributions
I did a quick search for this and found some interesting papers. Links are below in case anyone wants to read them.
https://arxiv.org/abs/2212.11481
https://towardsdatascience.com/a-new-way-to-predict-probabil...
https://arxiv.org/pdf/1912.07913.pdf
https://dukespace.lib.duke.edu/dspace/bitstream/handle/10161...
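
For anyone unfamiliar with the distinction in point 2, here is a minimal sketch of the difference between training on a token and training on a probability distribution. It is not taken from any of the papers above; the 4-word vocabulary, the logits, and the helper names (`softmax`, `cross_entropy`) are all made up for illustration, and it only covers the plain distillation case, not trees of distributions.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, student_logits):
    """Cross-entropy H(target, student) in nats."""
    student_probs = softmax(student_logits)
    return -sum(
        t * math.log(s)
        for t, s in zip(target_probs, student_probs)
        if t > 0  # zero-probability targets contribute nothing
    )

# Hypothetical student scores over a 4-word vocabulary (illustrative numbers).
student_logits = [2.0, 1.0, 0.5, -1.0]

# Token training: the target is one-hot on the single observed token (index 0).
hard_target = [1.0, 0.0, 0.0, 0.0]

# Distillation: the target is the teacher's full distribution over the vocabulary.
teacher_logits = [2.5, 2.0, 0.0, -2.0]  # assumed teacher output
soft_target = softmax(teacher_logits)

hard_loss = cross_entropy(hard_target, student_logits)
soft_loss = cross_entropy(soft_target, student_logits)

print("loss against one-hot token:", hard_loss)
print("loss against teacher distribution:", soft_loss)
```

The one-hot target only says which token was observed, while the teacher distribution also says how plausible every other token was at that position, so each training example carries more signal.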