
1. It’s faster and cheaper to train a smaller model

2. Better than training on hard tokens is training on the teacher's probability distributions (distillation), or on trees of probability distributions
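To make point 2 concrete, here is a minimal sketch of the basic distillation idea: instead of fitting one-hot token targets, the student is trained to match the teacher's full (temperature-softened) output distribution, typically via a KL-divergence loss. The function names and the NumPy formulation are illustrative, not from any of the linked papers.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax (higher T -> flatter distribution)."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over the softened vocabulary distribution.

    A one-hot token target would keep only the argmax; the soft
    distribution also preserves the teacher's relative uncertainty
    across the other tokens.
    """
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Toy logits over a 4-token vocabulary (hypothetical values):
teacher = np.array([4.0, 2.0, 1.0, 0.5])
student = np.array([3.0, 2.5, 0.5, 0.2])
loss = distillation_loss(teacher, student)  # nonnegative; 0 iff q == p
```

In practice this KL term is usually combined with the ordinary cross-entropy on the ground-truth tokens, weighted by a mixing coefficient.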



I've never seen anything about training on probability distributions or trees of them. Do you have articles with examples you could share with us?

I did try a quick search and found some interesting papers. Links below in case anyone else wants them.

https://arxiv.org/abs/2212.11481

https://towardsdatascience.com/a-new-way-to-predict-probabil...

https://arxiv.org/pdf/1912.07913.pdf

https://dukespace.lib.duke.edu/dspace/bitstream/handle/10161...



