> There is no general compression theory in which BWT could be described as a special case.
I would not really say this is true. BWT is spiritually similar to the various Prediction by Partial Matching (PPM) algorithms, except that instead of needing to decide how much context (i.e., preceding symbols) to use to model the next symbol, and carefully learning the probabilities for each unique context, it naturally sorts symbols with the same context (of _any_ length) so they appear next to each other, and relies on adaptation to update your learned statistics as you move from one context to the next, without ever explicitly tracking context boundaries [0].
I would not really say this is true. BWT is spiritually similar to the various Prediction by Partial Matching (PPM) algorithms, except that instead of needing to decide how much context (i.e., preceding symbols) to use to model the next symbol, and carefully learning the probabilities for each unique context, it naturally sorts symbols with the same context (of _any_ length) so they appear next to each other, and relies on adaptation to update your learned statistics as you move from one context to the next, without ever explicitly tracking context boundaries [0].
[0] N.J. Larsson, "The Context Trees of Block Sorting Compression," 1998. https://ieeexplore.ieee.org/abstract/document/672147