
My standard test is compressing a "dd" disk image of a Linux install (I use these for work), with unused blocks zeroed. Results:

    Uncompressed:      7,516,192,768
    zstd:              1,100,323,366
    bzip3 -b 511 -j 4: 1,115,125,019
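
Roughly, the procedure looks like this (the device, mount point, and file names below are placeholders, not my exact setup):

    # Fill the filesystem's free space with zeros, then delete the filler;
    # this dd is expected to stop once the filesystem is full:
    dd if=/dev/zero of=/mnt/image/zero.fill bs=1M
    rm /mnt/image/zero.fill
    sync

    # Take a raw image of the whole block device:
    dd if=/dev/sdX of=install.img bs=4M status=progress

    # Compress with each candidate and compare sizes:
    zstd install.img                  # writes install.img.zst, keeps the input
    bzip3 -b 511 -j 4 install.img     # writes install.img.bz3
    ls -l install.img*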


Hi, tool author here!

Thank you for your benchmark!

As you may be aware, different compression tools fill different niches for different data types. In particular, less specialised statistical methods (bzip2, bzip3, PPMd) generally perform poorly on loosely structured binary data, because the unnatural distribution of the underlying data - at least in bzip3's case - does not lend itself well to suffix sorting.

Conversely, Lempel-Ziv methods usually perform suboptimally on textual data, because the later entropy-coding stages cannot make good use of the information encoded in the match offsets while maintaining fast decompression - it's a long story that I could definitely go into detail about if you'd like, but I want to keep this reply short.

All things considered, data compression is more of an art than a science, trying to find an acceptable spot on the time-to-compression-ratio curve. I created bzip3 as an improvement on the original bzip2 algorithm, hoping that we can replace some of its uses with a more modern and worthwhile technology as of 2022. I have included benchmarks against LZMA, zstandard, etc. mostly as a formality; in reality, the choice of compression method depends heavily on what exactly you're trying to compress, but my personal stance is that bzip3 would likely be strictly better than bzip2 in all of those cases.

bzip3 usually operates on bigger block sizes, up to 16 times bigger than bzip2's. Additionally, bzip3 supports parallel compression/decompression out of the box. For fairness, the benchmarks were performed in single-threaded mode, though the comparison still isn't entirely apples-to-apples, since bzip3 uses a much bigger block size. What bzip3 aims to be is a replacement for bzip2 on modern hardware. What wasn't viable decades ago (arithmetic coding, context mixing, SAIS algorithms for BWT construction) has become viable nowadays, as CPU frequencies don't change much anymore, while caches and RAM keep getting bigger and faster.
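
For reference, the relevant flags look like this (the file name is just a placeholder):

    # bzip2's largest block size is 900 kB (-9); -k keeps the input:
    bzip2 -9 -k data.tar

    # bzip3: -b sets the block size in MiB and -j the number of threads,
    # matching the invocation in the parent comment:
    bzip3 -b 511 -j 4 data.tar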


Thanks for the reply. I just figured I'd try it and see, and the bzip3 results are extremely good. I figured it was worth trying because a fair bit of the data in that image is non-binary (man pages, config files, shell/python code), but probably the bulk of it is binary (kernel images, executables).


Shouldn't a modern compression tool, targeting a high compression ratio, try to switch its compression method on the fly depending on the input data?

I have no idea about compression, just a naive thought.


7-Zip can apply a BCJ filter before LZMA to compress x86 binaries more effectively: https://www.7-zip.org/7z.html. Btrfs’ transparent compression feature checks whether the first block compressed well; if not, it gives up on the rest of the file.
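
If I remember the syntax right, the filter can also be requested explicitly, something like this (archive and file names are placeholders):

    # 7-Zip: BCJ filter followed by LZMA2:
    7z a -m0=BCJ -m1=LZMA2 archive.7z some_binary

    # xz exposes the same idea for x86 code:
    xz --x86 --lzma2=preset=9 some_binary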



If the focus is on text, then the best example is probably the SQLite amalgamation file, which is a ~9 MB C file.
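
A quick way to run that comparison (the release in the URL is just an example):

    # Fetch the amalgamation and compare the tools on one big C file:
    curl -LO https://www.sqlite.org/2022/sqlite-amalgamation-3390400.zip
    unzip sqlite-amalgamation-3390400.zip
    zstd -19 sqlite-amalgamation-3390400/sqlite3.c
    bzip3 sqlite-amalgamation-3390400/sqlite3.c
    ls -l sqlite-amalgamation-3390400/sqlite3.c*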


A couple other data points:

    zstd --long --ultra -2:                                1,062,475,298
    zstd --long=windowLog --zstd=windowLog=31 --ultra -22: 1,041,203,362
So for my use case the additional settings don't seem worth it.
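
One note if anyone copies those flags: a file compressed with a window log above zstd's default decompression limit needs the matching option when decompressing too (file name is a placeholder):

    # The large window must be re-enabled on the way back,
    # otherwise zstd refuses and suggests --long / --memory:
    zstd -d --long=31 image.zst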



