Xz format inadequate for long-term archiving (2016)

st_goliath · on March 30, 2024

This article is originally from 2016, not from 2022, as has also been pointed out when it was last posted circa 11 hours ago.

2024: https://news.ycombinator.com/item?id=39868810

2022: https://news.ycombinator.com/item?id=32210438

2019: https://news.ycombinator.com/item?id=20103255

2018: https://news.ycombinator.com/item?id=16884832

2016: https://news.ycombinator.com/item?id=12768425

It is utterly unrelated to the ongoing issue (which I'm 100% sure is the reason this is on the front page for a second time within 12 hours right now).

The author is also the developer of lzip, while it does have some valid points, one should apply a necessary level of skepticism ("Competitors product is really bad and unfit", says vendor).

Lockal · on March 31, 2024

While you are saying that it is utterly unrelated, I'll respond with "that not only related, this is a root cause of current situation".

We have a single massively used implementation (also known as reference), which does not even follow xz format specification. It is autohell-based, which allowed m4 injection. In other places it is CMake-based, but uses autohell-based code, inserting hidden dots where they could just use https://cmake.org/cmake/help/latest/module/CheckSymbolExists...

"Poor design" is still there, as a core of library. It is as cool, 64-bit square roots and 128-bit divides in Bcachefs[1]

It is as cool as lbzip2[2] with their:

  (((0xffffaa50 >> ((as < 0x20 ? as : (as >> 4)) & 0x1e)) & 0x3) + (as < 0x20 ? 1 : 5))

Spoiler: this lbzip2 code produces corrupted files in some cases, should we care if it is a backdoor? Or as usual, disable optimizations, disable valgrind, disable fuzzers and say that everything is ok?

[1] https://www.phoronix.com/news/Linux-6.9-Bcachefs-Attempt

[2] https://github.com/kjn/lbzip2/blob/b6dc48a7b9bfe6b340ed1f6d7...

tex0 · on March 30, 2024

Thank you for reposting. I don't want to start bashing XZ, but I honestly wonder why it's been picked up so much despite the valid criticism.

As if the compression rate increase at medium levels were so significant over bzip2 or grip.

adeon · on March 30, 2024

If I piggyback off one of the commenters on the 11 hours ago post: https://news.ycombinator.com/item?id=39868810#39869769

> I think none of these issues really matter.

I think that's probably it. I.e. it's a nobody-cares issue. Looking at the older posts, there's some serious issues with lzip too (looks like corruption-related) that got a less than stellar response from the author.

Just quickly tested lzip vs zstd -9 on a 100 megabyte text file. zstd is almost as good but many times faster. I wonder if the lzip author's work got obsolete.

I skimmed the manual of lzip a bit. The author likes to talk a lot about how everything is done correctly in lzip and there's this line line "The lzip format specification has been reviewed carefully and is believed to be free from design errors. " If you type lzip --help it talks about how it's better than bzip2 or gzip.

Maybe they are right but ugh comes off real arrogant.

Bulat_Ziganshin · on March 31, 2024

1. both lzip and xz are using lzma compression library internally, so there is no difference in their compression ratio/speed

2. lzma compression is LZ + markov chains, while zstd is LZ + order-0 entropy coder (similar to zlib, rar and many other popular algorithms)

markov chains are higher-order entropy coding, i.e. one using context of previous data. it's slower, but sometimes gives better compression. but text files don't get any benefit from it. OTOH, various binary formats, like executables or databases, get significantly better compression ratio. in my tests lzma-compressed binary files are ~~10% smaller on average.

so, many claims that zstd and lzma provides the same compression ratio, are based on testing on text or other files that don't benefit from higher-order entropy coding. of course, I imply maximum-compression setting and equal dictionary size, in order to make fair comparison of compression formats rather than particular implementations.

(I'm author of freearc and several LZ-based compressors, so more or less expert in this area :)

Maxious · on March 30, 2024

Phil Katz was real arrogant but also real right

wongarsu · on March 30, 2024

Because xz archives the best compression ratios. If you care about archive size above all else it's the best choice. If you use "medium levels" I agree there's no point, zstd is superior in that regime (achieving faster compression and decompression for these compression ratios)

lifthrasiir · on March 30, 2024

Any compressed format is okay as long as you have a working compressor and decompressor, where the compressor can reliably compress an input and the decompressor can reliably decompress the compressed input. Everything else is not as relevant, including the exact file format (because you have the known-good decompressor).

JoshTriplett · on March 30, 2024

xz is not worth using at medium levels. At its highest levels, it sometimes just barely has an edge over the highest levels of zstd, and it existed before zstd did.

These days, zstd is the obvious choice.

mort96 · on March 30, 2024

I never really understood the point of this article. The way to have reliable long term storage isn't to use compression formats which are resilient against bit flips; none of them really are. The way to have reliable long term storage is to protect against bit flips using checksums and redundancy.

These are alright criticisms of the xz format, but nothing which would make me wary of using xz for long term archival.

livrem · on March 30, 2024

I liked that the article begins with the Hoare quote "One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies" that was also the exact first thing I thought about when reading about the mess with the backdoor yesterday.

lifthrasiir · on March 30, 2024

That's irrelevant to the file formats themselves, however. A simple file format doesn't automatically guarantee a simple implementation, especially for compression algorithms.

lelanthran · on March 31, 2024

> That's irrelevant to the file formats themselves, however. A simple file format doesn't automatically guarantee a simple implementation,

Maybe it doesn't guarantee it, but at least a simple format makes it possible.

A complex format makes it impossible to have a simply implementation.

lifthrasiir · on March 31, 2024

But both the lzip and xz file format is much simpler than the actual meat, the compressed stream format. I agree that xz has lots of redundant and useless bits, but it is still simple enough to summarize in a single comment [1]. Lzip looks simple, but it actually uses a variant of LZMA stream format so it is not simpler to implement than xz since you can't even use a stock LZMA library.

[1] https://news.ycombinator.com/item?id=39873112

Bulat_Ziganshin · on April 2, 2024

only in indirect way - simpler format could allow to find another real maintainer *or even continue to maintain it himself), because it has less features.

but I think xz won the linux-lzma-archive formats war exactly because it was more complex than lzip, e.g. by supporting more versatile lzma2 format, which allowed to implement multi-threaded compression and decompression (pretty sophisticated things), and better crc64/sha256 hash checking

cess11 · on March 30, 2024

Over the past two-three decades of computering I've only noticed xz in deb-packages.

Who's using xz as a long-term archiving or general purpose compression format? TFA doesn't give any examples of this, only that the Debian project picked it for packages, presumably because it creates smaller files than gzip and bzip2. And apt packages are arguably very ephemeral, if a compressed file turned out bad or bitrotted before culling from the repo it's likely to be discovered and fixed easily.

I work in e-archiving, with public sector clients, where we actually do long-term storage of whatever data format. Never seen xz used there.

lifthrasiir · on March 30, 2024

Does e-archiving use bzip2? I think xz was mostly advertised (and successfully adopted) as a substitute for bzip2, because bzip2 was even slower than xz.

cess11 · on March 30, 2024

I hardly ever come across information-level compression, and when I do it's regular ZIP. The OAIS-implementations my clients use require ZIP or TAR on the packages going in and out of the archiving software, in practice it tends to ZIP. Not sure whether the OAIS itself limits the options in the same way.

tutfbhuf · on March 30, 2024

I wonder why no one mentioned Zstandard yet. It is 10x faster to decompress than most other formats and offers comparable compression ratios.

user20180120 · on March 30, 2024

Why is the Long Range Zip lrzip compression format not used? It gives better compression than xz when using the correct switches.

JoshTriplett · on March 30, 2024

zstd has a long-range mode, and is far more widely used.

m463 · on March 30, 2024

Took me a while to realize tar format doesn't have any crc/checksum.

(usually compression format has it though)

lifthrasiir · on March 30, 2024

I can see why the lzip author frustrated at the xz format, especially given that there are so many checksums and paddings around, but the lzip format is the opposite extreme.

7-zip already had a concept of multiple filters which contribute to its efficiency, and the underlying design of xz does capture them without much complication. For example, filters in the original 7-zip format (or "codecs") can have both multiple input and output streams [1]. This makes less sense for a single file compressor and xz carefully avoided them. The main problem with the xz format is not its concept but more about its concrete implementation: you don't need extensibility, you only need agility.

In comparison, lzip is too minimal. It might be technically agile by its version field, but it wouldn't if you do nothing and claim that you are open to any addition. It is not hard to pick some filters and mandate only most useful combinations of filters. The stream could have been periodically interrupted to give an early chance to detect errors before the member footer. (Unless lzip natively produces a multimember file even for a single input, which is AFAIK not the case.) The lzip author claims that a corruption in the compressed data can be detected from the decompression process, but that would mean too much redundancy in the compressed data, so this claim has been clearly misguided. And what the heck is that dictionary size coding? Compressed formats frequently make use of exponent-mantissa encodings but I have never seen an encoding where the mantissa is subtracted.

Of course, both should be avoided at this point because zstd is fast and efficient enough. Also, the file format for zstd is better than both in my opinion.

(I've posted the same comment in the older thread, and I also posted my summary of all three file formats so that you can feel what I'm talking about: https://news.ycombinator.com/item?id=39873112)

[1] https://py7zr.readthedocs.io/en/latest/archive_format.html#c...

Bulat_Ziganshin · on April 2, 2024

afaik, 7-zip filters can't have multiple inputs (at the encoding stage).

multiple outputs are necessary for filters that output multiple independent data streams such as bcj2. and they are equally useful for archivers and compressors.

(I'm author of freearc, another archiver software, and multiple compression algos)

PS: thank you for format comparison, it would be great to put xz format description onto its Wikipedia page. I already used you description to understand why attackers added 8 "random" bytes to one of their scripts - probably to "fix" crc-64 value.

lifthrasiir · on April 3, 2024

(Welcome to the HN, by the way! While I have also written my own compressor, I'm just a hobbyist compared to people in encode.su, to be sure :-)

> afaik, 7-zip filters can't have multiple inputs (at the encoding stage).

I don't know whether this is actually used or not, but py7zr's specification clearly mentions that complex coders have `NumInStreams` and `NumOutStreams` parameters specified.

> multiple outputs are necessary for filters that output multiple independent data streams such as bcj2. and they are equally useful for archivers and compressors.

Correct in principle, but those logical streams can still be framed and concatenated into a single physical stream. Any compressor that can handle non-stationary distributions would work well, it might be even possible to hint the logical stream boundary to guide compressors.

Several physical streams are absolutely necessary when those streams are going through different filters, but a lack of them doesn't harm too much either. Also I think xz was planning for "subblock" filters to address this use case anyway.