rsync was indeed a dramatic "this should not be possible" feat.
But the way it uses rolling hashes is not the simplest possible. One side sends hashes of fixed-size chunks, the other scans input for rolling hash matches, which only works for 2 sides and requires non-cachable CPU work on at least one side.
There is an even simpler idea at the core of many modern backup programs (eg. bup, attic, borg)! Possibly pioneered in `gzip --rsyncable`, not sure.
https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli...
Each side slices files into chunks at points where rolling hash % N == 0, giving chunks of variable size averaging N.
These points usually self-synchronize after insertions and deletions.
This allows not just comparing 2 files for common parts, but also deduplicated storage of ANY number of files as lists of chunk hashes, pointing to just one copy of each chunk!
There is an even simpler idea at the core of many modern backup programs (eg. bup, attic, borg)! Possibly pioneered in `gzip --rsyncable`, not sure. https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli... Each side slices files into chunks at points where rolling hash % N == 0, giving chunks of variable size averaging N. These points usually self-synchronize after insertions and deletions. This allows not just comparing 2 files for common parts, but also deduplicated storage of ANY number of files as lists of chunk hashes, pointing to just one copy of each chunk!