> 1. We don’t want to store a hash for every keystroke a user makes.
So, why does it have to be every keystroke? We can batch them locally. A rough rule of thumb for sizing the batches is typing speed: at 200 keystrokes/minute, we get an average of 3.33 keystrokes/second. If we allow 3 seconds of latency for the collaborative editor to resolve concurrent edits, that's about 10 keystrokes per batch. You can also add an override where a smaller batch is replicated to other nodes once sufficient time has elapsed since the last local edit (e.g., 10 seconds have passed and we only have a partial batch of 3 keystrokes).
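As a rough sketch of what I mean (the names are made up; the thresholds are just the numbers above):

```rust
use std::mem;
use std::time::{Duration, Instant};

/// Hypothetical local batcher: accumulates keystrokes and flushes either when
/// the batch is full (~3s of typing) or when the editor has been idle too long.
struct KeystrokeBatcher {
    buf: String,
    last_edit: Instant,
    max_batch: usize,
    max_idle: Duration,
}

impl KeystrokeBatcher {
    fn new() -> Self {
        KeystrokeBatcher {
            buf: String::new(),
            last_edit: Instant::now(),
            max_batch: 10,                     // ~3 seconds at 3.33 keystrokes/sec
            max_idle: Duration::from_secs(10), // flush partial batches after 10s idle
        }
    }

    /// Record a local keystroke; returns a full batch ready to replicate.
    fn push(&mut self, ch: char) -> Option<String> {
        self.buf.push(ch);
        self.last_edit = Instant::now();
        if self.buf.chars().count() >= self.max_batch {
            Some(mem::take(&mut self.buf))
        } else {
            None
        }
    }

    /// Call periodically (e.g. from a timer) to flush a stale partial batch.
    fn tick(&mut self) -> Option<String> {
        if !self.buf.is_empty() && self.last_edit.elapsed() >= self.max_idle {
            Some(mem::take(&mut self.buf))
        } else {
            None
        }
    }
}
```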
> 2. If I accidentally paste an AWS key into a document, I don’t want that key in the document history forever. How do I enable that while still getting all the benefits of the hashing system above?
In order for this to work, we should assume that every replica's current state is always obtained by starting with the empty document and replaying the operations log. We should also permit the author of an operation to roll back that operation. So, if the collaborative editor shows an interface listing all operations that originated locally, the author can select the operations that inserted the AWS key and issue a rollback op, which is then propagated to all replicas.
Since each replica's point-in-time state is regenerated by replaying the operations log, the accidentally inserted key would then be gone from the document.
To address the issue of the AWS key still being in the operations log, we can define the semantics of the rollback op to also include secure erasure of the original operation from the log (i.e., the rollback op rewrites the original operation into a no-op).
So, once every replica applies the rollback op initiated by the author, they all eventually converge to a state where the accidentally inserted AWS key is gone from both the document and the operations log.
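Here's a minimal sketch of that rollback-as-no-op idea (integer positions stand in for real CRDT position IDs, and this is not any particular library's op format):

```rust
/// Rough sketch only: `Op` is a made-up operation type.
#[derive(Clone, Debug)]
enum Op {
    Insert { pos: usize, text: String },
    NoOp, // tombstone left behind after the rollback's secure erasure
}

/// The rollback op's effect on the log: the original bytes are overwritten.
fn rollback(log: &mut Vec<Op>, target: usize) {
    log[target] = Op::NoOp;
}

/// Every replica's state is "empty document + replay the log".
fn replay(log: &[Op]) -> String {
    let mut doc = String::new();
    for op in log {
        if let Op::Insert { pos, text } = op {
            doc.insert_str(*pos, text);
        }
    }
    doc
}

fn main() {
    let mut log = vec![
        Op::Insert { pos: 0, text: "hello world".into() },
        Op::Insert { pos: 5, text: " AKIA_EXAMPLE_KEY".into() }, // accidental paste
    ];
    rollback(&mut log, 1); // author rolls back the paste
    assert_eq!(replay(&log), "hello world"); // key is gone from state *and* log
}
```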
> So, why does it have to be every keystroke? We can batch them locally.
This approach would absolutely work. Thanks for bringing it up! But it has a cost I'd rather not pay.
The problem is that batching only works if you don't send anything until the end of the batch. And that gives you an awful tradeoff: the more "live" your editing system is, the bigger the overhead. It's really nice seeing someone type live. Seeing their changes "pop in" after 5 seconds doesn't feel great. It doesn't feel current. And I think it's too slow for pair programming over Zoom and things like that. It also makes the DAG of changes much more complicated, because there's a much higher chance that changes are concurrent when there's that much latency. And that in turn makes other parts of the system do a lot more work to merge.
With hashing alone, you don't need to batch changes like that. Instead, you can just store hashes on disk for (for example) every 100 changes, and regenerate the hashes in the middle when needed. (Though you need some other tricks so you know where to look for a given hash.)
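Roughly what I mean, as a sketch (DefaultHasher here is just a dependency-free stand-in for a real cryptographic hash like BLAKE3 or SHA-256):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const CHECKPOINT: usize = 100;

/// Chain each operation into the running hash.
fn chain(prev: u64, op: &str) -> u64 {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    op.hash(&mut h);
    h.finish()
}

/// Store only the hash after every 100th operation.
fn checkpoints(ops: &[String]) -> Vec<u64> {
    let mut out = Vec::new();
    let mut h = 0u64;
    for (i, op) in ops.iter().enumerate() {
        h = chain(h, op);
        if (i + 1) % CHECKPOINT == 0 {
            out.push(h);
        }
    }
    out
}

/// Regenerate the hash of operation `k` by replaying from the nearest earlier
/// checkpoint rather than from the start of history.
fn hash_of(ops: &[String], cps: &[u64], k: usize) -> u64 {
    let cp = k / CHECKPOINT;
    let mut h = if cp == 0 { 0 } else { cps[cp - 1] };
    for op in &ops[cp * CHECKPOINT..=k] {
        h = chain(h, op);
    }
    h
}

fn main() {
    let ops: Vec<String> = (0..250).map(|i| format!("op {i}")).collect();
    let cps = checkpoints(&ops);
    // Recomputing from the nearest checkpoint matches a full replay.
    let full = ops[..=150].iter().fold(0u64, |h, op| chain(h, op));
    assert_eq!(hash_of(&ops, &cps, 150), full);
}
```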
Respect where it's due: that would 100% solve it. I'm just looking for a way to delete data without giving up the liveness.
> We should also permit the author of an operation to roll back that operation.
This doesn't completely solve the problem of hashes and guessability. Say in the 5-second batch I only type 1 character. We store some metadata, the character, and the hash of the metadata & character. Then in the next 5 seconds I issue a rollback operation to delete that character. Even if I delete the character from durable storage, I can still figure out what the character was by brute-forcing candidates and checking them against the hash for that batch.
I think that's fixable by adding junk data to every batch. I just bet there are ways to avoid doing that.
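To make the guessability concrete, a quick sketch (the metadata is hypothetical, and DefaultHasher stands in for a real cryptographic hash):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash the batch's metadata, its single character, and optionally some junk.
fn batch_hash(meta: &str, ch: char, salt: Option<u64>) -> u64 {
    let mut h = DefaultHasher::new();
    meta.hash(&mut h);
    ch.hash(&mut h);
    if let Some(s) = salt {
        s.hash(&mut h); // per-batch random junk: the "junk data" fix
    }
    h.finish()
}

fn main() {
    let meta = "agent=seph, seq=42";
    let secret = 'K';

    // Unsalted: ~100 printable-ASCII guesses recover the "deleted" character.
    let stored = batch_hash(meta, secret, None);
    let guessed = (' '..='~').find(|&c| batch_hash(meta, c, None) == stored);
    assert_eq!(guessed, Some(secret));

    // With random junk that is erased along with the character, the same
    // brute force no longer works: the attacker must guess the salt too.
    let _salted = batch_hash(meta, secret, Some(0x9e37_79b9_7f4a_7c15));
}
```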
Are you using a rolling hash (such as the one employed in the Rabin-Karp string search algorithm)? You could then store the hash at checkpoints, and the intermediate hashes between checkpoints can be regenerated using the rolling hash.
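Roughly what I have in mind, as a sketch (a Rabin-Karp-style polynomial prefix hash; the base and modulus are arbitrary example constants):

```rust
const B: u64 = 257;
const M: u64 = (1u64 << 61) - 1;

/// Extend the prefix hash by one character in O(1).
fn extend(prev: u64, ch: char) -> u64 {
    (((prev as u128) * (B as u128) + ch as u128) % (M as u128)) as u64
}

fn main() {
    let text = "collaborative text editing";
    let mut h = 0u64;
    let mut checkpoints = Vec::new();
    for (i, ch) in text.chars().enumerate() {
        h = extend(h, ch);
        if (i + 1) % 8 == 0 {
            checkpoints.push((i, h)); // only every 8th hash is stored
        }
    }
    // Regenerate an intermediate hash by extending from the nearest earlier
    // checkpoint instead of re-hashing from the start of the text.
    let (start, mut h2) = checkpoints[0];
    for ch in text.chars().skip(start + 1).take(5) {
        h2 = extend(h2, ch);
    }
    println!("hash after char {}: {}", start + 5, h2);
}
```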
No, the GP is right. Text CRDTs (including mine) literally generate an operation per inserted character in a collaborative text document. Diamond types solves the data explosion problem by not storing the operations separately. Instead, we internally run-length encode everything.
If you have these two operations:
Insert "h" at position 5, ID (seph, 2)
Insert "i" at position 6, ID (seph, 3)
... They get immediately merged internally to become:
Insert "hi" at position 5-6, ID (seph, 2-3)
This reduces the number of items we need to process by an order of magnitude. In practice, the metadata (positions, IDs, etc.) ends up at about 0.1 bytes on disk per inserted character, which is awesome.
This works great for (agent, seq) style IDs because the IDs can also be run-length encoded. But we can't run-length encode hashes. And I don't want my file size to grow by 32x because we store a hash after every keystroke.
(I didn't invent this idea. Martin Kleppmann first used it for storing CRDTs on disk, and Kevin Jahns made the in-memory structures in yjs use the same tricks.)
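As a rough sketch of that merge (not the actual internal data structures):

```rust
/// A run of contiguous single-character inserts stored as one item, with the
/// positions and (agent, seq) IDs kept as ranges.
#[derive(Debug)]
struct RleInsert {
    agent: String,
    seq_start: u32, // covers seq_start .. seq_start + text.len()
    pos: usize,     // covers pos .. pos + text.len()
    text: String,
}

impl RleInsert {
    /// Extend this run if the new single-character insert is contiguous
    /// (same agent, next sequence number, next position).
    fn try_append(&mut self, agent: &str, seq: u32, pos: usize, ch: char) -> bool {
        let len = self.text.chars().count();
        if agent == self.agent && seq == self.seq_start + len as u32 && pos == self.pos + len {
            self.text.push(ch);
            true
        } else {
            false
        }
    }
}

fn main() {
    // Insert "h" at position 5, ID (seph, 2)
    let mut run = RleInsert {
        agent: "seph".into(),
        seq_start: 2,
        pos: 5,
        text: "h".into(),
    };
    // Insert "i" at position 6, ID (seph, 3) extends the run in place...
    assert!(run.try_append("seph", 3, 6, 'i'));
    assert_eq!(run.text, "hi"); // ...giving: Insert "hi" at 5-6, ID (seph, 2-3)
    // A non-contiguous edit would start a new run instead.
    assert!(!run.try_append("seph", 9, 40, 'x'));
}
```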
I was responding to the diamond-types CRDT author's question in the parent comment. Their GitHub project page [1] mentions text editing a lot:
> This repository contains a high performance rust CRDT for text editing. This is a special data type which supports concurrent editing of lists or strings (text documents) by multiple users in a P2P network without needing a centralized server.
> This version of diamond types only supports plain text editing. Work is underway to add support for other JSON-style data types. See the more_types branch for details.
In any case, I agree with the metaphor, and batching granular operations can always be done.