> 1. We don’t want to store a hash for every keystroke a user makes.
So, why does it have to be every keystroke? We can batch them locally. A rough rule of thumb for sizing the batches is typing speed: at 200 keystrokes/minute, we get an average of 3.33 keystrokes/second. If we allow 3 seconds of latency for the collaborative editor to resolve concurrent edits, that's about 10 keystrokes per batch. You can also add an override where a smaller batch is replicated to other nodes once sufficient time has elapsed since the last local edit (e.g., 10 seconds have passed and we only have a partial batch of 3 keystrokes).
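As a rough sketch of what I mean (the names are made up; the thresholds are just the numbers above):

```rust
use std::mem;
use std::time::{Duration, Instant};

/// Hypothetical local batcher: accumulates keystrokes and flushes either when
/// the batch is full (~3s of typing) or when the editor has been idle too long.
struct KeystrokeBatcher {
    buf: String,
    last_edit: Instant,
    max_batch: usize,
    max_idle: Duration,
}

impl KeystrokeBatcher {
    fn new() -> Self {
        KeystrokeBatcher {
            buf: String::new(),
            last_edit: Instant::now(),
            max_batch: 10,                     // ~3 seconds at 3.33 keystrokes/sec
            max_idle: Duration::from_secs(10), // flush partial batches after 10s idle
        }
    }

    /// Record a local keystroke; returns a full batch ready to replicate.
    fn push(&mut self, ch: char) -> Option<String> {
        self.buf.push(ch);
        self.last_edit = Instant::now();
        if self.buf.chars().count() >= self.max_batch {
            Some(mem::take(&mut self.buf))
        } else {
            None
        }
    }

    /// Call periodically (e.g. from a timer) to flush a stale partial batch.
    fn tick(&mut self) -> Option<String> {
        if !self.buf.is_empty() && self.last_edit.elapsed() >= self.max_idle {
            Some(mem::take(&mut self.buf))
        } else {
            None
        }
    }
}
```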
> 2. If I accidentally paste an AWS key into a document, I don’t want that key in the document history forever. How do I enable that while still getting all the benefits of the hashing system above?
In order for this to work, we should assume that every replica's current state is always obtained by starting with the empty document and replaying the operations log. We should also permit the author of an operation to roll back that operation. So, if the collaborative editor shows an interface listing all operations that originated locally, the author can select the operations that inserted the AWS key and issue a rollback op, which is then propagated to all replicas.
Since each replica's point-in-time state is regenerated by replaying the operations log, the accidentally inserted key would then be gone from the document.
To address the issue of the AWS key still being in the operations log, we can define the semantics of the rollback op to also include secure erasure of the original operation from the log (i.e., the rollback op rewrites the original operation into a no-op).
So, once every replica applies the rollback op initiated by the author, they all eventually converge to a state where the accidentally inserted AWS key is gone from both the document and the operations log.
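Here's a minimal sketch of that rollback-as-no-op idea (integer positions stand in for real CRDT position IDs, and this is not any particular library's op format):

```rust
/// Rough sketch only: `Op` is a made-up operation type.
#[derive(Clone, Debug)]
enum Op {
    Insert { pos: usize, text: String },
    NoOp, // tombstone left behind after the rollback's secure erasure
}

/// The rollback op's effect on the log: the original bytes are overwritten.
fn rollback(log: &mut Vec<Op>, target: usize) {
    log[target] = Op::NoOp;
}

/// Every replica's state is "empty document + replay the log".
fn replay(log: &[Op]) -> String {
    let mut doc = String::new();
    for op in log {
        if let Op::Insert { pos, text } = op {
            doc.insert_str(*pos, text);
        }
    }
    doc
}

fn main() {
    let mut log = vec![
        Op::Insert { pos: 0, text: "hello world".into() },
        Op::Insert { pos: 5, text: " AKIA_EXAMPLE_KEY".into() }, // accidental paste
    ];
    rollback(&mut log, 1); // author rolls back the paste
    assert_eq!(replay(&log), "hello world"); // key is gone from state *and* log
}
```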
> So, why does it have to be every keystroke? We can batch them locally.
This approach would absolutely work. Thanks for bringing it up! But it has a cost I'd rather not pay.
The problem is that batching only works if you don't send anything until the end of the batch. And that gives you an awful tradeoff: the more "live" your editing system is, the bigger the overhead. It's really nice seeing someone type live. Seeing their changes "pop in" after 5 seconds doesn't feel great. It doesn't feel current. And I think it's too slow for pair programming over Zoom and things like that. It also makes the DAG of changes much more complicated, because there's a much higher chance that changes are concurrent when there's that much latency. And that in turn makes other parts of the system do a lot more work to merge.
With hashing alone, you don't need to batch changes like that. Instead, you can just store hashes on disk for (for example) every 100 changes, and regenerate the hashes in the middle when needed. (Though you need some other tricks so you know where to look for a given hash.)
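Roughly what I mean, as a sketch (DefaultHasher here is just a dependency-free stand-in for a real cryptographic hash like BLAKE3 or SHA-256):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const CHECKPOINT: usize = 100;

/// Chain each operation into the running hash.
fn chain(prev: u64, op: &str) -> u64 {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    op.hash(&mut h);
    h.finish()
}

/// Store only the hash after every 100th operation.
fn checkpoints(ops: &[String]) -> Vec<u64> {
    let mut out = Vec::new();
    let mut h = 0u64;
    for (i, op) in ops.iter().enumerate() {
        h = chain(h, op);
        if (i + 1) % CHECKPOINT == 0 {
            out.push(h);
        }
    }
    out
}

/// Regenerate the hash of operation `k` by replaying from the nearest earlier
/// checkpoint rather than from the start of history.
fn hash_of(ops: &[String], cps: &[u64], k: usize) -> u64 {
    let cp = k / CHECKPOINT;
    let mut h = if cp == 0 { 0 } else { cps[cp - 1] };
    for op in &ops[cp * CHECKPOINT..=k] {
        h = chain(h, op);
    }
    h
}

fn main() {
    let ops: Vec<String> = (0..250).map(|i| format!("op {i}")).collect();
    let cps = checkpoints(&ops);
    // Recomputing from the nearest checkpoint matches a full replay.
    let full = ops[..=150].iter().fold(0u64, |h, op| chain(h, op));
    assert_eq!(hash_of(&ops, &cps, 150), full);
}
```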
Respect where it's due: that would 100% solve it. I'm just looking for a way to delete data without giving up the liveness.
> We should also permit the author of an operation to roll back that operation.
This doesn't completely solve the problem of hashes and guessability. Say in the 5-second batch I only type 1 character. We store some metadata, the character, and the hash of the metadata & character. Then in the next 5 seconds I issue a rollback operation to delete that character. Even if I delete the character from durable storage, I can still figure out what the character was by brute-forcing candidates and checking them against the hash for that batch.
I think that's fixable by adding junk data to every batch. I just bet there are ways to avoid doing that.
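To make the guessability concrete, a quick sketch (the metadata is hypothetical, and DefaultHasher stands in for a real cryptographic hash):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash the batch's metadata, its single character, and optionally some junk.
fn batch_hash(meta: &str, ch: char, salt: Option<u64>) -> u64 {
    let mut h = DefaultHasher::new();
    meta.hash(&mut h);
    ch.hash(&mut h);
    if let Some(s) = salt {
        s.hash(&mut h); // per-batch random junk: the "junk data" fix
    }
    h.finish()
}

fn main() {
    let meta = "agent=seph, seq=42";
    let secret = 'K';

    // Unsalted: ~100 printable-ASCII guesses recover the "deleted" character.
    let stored = batch_hash(meta, secret, None);
    let guessed = (' '..='~').find(|&c| batch_hash(meta, c, None) == stored);
    assert_eq!(guessed, Some(secret));

    // With random junk that is erased along with the character, the same
    // brute force no longer works: the attacker must guess the salt too.
    let _salted = batch_hash(meta, secret, Some(0x9e37_79b9_7f4a_7c15));
}
```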
Are you using a rolling hash (such as the one employed in the Rabin-Karp string search algorithm)? You could then store the hash at checkpoints, and the intermediate hashes between checkpoints can be regenerated using the rolling hash.
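Roughly what I have in mind, as a sketch (a Rabin-Karp-style polynomial prefix hash; the base and modulus are arbitrary example constants):

```rust
const B: u64 = 257;
const M: u64 = (1u64 << 61) - 1;

/// Extend the prefix hash by one character in O(1).
fn extend(prev: u64, ch: char) -> u64 {
    (((prev as u128) * (B as u128) + ch as u128) % (M as u128)) as u64
}

fn main() {
    let text = "collaborative text editing";
    let mut h = 0u64;
    let mut checkpoints = Vec::new();
    for (i, ch) in text.chars().enumerate() {
        h = extend(h, ch);
        if (i + 1) % 8 == 0 {
            checkpoints.push((i, h)); // only every 8th hash is stored
        }
    }
    // Regenerate an intermediate hash by extending from the nearest earlier
    // checkpoint instead of re-hashing from the start of the text.
    let (start, mut h2) = checkpoints[0];
    for ch in text.chars().skip(start + 1).take(5) {
        h2 = extend(h2, ch);
    }
    println!("hash after char {}: {}", start + 5, h2);
}
```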
No, the GP is right. Text CRDTs (including mine) literally generate an operation per inserted character in a collaborative text document. Diamond types solves the data explosion problem by not storing the operations separately. Instead, we internally run-length encode everything.
If you have these two operations:
Insert "h" at position 5, ID (seph, 2)
Insert "i" at position 6, ID (seph, 3)
... They get immediately merged internally to become:
Insert "hi" at position 5-6, ID (seph, 2-3)
This reduces the number of items we need to process by an order of magnitude. In practice, the metadata (positions, IDs, etc.) ends up at about 0.1 bytes on disk per inserted character, which is awesome.
This works great for (agent, seq) style IDs because the IDs can also be run-length encoded. But we can't run-length encode hashes. And I don't want my file size to grow by 32x because we store a hash after every keystroke.
(I didn't invent this idea. Martin Kleppmann first used it for storing CRDTs on disk, and Kevin Jahns made the in-memory structures in yjs use the same tricks.)
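As a rough sketch of that merge (not the actual internal data structures):

```rust
/// A run of contiguous single-character inserts stored as one item, with the
/// positions and (agent, seq) IDs kept as ranges.
#[derive(Debug)]
struct RleInsert {
    agent: String,
    seq_start: u32, // covers seq_start .. seq_start + text.len()
    pos: usize,     // covers pos .. pos + text.len()
    text: String,
}

impl RleInsert {
    /// Extend this run if the new single-character insert is contiguous
    /// (same agent, next sequence number, next position).
    fn try_append(&mut self, agent: &str, seq: u32, pos: usize, ch: char) -> bool {
        let len = self.text.chars().count();
        if agent == self.agent && seq == self.seq_start + len as u32 && pos == self.pos + len {
            self.text.push(ch);
            true
        } else {
            false
        }
    }
}

fn main() {
    // Insert "h" at position 5, ID (seph, 2)
    let mut run = RleInsert {
        agent: "seph".into(),
        seq_start: 2,
        pos: 5,
        text: "h".into(),
    };
    // Insert "i" at position 6, ID (seph, 3) extends the run in place...
    assert!(run.try_append("seph", 3, 6, 'i'));
    assert_eq!(run.text, "hi"); // ...giving: Insert "hi" at 5-6, ID (seph, 2-3)
    // A non-contiguous edit would start a new run instead.
    assert!(!run.try_append("seph", 9, 40, 'x'));
}
```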
I was responding to the diamond-types CRDT author's question in the parent comment. Their GitHub project page [1] mentions text editing a lot:
> This repository contains a high performance rust CRDT for text editing. This is a special data type which supports concurrent editing of lists or strings (text documents) by multiple users in a P2P network without needing a centralized server.
> This version of diamond types only supports plain text editing. Work is underway to add support for other JSON-style data types. See the more_types branch for details.
In any case, I agree with the metaphor, and batching granular operations can always be done.