Ask HN: Securely finding the intersection of two lists?

klavinski · on Dec 12, 2019

What about exchanging the hashes of the customers? A common hash would mean a common customer.

"What if he sends back my list of hashes, although he has no common customer?"

I would do as follows:

1. Both pick a hashing function (f and g).

2. Both share their hashing function.

3. Both hash their list with two functions.

4. Both send the list of hashes computed with the other's function.

5. Common hashes <=> common customer.

For example, my list: Amanda, Patrick

His list: Patrick, Stanley

I choose f, he chooses g.

My list |> f = 12, 54 (mine)

My list |> g = 36, 2 (public)

His list |> f = 54, 7 (public)

His list |> g = 2, 68 (his)

I compare both lists with f, and he does the same with g. I conclude the only common customer is Patrick. He would not be able to send 54 without knowing Patrick or reverse-engineering the function (or being pretty lucky).

yorwba · on Dec 12, 2019

That scheme has the weakness that if the space of possible values is easy to enumerate (e.g. by buying lists of leaked email addresses off the black market), you can hash them all and find out whether they're on the other person's list without their knowledge, and you can do that at any later point without their cooperation.

gshdg · on Dec 14, 2019

What possible solution to this problem isn’t vulnerable to the first part of this, tho?

yorwba · on Dec 15, 2019

This is essentially a "secure multi-party computation" problem: if you had a trusted third party that both sides can communicate with using encrypted channels, they could just send their lists to the third party and get the intersection back.

There are standard techniques in secure multi-party computation that can be used to eliminate the need for a trusted third party by replacing it with a cryptographic communication protocol that can only be executed by all participants acting together, for example garbled circuits [0]. Basically, one participant creates an encrypted representation of the computation as a circuit and sends it to the other participant, who can execute the circuit on encrypted inputs to get an encrypted result, which only the sender of the circuit can decrypt, so they need to be sent back. IIRC, there's no way to prevent the sender from simply not sharing the decrypted results with the receiver, but this is super obvious, so it can't happen in secret. Because the encryption prevents either party from executing the complete computation on their own, it's also not possible to secretly try multiple inputs to extract additional information.

The one attack this can't protect against is if one party lies about the contents of the list. (Note that this is different from reporting one's own list honestly, but trying to also get the other list.) While this can't be prevented, you could require the size of each input list to be part of the output, which would make this kind of cheating detectable as well.

I'm sure the research literature on this problem also has some more specific techniques that would be preferable to the generic approach described above.

[0] https://en.wikipedia.org/wiki/Garbled_circuit

aaronharnly · on Dec 12, 2019

Yeah that’s a good technique: unfortunately yorwaba’s point applies for my situation — sending the hashed list makes it vulnerable to offline attack by enumerating the source data. That is, my partner can take my hashed list, generate all possible customer IDs, and then recover my full list.

If there were functions such that we could keep our hash functions private, and yet f(g(x)) = g(f(x)), that would solve it, which I guess is what the fancy techniques accomplish.

aaronharnly · on Dec 11, 2019

This is one of the many papers on the topic of algorithms to compute set intersection while revealing little or no additional information - but I haven't seen anyone running this as a service or writing a tool to make it possible for two parties:

https://pdfs.semanticscholar.org/389d/1af665a5111306e1f118ff...

kkm · on Dec 12, 2019

There is some similar work in the space of checking if the credentials being used by the user are compromised or not.

Example: https://arxiv.org/pdf/1905.13737.pdf

gwittel · on Dec 13, 2019

I haven’t seen an off the shelf thing. It might be possible with PKI incantations but the first that comes to mind without a third party is homomorphic encryption.

Libs like MS SEAL only do limited numeric operations. Without reading the research, that seems fine. Hash each name to a number. Encrypt, then do a compare. (Lots of hand waving here).

Another possible scheme are things like the Google safe browsing update/client side protocol. There you’re working against hashes only. Depending on needs it may be good enough.

eb0la · on Dec 12, 2019

If you can tolerate some error, a bloom filter might work.

Both parties make their own bloom filter and send the results to the other party.

The good part is no personal data goes inside. I mean, you cannot tell beforehand if someone is a customer until you try to store their data into the filter.

Bad news: bloom filters have erros. So, before doing this better guess the monetary value of the error.