Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Ask HN: Securely finding the intersection of two lists?
12 points by aaronharnly on Dec 11, 2019 | hide | past | favorite | 9 comments
I've now three or four times run into the situation where I want to securely and privately find the intersection of a list I have, with a list someone else has – but without either of us seeing one another's full list.

For example, I want to find discover customers with a prospective partner. But they don't want to send us their complete customer list, and we don't want to send ours to them.

There is CS research into the topic of "Private Set Intersection Protocols", but I haven't found a working implementation. And I could pay an accounting or legal firm to do this manually, but that will cost more money and time than it's usually worth.

Have you ever heard of somebody doing an escrow service for this, where party A and party B each send their lists to the service, the esrvice verify the size of each list (so we're not just spamming the service), and then they send back the matches to both parties? Or a tool simple enough to send to someone else to run locally?



What about exchanging the hashes of the customers? A common hash would mean a common customer.

"What if he sends back my list of hashes, although he has no common customer?"

I would do as follows:

1. Both pick a hashing function (f and g).

2. Both share their hashing function.

3. Both hash their list with two functions.

4. Both send the list of hashes computed with the other's function.

5. Common hashes <=> common customer.

For example, my list: Amanda, Patrick

His list: Patrick, Stanley

I choose f, he chooses g.

My list |> f = 12, 54 (mine)

My list |> g = 36, 2 (public)

His list |> f = 54, 7 (public)

His list |> g = 2, 68 (his)

I compare both lists with f, and he does the same with g. I conclude the only common customer is Patrick. He would not be able to send 54 without knowing Patrick or reverse-engineering the function (or being pretty lucky).


That scheme has the weakness that if the space of possible values is easy to enumerate (e.g. by buying lists of leaked email addresses off the black market), you can hash them all and find out whether they're on the other person's list without their knowledge, and you can do that at any later point without their cooperation.


What possible solution to this problem isn’t vulnerable to the first part of this, tho?


This is essentially a "secure multi-party computation" problem: if you had a trusted third party that both sides can communicate with using encrypted channels, they could just send their lists to the third party and get the intersection back.

There are standard techniques in secure multi-party computation that can be used to eliminate the need for a trusted third party by replacing it with a cryptographic communication protocol that can only be executed by all participants acting together, for example garbled circuits [0]. Basically, one participant creates an encrypted representation of the computation as a circuit and sends it to the other participant, who can execute the circuit on encrypted inputs to get an encrypted result, which only the sender of the circuit can decrypt, so they need to be sent back. IIRC, there's no way to prevent the sender from simply not sharing the decrypted results with the receiver, but this is super obvious, so it can't happen in secret. Because the encryption prevents either party from executing the complete computation on their own, it's also not possible to secretly try multiple inputs to extract additional information.

The one attack this can't protect against is if one party lies about the contents of the list. (Note that this is different from reporting one's own list honestly, but trying to also get the other list.) While this can't be prevented, you could require the size of each input list to be part of the output, which would make this kind of cheating detectable as well.

I'm sure the research literature on this problem also has some more specific techniques that would be preferable to the generic approach described above.

[0] https://en.wikipedia.org/wiki/Garbled_circuit


Yeah that’s a good technique: unfortunately yorwaba’s point applies for my situation — sending the hashed list makes it vulnerable to offline attack by enumerating the source data. That is, my partner can take my hashed list, generate all possible customer IDs, and then recover my full list.

If there were functions such that we could keep our hash functions private, and yet f(g(x)) = g(f(x)), that would solve it, which I guess is what the fancy techniques accomplish.


This is one of the many papers on the topic of algorithms to compute set intersection while revealing little or no additional information - but I haven't seen anyone running this as a service or writing a tool to make it possible for two parties:

https://pdfs.semanticscholar.org/389d/1af665a5111306e1f118ff...


There is some similar work in the space of checking if the credentials being used by the user are compromised or not.

Example: https://arxiv.org/pdf/1905.13737.pdf


I haven’t seen an off the shelf thing. It might be possible with PKI incantations but the first that comes to mind without a third party is homomorphic encryption.

Libs like MS SEAL only do limited numeric operations. Without reading the research, that seems fine. Hash each name to a number. Encrypt, then do a compare. (Lots of hand waving here).

Another possible scheme are things like the Google safe browsing update/client side protocol. There you’re working against hashes only. Depending on needs it may be good enough.


If you can tolerate some error, a bloom filter might work.

Both parties make their own bloom filter and send the results to the other party.

The good part is no personal data goes inside. I mean, you cannot tell beforehand if someone is a customer until you try to store their data into the filter.

Bad news: bloom filters have erros. So, before doing this better guess the monetary value of the error.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: