
"would be required on a per user basis or track the access rights along with the content, which is infeasible and does not scale"

Citation needed.

Most enterprise (homegrown or not) search engine products have to do this, and have been able to do it effectively at scale, for decades at this point.

This is a very well known and well-solved problem, and the solutions are very directly applicable to the products you list.

It is, as they say, a simple matter of implementation - if they don't offer it, it's because they haven't had the engineering time and/or customer need to do it.

Not because it doesn't scale.



If you're stringing together a bunch of MCPs, you probably also have to string together a bunch of authorization mechanisms. Try having your search engine confirm, live, each person's access to each possible row.

It's absolutely a hard problem, and it isn't well solved.


Yes, if you try to string together 30 systems with no controls and implement controls at the end, it can be hard and slow - "this method I designed to not work doesn't work" is not very surprising.

But the reply I made was to "This means Vector databases, Search Indexes or fancy "AI Search Databases" would be required on a per user basis or track the access rights along with the content, which is infeasible and does not scale."

I.e., information retrieval.

Access control in information retrieval is very well studied.

Making search engines, etc. that effectively confirm user access to each possible record is feasible, common (they don't do it exactly this way, but the result is the same), and scalable.
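To make "track the access rights along with the content" concrete, here's a minimal sketch (my own illustration, not any vendor's actual API): each document carries an ACL in the index, and query-time filtering intersects the hit set with the querying user's principals, so unauthorized records never appear in results.

```python
# Minimal sketch of document-level ACLs stored alongside content in an
# inverted index, with filtering at query time. Hypothetical class names;
# real engines (e.g. enterprise search products) do this more efficiently,
# but the result is the same.
from collections import defaultdict


class AclIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids
        self.acls = {}                    # doc id -> set of allowed principals

    def index(self, doc_id, text, allowed):
        # Access rights are tracked with the content at index time.
        self.acls[doc_id] = set(allowed)
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term, principals):
        # Keep only hits whose ACL overlaps the user's principals
        # (user id plus group memberships from their token).
        hits = self.postings.get(term.lower(), set())
        return {d for d in hits if self.acls[d] & set(principals)}


idx = AclIndex()
idx.index("doc1", "quarterly revenue report", {"group:finance"})
idx.index("doc2", "quarterly all hands notes", {"group:everyone"})

# A user outside finance only sees doc2:
print(idx.search("quarterly", {"alice", "group:everyone"}))  # {'doc2'}
```

The filter is a set intersection per hit, which is why this scales: it adds a small constant factor per candidate result rather than a per-user copy of the index.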

Hell, we even know how to do private information retrieval with access control in scalable ways.

PIR = the server does not know what the query was or what the result was, but still retrieves the result.

So we know how to make it so that not only does the server not know what a user queried or retrieved, but each querying user can still only access records they are allowed to.

The overhead of this, which is much harder than non-private information retrieval with access control, is only 2-3x in computation. See, e.g., https://dspace.mit.edu/handle/1721.1/151392 for one example of such a system. There are others.

So even if your 2ms retrieval latency were all CPU and zero I/O, it would only become 4-6ms due to this.

If you remove the PIR part, as I said, it's much easier, and the overhead is much, much less, since it doesn't involve tons and tons of computationally expensive encryption primitives (though some schemes still involve some).


I don't know the details, but I know that if I give our enterprise search engine/API a user's token, it only returns documents they are allowed to access.


Do you know papers or technical reports that demonstrate the scalability of authorization-preserving search indexes?

I don't doubt they exist, but what we hear about are the opposite cases, where this was obviously not implemented and sensitive data was leaked.



