Archive.org has a special exemption under US copyright law, I believe, and other parties would not have the same exemption.
I imagine it would be very difficult for someone to start a website that hosts a lot of copyrighted material and claim they are genuinely archiving it. If that defense were feasible, every piracy site would use it.
I am not a copyright lawyer, and I welcome correction on this.
Developing an archival API for those who want their site archived is perfectly fine, though this is probably what robots.txt is for.
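For what it's worth, a crawler that wants to honor that kind of opt-out doesn't need a dedicated archival API; checking robots.txt before fetching is usually enough. Here is a minimal sketch in Python using only the standard library. The user-agent string and URLs are placeholders I made up for illustration, not anything a real archiver necessarily uses:

    from urllib import robotparser

    # Placeholder crawler name and site, purely for illustration.
    USER_AGENT = "example-archiver"
    SITE = "https://example.com"

    # Fetch and parse the site's robots.txt once.
    rp = robotparser.RobotFileParser()
    rp.set_url(SITE + "/robots.txt")
    rp.read()

    # Only archive a page if the robots rules allow this user agent to fetch it.
    for path in ("/", "/private/page.html"):
        url = SITE + path
        if rp.can_fetch(USER_AGENT, url):
            print("would archive", url)
        else:
            print("skipping (disallowed by robots.txt)", url)

The nice thing about robots.txt is that it's an existing, widely understood opt-out, so a site owner doesn't have to implement anything new to be excluded.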
I don't think there is such a thing as a special exception to copyright law given to one website.
There is fair use.
I think what happened with archive.org is that it became popular, and it also became popular to think of it as fair use. It's a social phenomenon of acceptance that does not have any legal bearing.
Companies that don't want their stuff 'archived' can and do take action to enforce laws about digital libraries. For example, the book that helped teach me programming was Turbo Pascal DiskTutor. You cannot simply download that one from archive.org; you have to get on a waiting list and 'borrow' it when it's available.
The fact that they apparently make exactly one digital copy available for 'borrowing' makes me feel that the digital library laws are invalid. It should not be legal to enforce a limit of one copy total when it would be just as easy to make it ten.
Anyway, there are lots of sites like YouTube that would not exist without encouraging users to violate copyright. This was the whole reason YouTube got big in the first place. It was only after they had a massive library of content and users that they started really playing ball with distribution companies.
Archive.org has a special exemption granted by the Library of Congress to break copyright protection for the purpose of making archives, and I believe it does so only when the copyright holder cannot be identified to ask for permission after a reasonable attempt has been made. It does not have a blank check to archive the entire Internet without permission -- if it did, you'd be able to read every New York Times article published since they went online there.
No, the Archive collects web pages that are publicly available. "We do not archive pages that require a password to access, pages that are only accessible when a person types into and sends a form, or pages on secure servers. Pages may not be archived due to robots exclusions and some sites are excluded by direct site owner request."
They don't really "get around" it; as mentioned in the parent comment, they worked to get legal exceptions for many things. As you point out, though, I imagine they wouldn't have gotten those exceptions had they not been a 501(c)(3) org.