Z-Library has been innovating a great deal in that regard. Sadly they are not as open/sharing as LibGen mirrors in giving back to the community (in terms of database dumps, torrents, and source code).
I think Sci-Hub is the opposite, since 1 DOI = 1 PDF in its canonical form (straight from the publisher), so neither duplication nor low quality is an issue.
It does depend on when the work was published. Pre-digital works scanned in without OCR can be larger in size. That's typically works from the 1980s and before.
Given the explosion of scientific publishing, that's likely a small fraction of the archive by number of works, though it may be significant in terms of storage.
I don't think you deserved the downvotes, and I don't think it's a bad idea either; indeed, some coordination on how to seed the collection is really needed.
For instance phillm.net maintains a dynamically updated list of LibGen and Sci-Hub torrents with less than 3 seeders so that people can pick some at random and start seeding: https://phillm.net/libgen-seeds-needed.php
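For anyone who wants to script that, here's a rough sketch. The page layout is my guess (I'm assuming it exposes direct .torrent links); adjust the extraction to whatever phillm.net actually serves.

    import random
    import re
    import urllib.request

    URL = "https://phillm.net/libgen-seeds-needed.php"

    with urllib.request.urlopen(URL) as resp:
        page = resp.read().decode("utf-8", errors="replace")

    # Assumption: the page links to .torrent files directly; the real layout may differ.
    torrents = re.findall(r'href="([^"]+\.torrent)"', page)

    if torrents:
        print("Seed this one:", random.choice(torrents))
    else:
        print("No .torrent links found; the page layout may have changed.")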
Yes, that makes you a data hoarder. Normal people would just use one of the many other methods of getting free books, like legal libraries, googling it on Yandex, torrents, asking a friend, etc. Or just actually pay for a book.
My target audience is not normal people though, and I don't mean this in the "edgy" sense. The fact that we are having this discussion is very abnormal to begin with, and I think it's great that there are some deviants from the norm who care about the longevity of such projects.
I can imagine many students and researchers hosting a mirror of LibGen for their fellows for example.
> While I'd love to mirror whole archive locally, it would really be superfluous because I can only read a couple of quality books at a time anyway, [...]
I'd love to agree but as a matter of fact LibGen and Sci-Hub are (forced to be) "pirates" and they are more vulnerable to takedowns than other websites. So while I feel no need to maintain a local copy of Wikipedia, since I'm relatively certain that it'll be alive in the next decade, I cannot say the same about those two with the same certainty (not that I think there are any imminent threats to either, just reasoning a priori).
First, Sci-Hub != LibGen. They are allied projects that clearly share a support base, but they are not identical.
Second, please provide a citation for the assertion that sharing copies of printed fiction erodes sales volume. At this point, one may assume that anything that helps to sell computer games and offline swag is cash-in-bank for content producers. Whether original authors get the same royalties is an interesting question.
Third, the former Soviet milieu probably isn't currently in the mood to cooperate with western law enforcement.
Even what you call LibGen isn't LG. These are LG forks that actually work against LG while pretending to be LG. LG was set up so that other libraries could be built on its basis. Each of the forks aggressively fights for its own dominance in every way, and they stifle the development of other forks by calling themselves LG and funneling all the funds into personal possession without public reporting. Being forks themselves, they have closed off the open project for their own ambitions and for a money grab.
Their values are incompatible with LG's, and all that remains in common is the outward-facing part that lets people download books, without which there would be nothing useful to look at.
Yeah, and the herculean work is actually done outside such aggregators, by myriads of smaller collections that digitize, bind, process, collect, and channel millions of handmade books into rivers of literature, free and ready to grab. That growth is global and owes little to what the forks do.
Thanks for the lists. I was genuinely curious about the .exe files. Nice to know where they originate. Interesting that over half of them have titles in Cyrillic. I guess not so many English-language textbooks (with included CDs) have been uploaded with the data portion intact.
To be clear, I am not advocating for the removal of any files larger than 30 MiB (or any other arbitrary hard limits). It'd be great of course to flag large files for further review, but the current software doesn't do a great job at crowdsourcing these kinds of tasks (another one being deduplication) sadly.
Given how little volunteer power there is, I'm suggesting that a "lean edition" of LibGen could still be immensely useful to many people.
Files are a very bad unit to elevate in importance, and number of files or file size are really bad proxy metrics, especially without considering the statistical distribution of downloads (let alone the question of what is more "important"!). E.g., junk that's under the size limit is implicitly being valued over good content that happens to be larger. Textbooks and reference books will likewise get filtered out with higher likelihood, and that would screw students in countries where they cannot afford them (which might arguably be a more important audience to some, compared to those downloading comics). Etc.
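To make that concrete, here's a toy illustration with made-up titles and download counts: a hard size cutoff keeps small junk while discarding the one entry people actually want.

    from dataclasses import dataclass

    @dataclass
    class Entry:
        title: str
        size_mib: int
        downloads: int  # hypothetical download counts

    catalog = [
        Entry("Low-resolution comic scan", 5, 12),
        Entry("1200-page calculus textbook with figures", 180, 9400),
        Entry("Third duplicate of a short novel", 2, 3),
    ]

    SIZE_LIMIT_MIB = 30
    kept = [e for e in catalog if e.size_mib <= SIZE_LIMIT_MIB]
    lost = sum(e.downloads for e in catalog) - sum(e.downloads for e in kept)
    print(f"Kept {len(kept)} of {len(catalog)} files, discarded {lost} downloads' worth of demand.")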
After all this, the most likely human response from people who really depend on this platform would be to slice a big file into volumes under the size limit. Seems to be a horrible UX downgrade in the medium to long term for no other reason than satisfying some arbitrary metric of legibility[1].
Here's a different idea -- might it be worthwhile to convert the larger files to better-compressed versions, e.g. PDF -> DjVu? This would lead to duplication in the medium term, but if one sees a convincing pattern of users switching to the compressed versions without needing to come back to the larger ones, that would imply the compressed version works and the larger version could eventually be garbage-collected.
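A minimal sketch of that conversion step, assuming the pdf2djvu command-line tool is installed; it converts one file and reports the size ratio so you can judge whether the compressed copy is worth keeping. Whether the original can later be garbage-collected would still depend on watching whether readers come back to it.

    import os
    import subprocess
    import sys

    def convert_to_djvu(pdf_path: str) -> str:
        djvu_path = os.path.splitext(pdf_path)[0] + ".djvu"
        # pdf2djvu does the actual conversion; tune its options for scanned vs. born-digital PDFs.
        subprocess.run(["pdf2djvu", "-o", djvu_path, pdf_path], check=True)
        return djvu_path

    if __name__ == "__main__":
        pdf = sys.argv[1]
        djvu = convert_to_djvu(pdf)
        ratio = os.path.getsize(djvu) / os.path.getsize(pdf)
        print(f"{djvu} is {ratio:.0%} of the original's size")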
Thinking in an even more open-ended manner, if this corpus is not growing at a substantial rate, can we just wait out a decade or so of storage improvements before this becomes a non-issue? How long might it take for storage to become 3x, 10x, 30x cheaper?
> can we just wait out a decade or so of storage improvements before this becomes a non-issue?
I'm not sure that there is anything on the horizon which would make duplicate data a 'non-issue'. Capacities are certainly growing, so within a decade we might see 100TB HDDs available and affordable 20TB SSDs. But that does not solve the bandwidth issues. It still takes a long, long time to transfer all the data.
The fastest HDDs are still under 300 MB/s, which means it takes nearly 20 hours to read all the data off a 20TB HDD. And that's only if you could somehow get it to read the whole thing at its maximum sustained speed.
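Back-of-envelope check of that figure:

    capacity_bytes = 20e12        # 20 TB drive
    read_speed = 300e6            # 300 MB/s sustained, optimistic
    hours = capacity_bytes / read_speed / 3600
    print(f"{hours:.1f} hours")   # ~18.5 hours, i.e. the better part of a day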
SSDs are much faster, but it will always be easier to double the capacity than it is to double the speed.
The problem isn't the technology, it's the cost. Given a far larger budget, you wouldn't run the hard drives anywhere near capacity, so that you gain a read-speed advantage by running a ton of them in parallel. That'll let you read 20 TB in an hour if you can afford it. Put it this way: Netflix is able to do 4K video, and that's far more intensive.
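Rough arithmetic behind "20 TB in an hour", showing how many ~300 MB/s drives that implies running in parallel (ignoring RAID and network overhead):

    import math

    capacity_bytes = 20e12
    target_seconds = 3600
    drive_speed = 300e6  # bytes per second per drive

    aggregate = capacity_bytes / target_seconds   # ~5.6 GB/s needed in total
    drives = math.ceil(aggregate / drive_speed)   # ~19 drives in parallel
    print(f"{aggregate / 1e9:.1f} GB/s aggregate -> at least {drives} drives")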
There are people who contribute to the LibGen ecosystem, but unfortunately it's in areas that don't really benefit the community. Users don't need another CLI tool for LibGen, nor does the community need another bot. Unfortunately that's what folks do: make extensions, CLI tools, and bots that benefit next to no one, and release them willy-nilly with no support.
If you are referring to my duplication comments, sure (but even then I believe there are duplicates of the exact same edition of the same book). Filtering by filesize, though, is orthogonal to editions and the like, so it has nothing to do with that.
I have found the same book as multiple PDFs of different sizes with the same content. Maybe someone uploaded a poorly scanned PDF when the book was first released, but later someone else uploaded an OCRed version, and the first one just stayed there, hogging a large amount of storage.
How do you automate the process of figuring out which version is better? It's not safe to assume the smaller versions are always better, nor the inverse. Particularly for books with images, one version of the book may have passable image quality while the other compressed the images to jpeg mush. And there are considerations that are difficult to judge quantitatively, like the quality of formatting. Even something seemingly simple like testing whether a book's TOC is linked correctly entails a huge rats nest of heuristics and guesswork.
My usual heuristic is to take the version with the largest number of pages, or if there are several with the same number of pages, the one with the largest filesize. Obviously if someone is gaming this it won't work; it's trivial to insert mountains of noise into a PDF.
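If anyone wants to apply that heuristic in bulk, here's a minimal sketch (the metadata fields are hypothetical, not LibGen's actual schema, and it inherits the same weakness against padded files):

    from dataclasses import dataclass

    @dataclass
    class Upload:
        md5: str
        pages: int
        size_bytes: int

    def pick_best(duplicates: list[Upload]) -> Upload:
        # Most pages wins; ties are broken by the larger filesize.
        return max(duplicates, key=lambda u: (u.pages, u.size_bytes))

    candidates = [
        Upload("a1...", pages=310, size_bytes=18_000_000),
        Upload("b2...", pages=320, size_bytes=9_000_000),
        Upload("c3...", pages=320, size_bytes=55_000_000),
    ]
    print(pick_best(candidates).md5)  # "c3..." under this heuristic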
I usually prefer the scanned PDF in these cases, because the OCRed version often contains errors, and in cases where the book matters, those errors can be very difficult to detect (incorrect superscripts in equations and things like that). Sometimes it's so poorly scanned that I don't prefer the scan (especially a problem with scans by Google Books).
As the previous reply said, I've also seen duplicates while browsing. Would it be possible to let users flag duplicates somehow? It involves human unreliability, which is like automated unreliability, only different.