Hacker News | liberalgeneral's comments

Z-Library has been innovating a great deal in that regard. Sadly they are not as open/sharing as LibGen mirrors in giving back to the community (in terms of database dumps, torrents, and source code).


I think Sci-Hub is the opposite since 1 DOI = 1 PDF in its canonical form (straight from the publisher) so neither duplication nor low-quality is the case.


It does depend on when the work was published. Pre-digital works scanned in without OCR can be larger in size. That's typically works from the 1980s and before.

Given the explosion of scientific publishing, that's likely a small fraction of the archive by work though it may be significant in terms of storage.


I don't think you deserved the downvotes, and I don't think it's a bad idea either; indeed, some coordination as to how to seed the collection is really needed.

For instance phillm.net maintains a dynamically updated list of LibGen and Sci-Hub torrents with less than 3 seeders so that people can pick some at random and start seeding: https://phillm.net/libgen-seeds-needed.php


Database dumps are available here if you are interested: http://libgen.rs/dbdumps/

libgen_compact_* is what you are probably looking for, but they are all SQL dumps so you'll need to import them into MySQL first. :/


The dumps are not enough; one has to scan the actual file content to assess the quality.

Are you alone in your analysis, or are there groups who try to improve LG?


Such efforts have been made in the past, but each time they stalled at some point due to the complexity involved. A working group could be formed to tackle it, though.



Yes, that makes you a data hoarder. Normal people would just use one of the many other methods of getting free books, like legal libraries, googling it on Yandex, torrents, asking a friend, etc. Or just actually pay for a book.


My target audience is not normal people though, and I don't mean this in the "edgy" sense. The fact that we are having this discussion is very abnormal to begin with, and I think it's great that there are some deviants from the norm who care about the longevity of such projects.

I can imagine many students and researchers hosting a mirror of LibGen for their fellows for example.


In that case, just pay whatever it costs to store the data. With AWS Glacier it would cost $50 a month.
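For a rough sanity check (both figures below are assumptions: the archive size is a guess, and the per-GB rate varies by storage class and region):

```python
# Back-of-the-envelope monthly storage cost.
# Assumptions: ~50 TB archive, ~$0.001/GB-month (roughly the Glacier
# Deep Archive ballpark; actual pricing varies by region and class).
archive_tb = 50
rate_per_gb_month = 0.001

monthly_cost = archive_tb * 1_000 * rate_per_gb_month  # TB -> GB (decimal)
print(f"${monthly_cost:.0f}/month")  # $50/month
```

Retrieval fees and egress would come on top of that, so $50/month only covers keeping the bits cold.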


> While I'd love to mirror whole archive locally, it would really be superfluous because I can only read a couple of quality books at a time anyway, [...]

I'd love to agree but as a matter of fact LibGen and Sci-Hub are (forced to be) "pirates" and they are more vulnerable to takedowns than other websites. So while I feel no need to maintain a local copy of Wikipedia, since I'm relatively certain that it'll be alive in the next decade, I cannot say the same about those two with the same certainty (not that I think there are any imminent threats to either, just reasoning a priori).


Speaking of mirroring, is there a way to download one big "several-hundred-GB" blob with the full content of the sites for archival purposes?

Surely that would act as a failsafe to your problem.


I think it's split into several different torrents since it's so big.


Well when a site claims it's for scientific research articles, and you search for "Game Of Thrones" and find this:

https://libgen.is/search.php?req=game+of+thrones&lg_topic=li...

Someone's going to prison eventually, like The Pirate Bay founders. It's only a matter of time.


First, Sci-Hub != LibGen. They are allied projects that clearly share a support base, but they are not identical.

Second, please provide a citation for the assertion that sharing copies of printed fiction erodes sales volume. At this point, one may assume that anything that helps to sell computer games and offline swag is cash-in-bank for content producers. Whether original authors get the same royalties is an interesting question.

Third, the former Soviet milieu probably isn't currently in the mood to cooperate with western law enforcement.


Even what you call LibGen isn't LG. These are LG forks, actually running against LG while pretending to be LG. LG was set up so that other libraries could be created on its basis. Each of the forks fights aggressively for its own dominance in every way, and they resist the development of other forks by calling themselves LG and funneling all the funds into personal possession without public reporting. Being forks themselves, they have closed the open project for their own ambitions and for a money grab.

Their values are incompatible with LG's, and all that remains in common is the outward part of letting people download books, without which there would be nothing useful to look at.

Yeah, and the herculean work is actually done outside such aggregators, by myriads of smaller collections digitizing, binding, processing, collecting, and channeling millions of handmade books into rivers of literature, for free and ready to grab. The growth is global and has nothing to do with what the forks do.

Sorry to say.


One might suppose it's not organizationally transparent for good reason.


It used to be for a good reason, indeed, but not any longer.


> Has anyone ever stumbled across an executable on LibGen? The article mentioned finding them but I've never seen one.

Here is a list of .exe files in LibGen: https://paste.debian.net/hidden/1c82739a/

And a breakdown of file extensions: https://paste.debian.net/hidden/579e319c/
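For the curious, a breakdown like that takes only a few lines (the filename list here is a stand-in; the real input would come from the database dump's metadata):

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical input: one filename per entry, e.g. extracted from a dump.
filenames = ["book.pdf", "paper.pdf", "scan.djvu", "setup.exe", "notes.epub"]

# Normalize extensions to lowercase; files without one get "(none)".
ext_counts = Counter(
    PurePosixPath(name).suffix.lower().lstrip(".") or "(none)"
    for name in filenames
)
for ext, n in ext_counts.most_common():
    print(ext, n)
```

On the real corpus the same Counter approach works unchanged; only the input source differs.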

> And would be nice to have a local copy of most of the books.

Yes! That was my intention—I wasn't advocating for a purge of content but a leaner and more practical version would be amazing.


> Yes! That was my intention—I wasn't advocating for a purge of content but a leaner and more practical version would be amazing.

Your piece doesn't make that obvious at all, and given how many people here are misunderstanding that point, you might want to update it.


You are right, added a paragraph at the end.


So, 1000 exes and 500 isos (that may be problematic, but most probably aren't). Everything else seems to be what one would expect.

That's way cleaner than I could possibly expect. Do people manually review suspect files?


Thanks for the lists. I was genuinely curious about the exes. Nice to know where they originate. Interesting that over half of them have titles in Cyrillic. I guess not so many English language textbooks (with included CDs) have been uploaded with the data portion intact.


Thank you for your efforts!

To be clear, I am not advocating for the removal of any files larger than 30 MiB (or any other arbitrary hard limit). It'd of course be great to flag large files for further review, but sadly the current software doesn't do a great job at crowdsourcing these kinds of tasks (another one being deduplication).

Given the very limited volunteer-power, I'm suggesting that a "lean edition" of LibGen could still be immensely useful to many people.


Files are a very bad unit to elevate in importance, and number of files or file size are really bad proxy metrics, especially without considering the statistical distribution of downloads (let alone the question of what is more "important"!). E.g., junk that's smaller than the size limit is implicitly being valued over good content that happens to be larger. Textbooks and reference books will likewise get filtered out with higher likelihood, and that would screw students in countries where they cannot afford them (which might arguably be a more important audience to some, compared to those downloading comics). Etc.

After all this, the most likely human response from people who really depend on this platform would be to slice big files into volumes under the size limit. That seems like a horrible UX downgrade in the medium to long term, for no reason other than satisfying some arbitrary metric of legibility[1].

Here's a different idea -- might it be worthwhile to convert the larger files to better-compressed formats, e.g. PDF -> DJVU? This would lead to duplication in the medium term, but if one sees a convincing pattern of users switching to the compressed versions without needing to come back to the larger ones, that would imply that the compressed version works, and the larger version could eventually be garbage-collected.

Thinking in an even more open-ended manner, if this corpus is not growing at a substantial rate, can we just wait out a decade or so of storage improvements before this becomes a non-issue? How long might it take for storage to become 3x, 10x, 30x cheaper?

[1]: https://www.ribbonfarm.com/2010/07/26/a-big-little-idea-call...


> can we just wait out a decade or so of storage improvements before this becomes a non-issue?

I'm not sure that there is anything on the horizon which would make duplicate data a 'non-issue'. Capacities are certainly growing, so within a decade we might see 100TB HDDs available and affordable 20TB SSDs. But that does not solve the bandwidth issues. It still takes a long, long time to transfer all the data.

The fastest HDD is still under 300MB/s which means it takes a minimum of 20 hours to read all the data off a 20TB HDD. That is if you could somehow get it to read the whole thing at the maximum sustained read speed.
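That back-of-the-envelope figure checks out (a sketch; 300 MB/s is an optimistic sustained rate, and decimal units are assumed):

```python
# Minimum time to read a full HDD at its best sustained speed.
capacity_tb = 20       # drive capacity in TB
speed_mb_s = 300       # optimistic sustained read speed in MB/s

capacity_mb = capacity_tb * 1_000_000   # 1 TB = 1,000,000 MB (decimal)
hours = capacity_mb / speed_mb_s / 3600
print(f"{hours:.1f} hours")  # ~18.5 hours, i.e. "a minimum of ~20 hours"
```

Real-world figures are worse, since no drive sustains its peak rate across the whole platter.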

SSDs are much faster, but it will always be easier to double the capacity than it is to double the speed.


The problem isn't the technology, it's the cost. Given a far larger budget, you wouldn't run the hard drives anywhere near capacity, in order to gain a read-speed advantage by running a ton of them in parallel. That would let you read 20 TB in an hour if you can afford it. Put it this way: Netflix is able to serve 4K video, and that's far more intensive.


There are people who contribute to the LibGen ecosystem, but unfortunately in areas that don't really benefit the community. Users don't need another CLI tool for LibGen, nor does the community need another bot. Unfortunately that's what folks do: make extensions, CLI tools, and bots that benefit next to no one, and release them willy-nilly with no support.


So what have you done that benefits the community and why do you think you get to make that decision for others?


I'm not the OP who's calling for reducing bloat. I don't use LibGen enough to care one way or the other.


If you are referring to my duplication comments, sure (but even then I believe there are duplicates of the exact same edition of the same book). Though the filtering by filesize is orthogonal to editions etc. so has nothing to do with that.


I agree. There are duplicates. I have seen it.

I have found the same book as multiple PDFs of different sizes but with the same content. Maybe someone uploaded a poorly scanned PDF when the book was first released, and later someone else uploaded an OCRed version, but the first one just stayed, hogging a large amount of storage.


How do you automate the process of figuring out which version is better? It's not safe to assume the smaller versions are always better, nor the inverse. Particularly for books with images, one version of the book may have passable image quality while the other compressed the images to jpeg mush. And there are considerations that are difficult to judge quantitatively, like the quality of formatting. Even something seemingly simple like testing whether a book's TOC is linked correctly entails a huge rats nest of heuristics and guesswork.


My usual heuristic is to take the version with the largest number of pages, or if there are several with the same number of pages, the one with the largest filesize. Obviously if someone is gaming this it won't work; it's trivial to insert mountains of noise into a PDF.
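As a sketch of that heuristic (the `pages` and `size` fields are hypothetical metadata names, not LibGen's actual schema):

```python
# Pick the "best" duplicate: most pages wins; file size breaks ties.
# `candidates` is a list of dicts with hypothetical pages/size fields.
def pick_version(candidates):
    return max(candidates, key=lambda b: (b["pages"], b["size"]))

books = [
    {"title": "raw scan", "pages": 410, "size": 95_000_000},
    {"title": "ocr",      "pages": 412, "size": 12_000_000},
]
print(pick_version(books)["title"])  # "ocr": page count outranks size
```

As noted, this is easy to game: padding a PDF with noise inflates the tiebreaker without improving the content.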


I don’t think anyone is arguing it can be fully automated, but automating the selection of books to manually review is certainly viable.


I usually prefer the scanned PDF in these cases, because the OCRed version often contains errors, and in cases where the book matters, those errors can be very difficult to detect (incorrect superscripts in equations and things like that). Sometimes it's so poorly scanned that I don't prefer the scan (especially a problem with scans by Google Books).


As the previous reply said, I've also seen duplicates while browsing. Would it be possible to let users flag duplicates somehow? It involves human unreliability, which is like automated unreliability, only different.


Here is the raw data if you are interested: https://paste.debian.net/hidden/77876d00/


Thanks. Here is a logarithmic plot as SVG: https://files.catbox.moe/zbf35r.svg

On second thought, a logarithmic histogram might convey even more information, but that would require the full list of file sizes to recompute the bins.
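For what it's worth, log-spaced binning only needs the raw sizes. A pure-Python sketch (the bin granularity of two bins per decade is an arbitrary choice):

```python
import math

def log_bins(sizes, bins_per_decade=2):
    """Count values into logarithmically spaced bins.

    Returns {lower_edge: count}, where lower_edge is in the same
    unit as the input (e.g. bytes).
    """
    counts = {}
    for s in sizes:
        if s <= 0:
            continue  # log10 is undefined for non-positive sizes
        b = math.floor(math.log10(s) * bins_per_decade)
        counts[b] = counts.get(b, 0) + 1
    return {10 ** (b / bins_per_decade): n for b, n in sorted(counts.items())}

print(log_bins([1_500, 2_000, 40_000, 5_000_000]))
```

Feeding the full size list from the raw data dump into this would give the histogram directly.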


Huh, this distribution is not the power law I would have expected. Maybe because it's limited to one media type (books)?


Well it's a log graph.


Yeah, and a power law type distribution would be a straight line on a log-log plot, which this is not.

