I'm not sure why Python web scraping is so popular compared to Node.js web scraping. npm has some very well-made packages for DOM parsing, and since it's JavaScript we get more native-feeling DOM features (e.g. node-html-parser uses querySelector instead of select - it just feels a lot more intuitive). It's super easy to scrape with Puppeteer or just regular HTML parsers on a Lambda.
Having done a lot of web scraping, the thing that often matters is string processing. JavaScript/Node are fairly poor at this compared to Python, and lack a lot of the standard-library ergonomics that Python has developed over many years. Web scraping in Node just doesn't feel productive. I'd imagine Perl is also good for those in that camp. I've also used Ruby, and again it was nice and expressive in a way that JS/Node couldn't live up to. Lastly, I've done web scraping in Swift and that felt similar to JS/Node: much more effort to do data extraction and formatting, not without benefits of course.
I also suspect that DOM-like APIs are somewhat overrated here with regard to web scraping. In JS/Node you'd only have an emulation of the DOM APIs, or you're running a full web browser (which is a much bigger ask in terms of resources, deployment, performance, etc.), and to be honest, lxml in Python is nice and fast. I generally found XPath much better than CSS selectors for X(HT)ML parsing, and XPath support is widely available across a lot of different ecosystems.
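For a sense of what I mean, a rough sketch with lxml and XPath (the URL and expressions are placeholders, not from any real site):

```python
# Rough sketch: fetch a page and extract data with lxml + XPath.
# The URL and XPath expressions below are placeholders.
import requests
from lxml import html

resp = requests.get("https://example.com/products")
tree = html.fromstring(resp.content)

# XPath can express structural relationships (siblings, ancestors, text tests)
# that are awkward or impossible with CSS selectors.
for product in tree.xpath('//div[contains(@class, "product")]'):
    title = product.xpath('.//h2/text()')
    price = product.xpath('.//span[@class="price"]/text()')
    print(title, price)
```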
Best web scraping guy I ever met (the type you hire when no one else can figure out how) was a Perl expert. I don’t know Perl so I don’t know why, but this is very real.
Yeah, I'm not surprised. My previous company had scrapers for ~50 of our suppliers (so they didn't need to integrate with us), and I worked on them on and off for 7 years. It was a very different type of work from the product/infra work I spent most of my time doing.
One scraper is often not hugely valuable, most companies I've seen with scrapers have many scrapers. This means that the time investment available for each one is low. Some companies outsource this, and that can work ok. Then scrapers also break. Frequently. Website redesigns, platform moves, bot protection (yes, even if you have a contract allowing you to scrape, IT and BizDev don't talk to each other), the site moving to needing JavaScript to render anything on the page... they can all cause you to go back to the drawing board.
The concept of "tech debt" kinda goes out of the window when you rewrite the code every 6 months. Instead the value comes from how quickly you can write a scraper and get it back in production. The code can in fact be terrible because you don't really need to read it again, automated testing is often pointless because you're not going to edit the scraper without re-testing manually anyway. Instead having a library of tested utility functions, a good manual feedback loop, and quick deployments, were much more useful for us.
To me it's mainly the following three reasons, but take it with a grain of salt since my JS is not as fluent as my Python.
1. The async nature of JS is surprisingly detrimental when writing scraping scripts. It's hard to describe, but it makes keeping a mental image of the whole code base or workflow harder. Writing mostly sync code and only using things like ThreadPoolExecutor (not even threading directly) when necessary has made it much easier for me to write clean, easy-to-maintain code (see the sketch after this list).
2. I really don't like the syntax of loops or iterations in JS, and there are a lot of them in web scraping.
3. String processing and/or data re-shaping feels harder in JS. The built-in functions often feel unintuitive.
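To illustrate point 1, this is roughly the style I mean (the URLs and fetch logic are placeholders):

```python
# Mostly-sync scraping loop, with ThreadPoolExecutor only for the network calls.
# URLs are placeholders.
from concurrent.futures import ThreadPoolExecutor
import requests

urls = [f"https://example.com/page/{i}" for i in range(1, 11)]

def fetch(url):
    # Plain synchronous function; easy to reason about in isolation.
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return url, len(resp.text)

# Concurrency is confined to this one spot; the rest of the script stays sequential.
with ThreadPoolExecutor(max_workers=5) as pool:
    for url, size in pool.map(fetch, urls):
        print(url, size)
```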
I've had some experience with Selenium and now I'm using Puppeteer, and I honestly don't see the problem with JS. It's true that I don't have much coding experience, but it seems to me that Puppeteer + Flask serving ML to extract the data takes the cake. Also, being able to play around evaluating expressions in Puppeteer, etc., makes it manageable.
Maybe I lack experience but I don't see JS being a barrier.
I would like to know what kind of string work you are doing. I can't imagine being dependent on parsing strings and such; that looks very easy to break, even easier than the CSS-selector dance.
String processing in general is just easier in Python.
As a basic example, `lstrip()` and `rstrip()` trim whitespace by default, but can also remove any characters you'd like from the end(s) of your string.
You'd need to code that up yourself in JS. Happens quite a lot.
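A couple of throwaway examples (values made up) of what those built-ins buy you:

```python
# strip()/lstrip()/rstrip() handle whitespace by default,
# or any set of characters you pass in. Values here are made up.
s = "  $1,299.00 \n"
print(s.strip())                   # "$1,299.00"
print(s.strip().lstrip("$"))       # "1,299.00"
print("trailing...".rstrip("."))   # "trailing"
```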
One of the reasons is that after you scrape it you want to do something with the data: put it in Postgres/SQLite, save it to disk, POST it to some web server, extract some stats from it and write them to a CSV, ...
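And in Python that post-scrape step is mostly standard library. A rough sketch (the table and field names are invented for the example):

```python
# Dump scraped records into SQLite and a CSV with only the standard library.
import csv
import sqlite3

records = [
    {"name": "Widget A", "price": 9.99},
    {"name": "Widget B", "price": 14.50},
]

with sqlite3.connect("scrape.db") as db:
    db.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")
    db.executemany("INSERT INTO products (name, price) VALUES (:name, :price)", records)

with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(records)
```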
Perhaps more of the people who need to run this kind of data scraping operation are comfortable with Python. Data scientists, operations personnel, etc.
I've been using Perl and Python for 30 years, and JS for a few weeks scattered across those same years.
Because it's been around longer. Beautiful Soup was first released in 2004 according to its wiki page, and I'm sure there were plenty of libraries before it.
> I'm not sure why Python web scraping is so popular compared to Node.js web scraping
Take this with a grain of salt, since I am fully cognizant that I'm the outlier in most of these conversations, but Scrapy is A++ the no-kidding best framework for this activity that has been created thus far. So, if there were a scrapyjs maybe I'd look into it, but there isn't (that I'm aware of), so here we are. This comes up in every "well, I just use requests & ..." conversation, and if one is happy with main.py and a bunch of requests invocations, I'm glad for you, but I don't want to try to cobble together all the side-band stuff that Scrapy and its ecosystem provide for me in a reusable and predictable way.
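For anyone who hasn't used it, a Scrapy spider is roughly this shape (the spider name, URL, and selectors below are placeholders); scheduling, retries, throttling, and the item pipelines all live in the framework around it:

```python
# Rough sketch of a Scrapy spider; the name, URL, and selectors are placeholders.
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for product in response.css("div.product"):
            # Yielded items flow through Scrapy's pipelines and feed exports
            # (JSON/CSV/DB) instead of ad-hoc glue code.
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
            }
        # Following links goes back through Scrapy's scheduler.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```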
Also, those conversations often conflate the server-side language with the "scrape using a headed browser" language, which happens to be the same one. So, if one is using cheerio <https://github.com/cheeriojs/cheerio> then sure, Node can be a fine thing; if the blog post is all "fire up Puppeteer, what can go wrong?!" then that's the road to ruin of doing battle with all kinds of detection problems, since it's kind of a browser but kind of not.
I, under no circumstances, want the target site running their JS during my crawl runs. I fully accept responsibility for reproducing any XHR or auth or whatever to find the 3 URLs that I care about, without downloading every thumbnail and marketing JS and beacon and and and. I'm also cognizant that my traffic will thus stand out, since it uniquely does not make the beacon and marketing calls, but my experience has been that I get the ban hammer less often with my targeted fetches than when trying to pretend to be a browser with a human on the keyboard/mouse when there isn't one.
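Concretely, that usually means replaying the XHR the page would make and taking the JSON directly; something like this sketch (the endpoint, params, and headers are purely illustrative):

```python
# Rough sketch: hit the site's own XHR endpoint instead of rendering the page.
# The endpoint, params, and headers are illustrative, not a real API.
import requests

session = requests.Session()
session.headers.update({"User-Agent": "my-crawler/1.0"})
# Whatever cookie/auth handshake the real site needs would happen here first.

resp = session.get(
    "https://example.com/api/search",
    params={"q": "widgets", "page": 1},
)
resp.raise_for_status()

# One small JSON payload instead of the full page plus thumbnails and beacons.
for item in resp.json().get("results", []):
    print(item)
```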