better web scraping with node.js

by Christian Fei @ 2019-12-11

#post #js #featured #javascript #general 

Web scraping?

Web scraping [...] is used for extracting data from websites

from Wikipedia "Web scraping"

introducing mega-scraper

it's been a month since I started working on mega-scraper.

mega-scraper is meant to make scraping a website more reliable and faster.

it uses the popular Puppeteer API to drive a Chromium instance, a real web browser.
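
as a rough sketch (not mega-scraper's actual code), driving Chromium through Puppeteer can look something like this. `scrapeText` is a made-up helper name, and the puppeteer module is passed in as an argument only to keep the example easy to test.

```javascript
// Hypothetical helper, not part of mega-scraper.
// `puppeteer` is the module you get from `require('puppeteer')`.
async function scrapeText (puppeteer, url, selector) {
  // headless: false opens a visible browser window, handy while debugging
  const browser = await puppeteer.launch({ headless: false })
  try {
    const page = await browser.newPage()
    // wait until the network is mostly idle before reading the DOM
    await page.goto(url, { waitUntil: 'networkidle2' })
    // the callback runs inside the page; return the trimmed text of the first match
    return await page.$eval(selector, el => el.textContent.trim())
  } finally {
    // always close the browser, even if navigation or the selector fails
    await browser.close()
  }
}
```

usage would be along the lines of `scrapeText(require('puppeteer'), 'https://example.com', 'h1').then(console.log)`.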

the scraping queue is based on Redis and can be monitored using bull-dashboard.
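
to give an idea of what a Redis-backed scraping queue involves, here is a small sketch. it assumes the queue is a Bull queue (which the bull-dashboard name suggests, but treat that as an assumption); `enqueueUrls` and `startWorker` are hypothetical names, and the retry settings are arbitrary.

```javascript
// Hypothetical helpers. `queue` is assumed to be a Bull queue, e.g.
//   const Queue = require('bull')
//   const queue = new Queue('scrape', 'redis://127.0.0.1:6379')
function enqueueUrls (queue, urls) {
  return Promise.all(urls.map(url =>
    // retry a failing page up to 3 times, with exponential backoff starting at 5s
    queue.add({ url }, { attempts: 3, backoff: { type: 'exponential', delay: 5000 } })
  ))
}

function startWorker (queue, scrape, concurrency = 2) {
  // handle up to `concurrency` jobs in parallel;
  // whatever `scrape` resolves to is stored as the job's result
  queue.process(concurrency, job => scrape(job.data.url))
}
```

since the jobs live in Redis, a crashed worker can pick up where it left off, and a dashboard can read the same queue to show progress.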

I built it because I felt the need for a better way to do scraping.

reliable scraping

how to make scraping more reliable and less detectable by anti-scraping shields?

I think the way to go is to simulate a real user using a real browser.

it also comes in handy when debugging: you can inspect updated CSS selectors, understand how to avoid unexpected modals, or solve captcha pages.

you could even simulate a legit user session by having a pool of legit cookies.
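
a cookie pool could be sketched roughly like this. the names `pickSession` and `applySession` are made up for illustration, and the pool is assumed to be an array of sessions, each one an array of cookies in Puppeteer's format (`name`, `value`, `domain`, ...).

```javascript
// Hypothetical session pool helpers, not mega-scraper's API.
// `random` is injectable so the choice can be made deterministic in tests.
function pickSession (pool, random = Math.random) {
  if (pool.length === 0) throw new Error('cookie pool is empty')
  return pool[Math.floor(random() * pool.length)]
}

async function applySession (page, pool, random = Math.random) {
  const cookies = pickSession(pool, random)
  // page.setCookie takes the cookies as separate arguments
  await page.setCookie(...cookies)
  return cookies
}
```

you would call `applySession(page, pool)` before `page.goto(...)`, so the first request already carries the session.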

the possibilities widen if you browse a website as much like a real user as possible, for example on a product page, with eased step timeouts, random scrolling of the page, etc.
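
eased timeouts and random scrolling might look like the sketch below. everything here is hypothetical (helper names, pixel ranges, delay ranges); it only assumes Puppeteer's `page.evaluate` to scroll inside the page.

```javascript
// Hypothetical helpers, not mega-scraper's API.
// Standard quadratic ease-in-out: maps t in [0, 1] to [0, 1].
const easeInOut = t => (t < 0.5 ? 2 * t * t : 1 - Math.pow(-2 * t + 2, 2) / 2)

// Delays (ms) for `steps` actions, growing from minMs to maxMs along the ease curve.
// A random jitter of up to half of minMs is added, so values can slightly exceed maxMs.
function easedDelays (steps, minMs, maxMs, random = Math.random) {
  return Array.from({ length: steps }, (_, i) => {
    const t = steps === 1 ? 0 : i / (steps - 1)
    const base = minMs + (maxMs - minMs) * easeInOut(t)
    return Math.round(base + random() * minMs * 0.5)
  })
}

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Scroll down in random-sized chunks, pausing after each one.
async function humanScroll (page, steps = 5, random = Math.random) {
  for (const delay of easedDelays(steps, 200, 1200, random)) {
    const distance = 200 + Math.floor(random() * 400) // 200..599 px
    await page.evaluate(y => window.scrollBy(0, y), distance)
    await sleep(delay)
  }
}
```

the point is only that no two visits scroll the same distance at the same pace; the exact curve matters much less than avoiding fixed intervals.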

why not even log in to a given page with a real customer account, to scrape its content almost undetectably?

fast scraping

blocking trackers by default.

avoiding loading images, stylesheets and, if possible, javascript speeds up the scraping A LOT!
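
with Puppeteer this kind of blocking is usually done through request interception. the sketch below is an assumption about how it could be wired up, not mega-scraper's implementation; the tracker host list is a tiny made-up sample, and `shouldBlock` / `blockHeavyRequests` are hypothetical names.

```javascript
// Resource types that are skipped entirely.
// Add 'script' only if the target page still works without javascript.
const BLOCKED_TYPES = new Set(['image', 'stylesheet'])

// Hypothetical, far from complete list of tracker hosts, for illustration only.
const TRACKER_HOSTS = ['google-analytics.com', 'googletagmanager.com', 'doubleclick.net']

function shouldBlock (resourceType, url) {
  if (BLOCKED_TYPES.has(resourceType)) return true
  const host = new URL(url).hostname
  // match the tracker domain itself and any of its subdomains
  return TRACKER_HOSTS.some(tracker => host === tracker || host.endsWith('.' + tracker))
}

async function blockHeavyRequests (page) {
  // once interception is on, every request must be aborted or continued explicitly
  await page.setRequestInterception(true)
  page.on('request', request => {
    if (shouldBlock(request.resourceType(), request.url())) request.abort()
    else request.continue()
  })
}
```

call `blockHeavyRequests(page)` right after `browser.newPage()` and before navigating, otherwise the first requests slip through.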

being able to proxy each request can also help with speed, since multiple services end up handling your requests.
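
one simple way to spread requests over several proxies is a round-robin pool. note the swap: Chromium takes its proxy as a launch flag, so this sketch rotates per browser instance rather than per single request. the helper names and proxy URLs are placeholders.

```javascript
// Hypothetical round-robin proxy pool, not mega-scraper's API.
function createProxyPool (proxies) {
  if (proxies.length === 0) throw new Error('no proxies configured')
  let index = 0
  return {
    // return the next proxy, wrapping around at the end of the list
    next () {
      const proxy = proxies[index]
      index = (index + 1) % proxies.length
      return proxy
    }
  }
}

// Build puppeteer.launch options that route the whole browser through `proxy`.
function launchOptionsFor (proxy) {
  return { args: [`--proxy-server=${proxy}`] }
}
```

usage would be something like `puppeteer.launch(launchOptionsFor(pool.next()))` for each new browser you start.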

it's all about experimenting

mega-scraper itself needs lots of improvements and new creative ways to avoid (or even solve) captchas, improve networking, handle generic pagination, automate data extraction and much more.

open-source and npm package

mega-scraper is available on github.com/christian-fei/mega-scraper and can be monitored using github.com/christian-fei/bull-dashboard/.

assets/images/posts/mega-scraper/mega-scraper-github.png

both are available as npm packages 📦

mega-scraper

bull-dashboard

let me know if you find ways to improve web scraping by opening a pull request on GitHub at github.com/christian-fei/mega-scraper, and tell me what you think on Twitter @christian_fei!
