better web scraping with node.js

by Christian Fei @ 2019-12-11


#post #javascript #featured #nodejs #general

Web scraping?

Web scraping [...] is used for extracting data from websites

from Wikipedia "Web scraping"

introducing mega-scraper

it's been a month since I started working on mega-scraper.

mega-scraper is meant to make scraping a website better

it uses the popular Puppeteer API to drive a Chromium instance, a real web browser.

the scraping queue is based on Redis and can be monitored using bull-dashboard.

I built it because I felt the need for a better way to do scraping.
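to give an idea of what a Redis-backed scraping queue can look like, here is a minimal sketch. it assumes the queue is a bull queue (bull-dashboard suggests so, but this is not taken from the mega-scraper source). the queue name, the Redis URL, the job shape and the `enqueueUrls` helper are all made up for the example.

```javascript
// sketch only: `queue` is assumed to be a bull Queue, e.g.
//   const Queue = require('bull')
//   const queue = new Queue('scraping', 'redis://127.0.0.1:6379')
// the queue name, Redis URL and job shape are assumptions for this example.

// options passed to bull's queue.add(data, opts)
const JOB_OPTIONS = {
  attempts: 3, // retry a failed page up to 3 times
  backoff: { type: 'exponential', delay: 1000 }, // wait longer after each failure
  removeOnComplete: true // drop finished jobs to keep Redis small
}

// enqueue one job per URL, skipping blank entries and duplicates
function enqueueUrls (queue, urls) {
  const unique = [...new Set(urls.map(u => u.trim()).filter(Boolean))]
  return Promise.all(unique.map(url => queue.add({ url }, JOB_OPTIONS)))
}
```

a worker would then pick the jobs up with `queue.process(...)` and open each `job.data.url` in the browser.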



reliable scraping

how to make scraping more reliable and less detectable by anti-scraping shields?

I think the way to go is to simulate a real user using a real browser.

a real browser also comes in handy when debugging: you can inspect updated CSS selectors, understand how to avoid unexpected modals, or solve captcha pages.

you could even simulate a legit user session by having a pool of legit cookies.
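a cookie pool could be sketched like this. the pool contents, `pickSession` and `useRandomSession` are hypothetical names for illustration, and `page` is assumed to be a Puppeteer Page (its `setCookie` method is the only Puppeteer call used).

```javascript
// sketch only: `page` is assumed to be a Puppeteer Page.
// the pool below is made-up data; real sessions would come from earlier visits.
const cookiePool = [
  [{ name: 'session', value: 'aaa', domain: 'example.com' }],
  [{ name: 'session', value: 'bbb', domain: 'example.com' }]
]

// pick one session at random; `random` is injectable to make tests repeatable
function pickSession (pool, random = Math.random) {
  if (pool.length === 0) throw new Error('cookie pool is empty')
  return pool[Math.floor(random() * pool.length)]
}

// set the chosen cookies on the page before the first navigation
async function useRandomSession (page, pool, random = Math.random) {
  const cookies = pickSession(pool, random)
  await page.setCookie(...cookies)
  return cookies
}
```

calling `await useRandomSession(page, cookiePool)` before `page.goto(...)` makes each run start with a different stored session.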

you have more options if you browse a website as similarly as possible to a real user looking at a product page: eased timeouts between steps, random scrolling of the page, etc.
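eased timeouts and random scrolling could look roughly like the sketch below. the easing curve, the pause range and the helper names (`scrollPlan`, `humanScroll`) are my own assumptions, not how mega-scraper does it. `page` is assumed to be a Puppeteer Page, and only `page.evaluate` is used.

```javascript
// sketch only: `page` is assumed to be a Puppeteer Page.

// ease-in-out curve: slow start, faster middle, slow end (t goes from 0 to 1)
function easeInOut (t) {
  return t < 0.5 ? 2 * t * t : 1 - Math.pow(-2 * t + 2, 2) / 2
}

// split a total scroll distance into eased steps, each followed by a random pause
function scrollPlan (totalPx, steps, random = Math.random) {
  const plan = []
  let previous = 0
  for (let i = 1; i <= steps; i++) {
    const target = Math.round(totalPx * easeInOut(i / steps))
    plan.push({
      deltaY: target - previous,
      pauseMs: 200 + Math.floor(random() * 600) // 200 to 799 ms
    })
    previous = target
  }
  return plan
}

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// scroll the page step by step, pausing after each step like a reader would
async function humanScroll (page, totalPx = 2000, steps = 8) {
  for (const { deltaY, pauseMs } of scrollPlan(totalPx, steps)) {
    await page.evaluate(y => window.scrollBy(0, y), deltaY)
    await sleep(pauseMs)
  }
}
```

the plan is computed separately from the scrolling so the timing can be inspected or tuned without opening a browser.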

why not even log in to a given page with a real customer account, to scrape its content almost undetectably?

fast scraping

blocking trackers by default.

not loading images, stylesheets and, if possible, javascript speeds up the scraping A LOT!
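with Puppeteer this kind of blocking can be done through request interception. below is a minimal sketch: the tracker host list is a tiny illustrative sample, and `shouldBlock` and `blockHeavyRequests` are hypothetical helper names. `page` is assumed to be a Puppeteer Page.

```javascript
// sketch only: `page` is assumed to be a Puppeteer Page.
// the tracker list is a small illustrative sample, not a complete blocklist.
const BLOCKED_TYPES = new Set(['image', 'stylesheet', 'font', 'media'])
const TRACKER_HOSTS = ['google-analytics.com', 'googletagmanager.com', 'doubleclick.net']

// decide whether a request should be dropped
function shouldBlock (resourceType, url, { blockScripts = false } = {}) {
  if (BLOCKED_TYPES.has(resourceType)) return true
  if (blockScripts && resourceType === 'script') return true
  let hostname
  try {
    hostname = new URL(url).hostname
  } catch (err) {
    return false // not a parseable URL, let the browser deal with it
  }
  return TRACKER_HOSTS.some(host => hostname === host || hostname.endsWith('.' + host))
}

// wire the decision into Puppeteer's request interception
async function blockHeavyRequests (page, options) {
  await page.setRequestInterception(true)
  page.on('request', request => {
    if (shouldBlock(request.resourceType(), request.url(), options)) {
      request.abort()
    } else {
      request.continue()
    }
  })
}
```

`blockScripts` is off by default because many pages render their content with javascript; turn it on only when the data is already in the initial HTML.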

being able to proxy each request can also help with speed, since the requests are spread across multiple services instead of going through a single one.
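a simple way to spread the load is to rotate over a list of proxies. note that this sketch assigns one proxy per browser instance (through Chromium's `--proxy-server` flag) rather than one per request, which is a simplification. the proxy addresses are placeholders, and `createProxyRotation` and `launchOptionsFor` are hypothetical names.

```javascript
// sketch only: the proxy addresses below are placeholders.
const PROXIES = ['http://proxy-1.example:8080', 'http://proxy-2.example:8080']

// hand out proxies in round-robin order
function createProxyRotation (proxies) {
  if (proxies.length === 0) throw new Error('no proxies configured')
  let index = 0
  return () => proxies[index++ % proxies.length]
}

// Chromium takes its proxy from a launch flag, so each browser gets one proxy
function launchOptionsFor (proxy) {
  return { headless: true, args: [`--proxy-server=${proxy}`] }
}

const nextProxy = createProxyRotation(PROXIES)
// with Puppeteer installed:
//   const browser = await puppeteer.launch(launchOptionsFor(nextProxy()))
```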

it's all about experiments

mega-scraper itself needs lots of improvements and new creative ways to avoid (or even solve) captchas, improve networking, handle generic pagination, automate data extraction and much more.

open-source and npm package

mega-scraper is open source, and its queue can be monitored using bull-dashboard.


both are available as npm packages 📦





let me know if you find ways to improve web scraping by opening a pull-request on GitHub, and let me know on Twitter @christian_fei what you think!
