Discover the best methods to avoid detection with Puppeteer! Scrape any web page online without getting blocked.
These days, websites use anti-bot systems that can detect scrapers. The best way to ensure a seamless scraping process is to use proper masking methods, like headless browsers.
Puppeteer is a Node.js library that drives headless Chrome and can simulate real user behavior to avoid anti-bots while web scraping. So how do you go about it?
In this article, we’ll discuss the best ways to avoid detection with Puppeteer while scraping. But before that…
What Is Puppeteer?
Puppeteer is a Node.js library that provides a high-level API to control a headless Chromium browser programmatically.
You can easily install it using npm or Yarn, and one of its key benefits is its ability to access and manipulate the DevTools Protocol.
Can Puppeteer be Detected by Anti-Bots?
Yes, anti-bot systems are capable of detecting headless browsers like Puppeteer.
Let’s prove that with a quick scraping example: trying to crawl NowSecure, a website that runs bot-detection tests and tells you whether you’ve passed its protection.
To do that, we’ll first install Node.js, and when that’s done, we’ll run the following command to install Puppeteer:
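```bash
npm install puppeteer
```

(With Yarn, `yarn add puppeteer` does the same.)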
Next up, we’ll create a JavaScript file named `index.js` and run it with `node index.js`.
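A minimal version of that script, assuming the NowSecure test page at `https://nowsecure.nl` as the target, might look like this:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Visit the bot-check test page (URL assumed for this example).
  await page.goto('https://nowsecure.nl');
  // Give the security check a few seconds to run before capturing the result.
  await page.waitForTimeout(5000);

  await page.screenshot({ path: 'nowsecure.png', fullPage: true });
  await browser.close();
})();
```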
So here’s what we did in that example: we used the basic Puppeteer setup to create a new browser page, visit the target website, and take a screenshot after the security check ran.
Here’s the screenshot of the web page when we use Puppeteer only:
![NowSecure blocking the Puppeteer request](https://static.zenrows.com/content/medium_nowsecure_blocked_125865ad10.webp)
As you can see from the result, we didn’t pass the check and couldn’t prevent Puppeteer detection on the web page.
Six Tricks to Avoid Detection with Puppeteer
One of the best ways to ensure a smooth crawling process is to avoid Puppeteer bot detection. Here’s how to prevent Puppeteer detection and avoid getting blocked while scraping:
1. Use Proxies
One of the most widely adopted anti-bot strategies is IP tracking, where the detection system monitors the requests each IP address sends to the website. When a single IP makes many requests in a short period, the anti-bot can flag the Puppeteer scraper.
To avoid detection in Puppeteer, you can use proxies, which act as a gateway between your scraper and the internet: each request is routed through the proxy, and the response data is relayed back to you.
To do this, we can add a proxy to the `args` parameter while launching Puppeteer.
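A sketch of that launch call, where the proxy address is a placeholder you’d replace with one from your provider:

```js
const puppeteer = require('puppeteer');

(async () => {
  // Placeholder proxy; swap in a real host and port from your provider.
  const proxy = 'http://proxy.example.com:8080';

  const browser = await puppeteer.launch({
    args: [`--proxy-server=${proxy}`],
  });

  const page = await browser.newPage();
  await page.goto('https://httpbin.org/ip'); // check which IP the target sees
  console.log(await page.evaluate(() => document.body.innerText));
  await browser.close();
})();
```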
That’s it! Your Puppeteer scraper now routes its traffic through the proxy, which makes IP-based detection much harder.
Frustrated that your web scrapers are blocked again and again?
ZenRows API handles rotating proxies and headless browsers for you.
2. Customize Headers
Headers carry context and metadata about the HTTP request, and they help the server identify whether the client is a normal web browser or a bot. Including the right headers in the HTTP request helps prevent detection.
Since Puppeteer identifies itself as `HeadlessChrome` by default, you can take it further by adding custom headers like `User-Agent`. It’s a popular header in web scraping that identifies the application, operating system, vendor, and version of the request.
There are several methods to add a header in Puppeteer, but the easiest one is to add it while opening a new page.
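For example, here’s a sketch that sets a desktop Chrome `User-Agent` and an extra header right after opening the page (the header values are only illustrative):

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Replace the default HeadlessChrome User-Agent with a regular desktop one.
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
  );
  // Any other headers can be added the same way.
  await page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US,en;q=0.9' });

  await page.goto('https://httpbin.org/headers'); // echoes the headers it receives
  console.log(await page.evaluate(() => document.body.innerText));
  await browser.close();
})();
```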
3. Limit Requests
As previously discussed, an anti-bot can track a user’s activity through the number of requests sent. And since normal users don’t send hundreds of requests per second, limiting the number of requests and taking breaks between requests helps avoid Puppeteer detection.
To do this, you can use `page.setRequestInterception()` to limit the resources rendered in Puppeteer.
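A sketch of the idea, using `example.com` as a stand-in target:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Intercept every request and drop the ones for images.
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (request.resourceType() === 'image') {
      request.abort();
    } else {
      request.continue();
    }
  });

  await page.goto('https://example.com');
  await browser.close();
})();
```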
By calling `page.setRequestInterception(true)` and aborting the requests Puppeteer makes for images, we limit the number of requests. Plus, we get a faster scraper since there are fewer resources to load and wait for.
4. Mimic User Behavior
Anti-bot systems use machine learning to study real user behavior and compare bot behavior against it. Mimicking that behavior can help bypass bot detectors.
One way to achieve this is by waiting a random time before clicking or navigating to a new web page in Puppeteer. For example, a simple helper function can generate a random delay between five and 12 seconds.
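A sketch of such a helper (the name `randomDelay` is just for illustration):

```js
// Return a random delay between 5 and 12 seconds, in milliseconds.
function randomDelay() {
  const min = 5000;
  const max = 12000;
  return Math.floor(Math.random() * (max - min + 1)) + min;
}
```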
`page.waitForTimeout()` then waits for the generated delay before the script opens a new page.
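Inside an async Puppeteer script, that might look like this:

```js
// Pause for a human-like interval before opening the next page.
await page.waitForTimeout(randomDelay());
const newPage = await browser.newPage();
```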
5. Puppeteer-Stealth
The stealth plugin is helpful for Puppeteer scrapers because it exposes an API comparable to Puppeteer’s. By smoothing out the minute differences in browser fingerprints between headless Chrome and a real Chrome browser, it can hide the browser’s headless status.
Let’s see how to prevent detection using Puppeteer-Stealth!
Step 1: Install Puppeteer-Stealth
You can do that with the following command:
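```bash
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
```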
Step 2: Configure Puppeteer-Stealth
To configure puppeteer-stealth, create a new JavaScript file called `index.js` and add the following code:
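(A minimal sketch; the variable names are assumptions, and the page logic comes in the next step.)

```js
const puppeteer = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');
const chromium = require('puppeteer'); // vanilla Puppeteer, used only for its executable path

// Register the stealth plugin so puppeteer-extra patches the headless fingerprint.
puppeteer.use(pluginStealth());

(async () => {
  const browser = await puppeteer.launch({
    // Tell puppeteer-extra which Chromium binary to run.
    executablePath: chromium.executablePath(),
    // Open a visible window so we can watch what happens.
    headless: false,
  });

  // ...the page logic from the next step goes here...
})();
```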
Here’s what happened in the code above:
First, we imported `puppeteer` and `puppeteer-extra`, a lightweight wrapper around Puppeteer. Then, we also imported `puppeteer-extra-plugin-stealth` to avoid Puppeteer detection; it’s enabled with `puppeteer.use(pluginStealth())`.
Puppeteer’s executable path is saved in `executablePath`, which tells `puppeteer-extra` which browser binary to launch. And finally, we launched Puppeteer with the option `headless: false` to see what happens in the browser.
Step 3: Take a Screenshot
After launching the browser, we create a new page using `await browser.newPage()` and set the viewport. Puppeteer then visits the target website and waits ten seconds for the security check. After that, it takes a screenshot and closes the browser.
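Continuing inside the same async function from the previous step, a sketch of that logic (the viewport size and file name are arbitrary choices):

```js
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 720 });

  // Visit the bot-check page and give the security check ten seconds to run.
  await page.goto('https://nowsecure.nl');
  await page.waitForTimeout(10000);

  await page.screenshot({ path: 'nowsecure_stealth.png', fullPage: true });
  await browser.close();
```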
And here’s what the result should look like:
![NowSecure page loaded successfully with Puppeteer-Stealth](https://static.zenrows.com/content/large_nowsecure_success_5cd6cc1848.webp)
Congratulations! You have prevented Puppeteer bot detection. Let’s take it a step further and scrape the page.
Step 4: Scrape the Page 😀
Find the selector of each element you want to scrape. To do this, right-click on an element and select “Inspect”. That opens Chrome DevTools, and you can get the selectors from the Elements tab.
![Chrome DevTools showing the element selectors](https://static.zenrows.com/content/large_nowsecure_devtools_058496b9a3.jpg)
Copy the selectors and use each one to get its text. The `querySelector()` method works perfectly for this purpose.
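For instance, still inside the same script, something along these lines (the selectors are placeholders for the ones you copied from DevTools):

```js
  // Placeholder selectors; replace them with the ones copied from DevTools.
  const title = await page.evaluate(
    () => document.querySelector('h1').textContent
  );
  const message = await page.evaluate(
    () => document.querySelector('p').textContent
  );
  console.log(title, message);
```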
The script reads the `.textContent` property to get the text of each element. The same process can be repeated for every element and saved to a variable. After running the code, this is what the output should look like:
![Console output with the text scraped from NowSecure](https://static.zenrows.com/content/medium_nowsecure_text_268bb1e6da.webp)
Boom! You have solved your main problem and successfully prevented Puppeteer bot detection. Here’s what the complete code should be:
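(Pieced together from the steps above; the target URL, selectors, and file names are assumptions.)

```js
const puppeteer = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');
const chromium = require('puppeteer'); // vanilla Puppeteer, used only for its executable path

// Register the stealth plugin to patch the headless fingerprint.
puppeteer.use(pluginStealth());

(async () => {
  const browser = await puppeteer.launch({
    executablePath: chromium.executablePath(),
    headless: false,
  });

  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 720 });

  // Visit the bot-check page and give the security check time to finish.
  await page.goto('https://nowsecure.nl');
  await page.waitForTimeout(10000);

  // Grab the text of the elements found with DevTools (placeholder selectors).
  const title = await page.evaluate(
    () => document.querySelector('h1').textContent
  );
  const message = await page.evaluate(
    () => document.querySelector('p').textContent
  );
  console.log(title, message);

  await page.screenshot({ path: 'nowsecure_stealth.png', fullPage: true });
  await browser.close();
})();
```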
Limitations of Puppeteer-Stealth
Puppeteer-Stealth is a good solution to avoid bot detection, but it has limitations:
- It can’t bypass advanced anti-bot systems.
- Puppeteer runs in headless mode, which makes it difficult to scale and scrape large amounts of data.
- Headless browsers such as Puppeteer are difficult to debug.
Unfortunately, Puppeteer-Stealth fails against websites with advanced anti-bots. For example, we tried to scrape Okta with it:
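(In essence, the same stealth script as above with the target swapped out; the Okta URL is assumed.)

```js
  // Same launch and stealth setup as before, only the target changes.
  await page.goto('https://www.okta.com/');
  await page.waitForTimeout(10000);
  await page.screenshot({ path: 'okta.png', fullPage: true });
```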
Here’s what we got:
![Okta blocking the Puppeteer-Stealth request](https://static.zenrows.com/content/large_okta_blocked_abe31a125b.webp)
We got blocked straight off! Fortunately, we have a solution that can bypass advanced anti-bots: ZenRows. That is our next tip to avoid detection.
6. ZenRows
ZenRows is an all-in-one web scraping tool that uses a single API call to handle all anti-bot bypassing. It helps with rotating proxies, headless browsers and CAPTCHAs.
Using ZenRows to scrape Okta, here’s what we got:
![Okta page scraped successfully with ZenRows](https://static.zenrows.com/content/large_zenrows_okta_success_595a60e0cd.webp)
Conclusion
There are different methods to avoid detection with Puppeteer, and in this article we discussed the best and easiest ways to go about it.
You can use proxies, custom headers, request limiting, or Puppeteer-Stealth to get the job done, but these methods share a limitation: they fail when it comes to bypassing advanced anti-bots.
ZenRows handles all anti-bot bypass for you, from rotating proxies and headless browsers to CAPTCHAs with a single API call. And you can get started for free.