博客
How to do Web Scraping with Puppeteer and NodeJS in 2024 | Puppeteer tutorial

How to do Web Scraping with Puppeteer and NodeJS in 2024 | Puppeteer tutorial

Logo of Capsolver

CapSolver Blogger

How to use capsolver

17-Apr-2024

Web scraping is a powerful technique used to extract data from websites. In this tutorial, we will explore how to perform web scraping using Puppeteer and Node.js, two popular technologies in the web development ecosystem. Puppeteer is a Node.js library that provides a high-level API for controlling headless Chrome or Chromium browsers. It allows us to automate browser actions, navigate through web pages, and extract the desired data. By combining Puppeteer with the flexibility of Node.js, we can build robust and efficient web scraping solutions. Let's dive into the steps involved in scraping websites using Puppeteer in 2024.

So What is Puppeteer?

Puppeteer is a cutting-edge framework that enables testers to conduct headless browser testing with Google Chrome. With Puppeteer testing, testers can execute JavaScript commands to interact with web pages, including actions like clicking links, filling out forms, and submitting buttons.

Developed by Google, Puppeteer is a Node.js library that allows for seamless control of headless Chrome through the DevTools Protocol. It provides a range of high-level APIs that facilitate automated testing, website feature development, debugging, element inspection, and performance profiling.

With Puppeteer, you can use (headless) Chromium or Chrome to open websites, fill forms, click buttons, extract data and generally perform any action that a human could when using a computer. This makes Puppeteer a really powerful tool for web scraping, but also for automating complex workflows on the web. Having a clear understanding of Puppeteer and its capabilities is invaluable for both testers and developers in the modern web development landscape.

What Are the Advantages of Using Puppeteer for Web Scraping?

Axios and Cheerio are excellent options for scraping with JavaScript. However, that poses two issues: crawling dynamic content and anti-scraping software. Since Puppeteer is a headless browser, it has no problem scraping dynamic content.
Also Puppeteer offers a series of significant advantages for web scraping:

  1. Headless Browser Automation: With Puppeteer, you can control a headless Chrome browser programmatically, enabling automation of browser actions like clicking, scrolling, filling forms, and data extraction without a visible browser window.

  2. Full Chrome Functionality and DOM Manipulation: Puppeteer provides access to Chrome's complete functionality, making it suitable for scraping modern websites with JavaScript-heavy content. You can easily interact with page elements, modify attributes, and perform actions such as clicking buttons or submitting forms.

  3. Simulated User Interactions and Event Capture: Puppeteer allows you to simulate user interactions and capture network requests and responses. This enables scraping of pages that require user input or dynamically load content through AJAX or WebSocket requests.

  4. Performance and Debugging Capabilities: Puppeteer's optimized Chrome engine ensures efficient scraping, and its integration with DevTools offers robust debugging and testing capabilities. You can debug web pages, log console messages, trace network activity, and analyze performance metrics.

In the following guides, I will explore the process of web scraping using Puppeteer and Node.js, along with integrating a cutting-edge CAPTCHA solving solution, Capsolver, to overcome one of the major challenges encountered during web scraping.

Bonus Code

A bonus code for top captcha solutions; CapSolver: WEBS. After redeeming it, you will get an extra 5% bonus after each recharge, Unlimited

How to Solve Captcha in Puppeteer using CapSolver while Web Scraping

The goal will be to solve the captcha located at recaptcha-demo.appspot.com using CapSolver.

Captcha Form

During the tutorial, we will take the following steps to solve the above Captcha:

  1. Install the required dependencies.
  2. Find the site key of the Captcha Form.
  3. Set up CapSolver.
  4. Solve the captcha.

Install Required Dependencies

To get started, we need to install the following dependencies for this tutorial:

  • capsolver-python: The official Python SDK for easy integration with the CapSolver API.
  • pyppeteer: pyppeteer is a Python port of Puppeteer.

Install these dependencies by running the following command:

python -m pip install pyppeteer capsolver-python

Now, Create a file named main.py where we will write the Python code for solving captchas.

touch main.py

Get Site Key of Captcha Form

The Site Key is a unique identifier provided by Google that uniquely identifies each Captcha.

To solve the Captcha, it is necessary to send the Site Key to CapSolver.

Let's find the Site Key of the Captcha Form by following these steps:

  1. Visit the Captcha Form.
Captcha Form
  1. Open Chrome Dev Tools by pressing Ctrl/Cmd + Shift + I.
  2. Go to the Elements tab and search for data-sitekey. Copy the attribute's value.
Site Key
  1. Store the Site Key in a secure place as it will be used in a later section when we submit the captcha to CapSolver.

Setup CapSolver

To solve captchas using CapSolver, you need to create a CapSolver account, add funds to your account, and obtain an API Key. Follow these steps to set up your CapSolver account:

  1. Sign up for a CapSolver account by visiting capsolver.com.
Sign Up
  1. Add funds to your CapSolver account using PayPal, Crypto Currencies, or other listed payment methods. Please note that the minimum deposit amount is $6, and additional taxes apply.
Add Funds
  1. Now, copy the API Key provided by CapSolver and store it securely for later usage.
Store API Key

Solving the Captcha

Now, we will proceed to solving the captcha using CapSolver. The overall process involves three steps:

  1. Launching the browser and visiting the captcha page using pyppeteer.
  2. Solving the captcha using CapSolver.
  3. Submitting the captcha response.

Read the following Code Snippets to understand these steps.
Launching the browser and visiting the captcha page:

# Launch the browser.
browser = await launch({'headless': False})

# Load the target page.
captcha_page_url = "https://recaptcha-demo.appspot.com/recaptcha-v2-checkbox.php"
page = await browser.newPage()
await page.goto(captcha_page_url)

Solving the captcha using CapSolver:

# Solve the reCAPTCHA using CapSolver.
capsolver = RecaptchaV2Task("YOUR_API_KEY")

site_key = "6LfW6wATAAAAAHLqO2pb8bDBahxlMxNdo9g947u9"
task_id = capsolver.create_task(captcha_page_url, site_key)
result = capsolver.join_task_result(task_id)

# Get the solved reCAPTCHA code.
code = result.get("gRecaptchaResponse")

Setting the solved captcha on the form and submitting it:

# Set the solved reCAPTCHA code on the form.
recaptcha_response_element = await page.querySelector('#g-recaptcha-response')
await page.evaluate(f'(element) => element.value = "{code}"', recaptcha_response_element)

# Submit the form.
submit_btn = await page.querySelector('button[type="submit"]')
await submit_btn.click()

Putting it all Together

Below is the complete code for the tutorial, which will solve the captcha using CapSolver.

import asyncio
from pyppeteer import launch
from capsolver_python import RecaptchaV2Task

# Following code solves a reCAPTCHA v2 challenge using CapSolver.
async def main():
    # Launch Browser.
    browser = await launch({'headless': False})

    # Load the target page.
    captcha_page_url = "https://recaptcha-demo.appspot.com/recaptcha-v2-checkbox.php"
    page = await browser.newPage()
    await page.goto(captcha_page_url)

    # Solve the reCAPTCHA using CapSolver.
    print("Solving captcha")
    capsolver = RecaptchaV2Task("YOUR_API_KEY")

    site_key = "6LfW6wATAAAAAHLqO2pb8bDBahxlMxNdo9g947u9"
    task_id = capsolver.create_task(captcha_page_url, site_key)
    result = capsolver.join_task_result(task_id)

    # Get the solved reCAPTCHA code.
    code = result.get("gRecaptchaResponse")
    print(f"Successfully solved the reCAPTCHA. The solve code is {code}")

    # Set the solved reCAPTCHA code on the form.
    recaptcha_response_element = await page.querySelector('#g-recaptcha-response')
    await page.evaluate(f'(element) => element.value = "{code}"', recaptcha_response_element)

    # Submit the form.
    submit_btn = await page.querySelector('button[type="submit"]')
    await submit_btn.click()

    # Pause the execution so you can see the screen after submission before closing the driver
    input("Captcha Submission Successfull. Press enter to continue")

    # Close Browser.
    await browser.close()

if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(main())

Paste the above code into your main.py file. Replace YOUR_API_KEY with your API Key and run the code.

You will observe that the captcha will be solved, and you will be greeted with a success page.

How to Solve Captcha in NodeJS using CapSolver while Web Scraping

Prerequisites

  • Proxy (Optional)
  • Node.JS installed
  • Capsolver API key

Step 1: Install Necessary Packages

Execute the following commands to install the required packages:

npm install axios

Node.JS Code for solve reCaptcha v2 without proxy

Here's a Node.JS sample script to accomplish the task:

const axios = require('axios');

const PAGE_URL = ""; // Replace with your Website
const SITE_KEY = ""; // Replace with your Website
const CLIENT_KEY = "";  // Replace with your CAPSOLVER API Key

async function createTask(payload) {
  try {
    const res = await axios.post('https://api.capsolver.com/createTask', {
      clientKey: CLIENT_KEY,
      task: payload
    });
    return res.data;
  } catch (error) {
    console.error(error);
  }
}
async function getTaskResult(taskId) {
    try {
        success = false;
        while(success == false){

            await sleep(1000);
        console.log("Getting task result for task ID: " + taskId);
      const res = await axios.post('https://api.capsolver.com/getTaskResult', {
        clientKey: CLIENT_KEY,
        taskId: taskId
      });
      if( res.data.status == "ready") {
        success = true;
        console.log(res.data)
        return res.data;
      }
    }
  
    } catch (error) {
      console.error(error);
      return null;
    }
  }
  

async function solveReCaptcha(pageURL, sitekey) {
  const taskPayload = {
    type: "ReCaptchaV2TaskProxyless",
    websiteURL: pageURL,
    websiteKey: sitekey,
  };
  const taskData = await createTask(taskPayload);
  return await getTaskResult(taskData.taskId);
}
function sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}
async function main() {
  try {
   
      const response = await solveReCaptcha(PAGE_URL, SITE_KEY );
      console.log(`Received token: ${response.solution.gReCaptcharesponse}`);
        
    }
catch (error) {
    console.error(`Error: ${error}`);
  }

}
main();

👀 More information

Conclusion:

In this tutorial, we have learned how to solve captchas using CapSolver while performing web scraping with Puppeteer and Node.js. By leveraging CapSolver's API, we can automate the captcha-solving process and make web scraping tasks more efficient and reliable. Remember to comply with the terms and conditions of the websites you scrape and use web scraping responsibly

更多

web scraping captcha solving
解决爬虫时遇到的CAPTCHA最好的方法

在Web爬取过程中,遇到验证码可能会带来相当大的挑战。本文将探讨在Web爬虫过程中遇到的不同类型的CAPTCHA,并讨论解决CAPTCHA的最佳方法。

The other captcha

28-Dec-2023

web scraping captcha solver
如何解决在爬虫的过程中遇到的CAPTCHA?

在本文中,我们将探讨为什么在Web爬虫过程中会遇到CAPTCHA,并讨论解决Web爬虫中CAPTCHA问题的最佳方法,重点关注Capsolver的集成。

The other captcha

27-Dec-2023

如何识别Queue-it captcha验证码
如何识别Queue-it captcha验证码

Queue-it是一个平台,提供在线流量管理解决方案,其中包括三种CAPTCHA工具,以帮助减轻机器人和滥用问题:Google ReCAPTCHA、Google ReCAPTCHA Invisible和Queue-it CAPTCHA。

The other captcha

13-Jul-2023

如何解决AWS WAF Captcha亚马逊验证码
如何解决AWS WAF Captcha亚马逊验证码

总之,解决AWS WAF Captcha可能是一项艰巨的任务,但是通过capsolver.com的帮助,可以快速高效地完成。通过本文步骤,您可以轻松解决AWS WAF Captcha。

The other captcha

13-Jul-2023

使用 CapSolver 识别文字图像验证码
使用 CapSolver 识别文字图像验证码

图像验证码通常作为网站上识别人类用户和机器人的一种常见安全措施。这些验证码通常要求用户在图像或一系列图像中识别特定元素。在本篇博客文章中,我们将指导您如何使用 CapSolver 解决图像验证码。

The other captcha

27-Jun-2023

如何使用图像识别自动绕过/识别 Amazon WA Captcha (AWS WAF) 验证码
如何使用图像识别自动绕过/识别 Amazon WA Captcha (AWS WAF) 验证码

通过CapSolver绕过Amazon WAF是一个简单的过程。它涉及使用createTask方法创建任务并提供必要的细节。请记住使用正确的任务类型并在任务对象结构中提供所需的属性。

The other captcha

09-Jun-2023