How to Solve Cloudflare when Web Scraping in 2024 | Step by Step Guide

CapSolver Blogger

How to use capsolver

07-May-2024

Common Cloudflare Status Codes When Web Scraping

Error 1020

Cloudflare Error 1020 indicates that access has been denied. This error is triggered when a rule of the website's firewall, which is protected by Cloudflare, is violated. Various actions, such as making excessive requests to the site, can lead to this violation.

Common ways to fix this problem:

  • Employ a rotating proxy to mask your IP address.
  • Alter and rotate User-Agent headers.
  • If you use the requests library, make sure your HTTP client can present browser-like TLS settings.
  • Utilize browser automation tools like Puppeteer, Playwright, or Selenium.
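
The first two fixes can be combined in a small rotation helper. The sketch below is only an illustration: `make_rotator`, the proxy URLs, and the User-Agent strings are hypothetical placeholders, and it assumes you pass the returned pair to whatever HTTP client you use.

```python
import itertools

# Hypothetical pools; replace with your own proxies and User-Agent strings.
PROXIES = [
    "http://user:pass@203.0.113.10:8000",
    "http://user:pass@203.0.113.11:8000",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def make_rotator(proxies, user_agents):
    """Return a function that yields the next (proxy, headers) pair on each call."""
    proxy_cycle = itertools.cycle(proxies)
    ua_cycle = itertools.cycle(user_agents)

    def next_identity():
        # Advance both pools independently so the combinations vary over time.
        proxy = next(proxy_cycle)
        headers = {"user-agent": next(ua_cycle)}
        return proxy, headers

    return next_identity


next_identity = make_rotator(PROXIES, USER_AGENTS)
```

Each call to `next_identity()` returns a proxy and a headers dict for one request. Round-robin rotation is the simplest choice; a random pick per request would work as well.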

Error 1015 / 429 rate limit

Cloudflare Error 1015 occurs when Cloudflare flags and bans your IP address for exceeding a website's rate limit during scraping. It typically surfaces as an HTTP 429 response.

Common ways to fix this problem:

  • Use rotating proxies or a large proxy pool.
  • Check that the website cannot track you through your headers; some headers let the site identify you and rate limit your requests.
  • Make sure you are not being fingerprinted through TLS, TCP, or any other fingerprinting method.
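
Besides spreading requests across proxies, it usually helps to slow down after a rate-limit response. The following is a minimal sketch of exponential backoff with jitter; `backoff_delay` and its parameters are hypothetical names, and it assumes the response may carry a standard Retry-After header, which not every site sends.

```python
import random


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based) after a 429 / 1015.

    If a numeric Retry-After value is available, honor it; otherwise use
    exponential backoff with full jitter, capped at `cap` seconds.
    """
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except (TypeError, ValueError):
            pass  # Retry-After can also be an HTTP date; fall back to backoff.
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

A caller would sleep for `backoff_delay(attempt, response.headers.get("Retry-After"))` seconds before retrying, ideally through a different proxy.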

Error 403

A 403 status code is a Forbidden response status, issued by a server when it acknowledges a request as valid but refuses to fulfill it. This may occur due to missing necessary headers in your request, such as CORS, JWT, or authentication headers that the server expects.

If the website is generally accessible and adding the correct headers does not resolve the issue, the server may be detecting your requests as automated.

Common causes:

  • The request is missing the correct headers or other expected request information.
  • The proxy has been banned.
  • The site requires a JavaScript challenge to be solved; see the Cloudflare Challenge 5s section below.
  • The website does not allow any traffic.
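
To tell these cases apart in code, you can inspect the status code and the response body. This sketch reuses the `window._cf_chl_opt` marker that the Python example later in this guide checks for; `classify_response` and its return labels are hypothetical, and the marker is a heuristic that Cloudflare may change.

```python
def classify_response(status_code, body):
    """Roughly classify a response from a Cloudflare-protected site.

    Returns one of: "ok", "rate_limited", "challenge", "blocked".
    """
    if status_code == 429:
        return "rate_limited"
    if status_code != 403:
        return "ok"
    # A 403 that embeds the challenge bootstrap object is a JavaScript challenge.
    if "window._cf_chl_opt" in body:
        return "challenge"
    # Any other 403: wrong headers, banned proxy, or traffic not allowed.
    return "blocked"
```

A "challenge" result means a solver is worth calling; a "blocked" result usually means changing the proxy or the headers first.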

Identify Cloudflare Challenge 5s and Cloudflare Turnstile Captcha

Cloudflare Challenge 5s

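The Cloudflare Challenge 5s is the full-page check that Cloudflare shows before letting a request through to the site. Sometimes this page also contains a Turnstile widget.

Verify that you need to solve the Cloudflare Challenge 5s and not just Turnstile. If the site uses only Turnstile, skip to the Cloudflare Turnstile Captcha section below.

There are some requirements when solving this challenge with CapSolver: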

  • Proxy
  • Capsolver API Key

Submitting task information to Capsolver

POST https://api.capsolver.com/createTask
Host: api.capsolver.com
Content-Type: application/json

{
  "clientKey": "YOUR_API_KEY",
  "task": {
    "type": "AntiCloudflareTask",
    "websiteURL": "https://www.yourwebsite.com",
    "proxy": "158.120.100.23:334:user:pass"
  }
}

After a successful submission, the API returns a taskId:

{
    "errorId": 0,
    "taskId": "014fc55c-46c9-41c8-9de7-6cb35d984edc",
    "status": "idle"
}

Take this taskId and use it to retrieve the result with the getTaskResult method.
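
Before polling, it is worth checking that the createTask response actually contains a taskId. A minimal sketch, assuming the response has already been parsed into a dict and that a non-zero errorId signals failure; `extract_task_id` is a hypothetical helper name.

```python
def extract_task_id(response_json):
    """Return the taskId from a parsed createTask response, or raise if there is none."""
    # A non-zero errorId means the task was not created.
    if response_json.get("errorId"):
        raise RuntimeError(f"createTask failed: {response_json}")
    task_id = response_json.get("taskId")
    if not task_id:
        raise RuntimeError(f"createTask returned no taskId: {response_json}")
    return task_id
```

Raising early keeps the polling loop from running against a task that was never created.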

Retrieve the result

POST https://api.capsolver.com/getTaskResult
Host: api.capsolver.com
Content-Type: application/json

{
  "clientKey": "YOUR_API_KEY",
  "taskId": "taskId"
}

Depending on system load, results arrive within 1 to 20 seconds.
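
Because the wait time varies, a polling loop with an overall deadline is safer than looping indefinitely. The sketch below is one way to structure it; `wait_for_solution` and `fetch_result` are hypothetical names, the 60-second default is an arbitrary choice, and `fetch_result` stands for any callable that returns the parsed getTaskResult JSON.

```python
import time


def wait_for_solution(fetch_result, timeout=60.0, interval=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch_result()` until the task is ready, fails, or `timeout` seconds pass."""
    deadline = clock() + timeout
    while clock() < deadline:
        result = fetch_result()
        status = result.get("status", "")
        if status == "ready":
            return result.get("solution")
        # Stop immediately on an explicit failure instead of waiting for the deadline.
        if status == "failed" or result.get("errorId"):
            raise RuntimeError(f"task failed: {result}")
        sleep(interval)
    raise TimeoutError(f"no solution after {timeout} seconds")
```

With the requests library, `fetch_result` could be `lambda: requests.post("https://api.capsolver.com/getTaskResult", json={"clientKey": API_KEY, "taskId": task_id}).json()`. The `sleep` and `clock` parameters exist only so the loop can be tested without real waiting.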

If you receive ERROR_CAPTCHA_SOLVE_FAILED in the response, there are several possible reasons:

  • Your proxy does not trigger the Cloudflare Challenge 5s. Some websites enable it only for bad proxies, bot-like actions, or anything else that suggests the request comes from a bot; others enable it all the time, depending on their configuration.
  • Your proxy is banned by Cloudflare and is stuck in a loop that cannot pass the challenge.
  • The website does not use the Cloudflare Challenge. Verify that it is the Challenge and not Turnstile.
  • The proxy is timing out, which is common with residential proxies.

A successful response looks like:

{
    "errorId": 0,
    "taskId": "d8d3a8b4-30cc-4b09-802a-a476ca17fa54",
    "status": "ready",
    "solution": {
        "accept-language": "en-US,en;q=0.9",
        "cookies": {

        },
        "headers": {
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "en-US,en;q=0.9",
            "cache-control": "max-age=0",
            "content-type": "application/x-www-form-urlencoded",
            "origin": "",
            "referer": "",
            "sec-ch-ua": "\"Not_A Brand\";v=\"8\", \"Chromium\";v=\"120\", \"Google Chrome\";v=\"120\"",
            "sec-ch-ua-arch": "\"arm\"",
            "sec-ch-ua-bitness": "\"64\"",
            "sec-ch-ua-full-version": "\"120.0.6099.71\"",
            "sec-ch-ua-full-version-list": "\"Not_A Brand\";v=\"8.0.0.0\", \"Chromium\";v=\"120.0.6099.71\", \"Google Chrome\";v=\"120.0.6099.71\"",
            "sec-ch-ua-mobile": "?0",
            "sec-ch-ua-model": "\"\"",
            "sec-ch-ua-platform": "\"macOS\"",
            "sec-ch-ua-platform-version": "\"10.14.6\"",
            "sec-fetch-dest": "document",
            "sec-fetch-mode": "navigate",
            "sec-fetch-site": "same-origin",
            "sec-fetch-user": "?1",
            "upgrade-insecure-requests": "1",
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        },
        "proxy": "your proxy",
        "token": "cf clearance token",
        "type": "challenge",
        "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    }
}

From this response, you need to parse the values of cookies, headers, and token.

Your request must meet these conditions:

  • The request headers must match the ones returned in the response.
  • The request cookies must match the ones returned in the response.
  • The request client must support custom TLS settings; in this case, use the Chrome 120 TLS profile.
  • Use the same proxy to interact with the website as the one used to obtain the cf_clearance cookie.
  • The token value is the value of the cf_clearance cookie, which you need to create.
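
The header and cookie rules above can be sketched as a small helper that turns the solution object into request inputs. `build_request_inputs` is a hypothetical name, and the sketch assumes the solution has the shape shown in the sample response (headers, cookies, token, userAgent).

```python
def build_request_inputs(solution):
    """Return (headers, cookies) for the follow-up request, based on a challenge solution."""
    headers = dict(solution.get("headers") or {})
    cookies = dict(solution.get("cookies") or {})
    token = solution.get("token")
    # The token is the cf_clearance value; create the cookie if it is not already present.
    if token and "cf_clearance" not in cookies:
        cookies["cf_clearance"] = token
    # Keep the User-Agent identical to the one the challenge was solved with.
    if solution.get("userAgent"):
        headers["user-agent"] = solution["userAgent"]
    return headers, cookies
```

The returned pair is then passed to a TLS-capable client, through the same proxy that was submitted with the task.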

Example: solving the Cloudflare Challenge with Python

# -*- coding: utf-8 -*-
import requests
import time
import tls_client

# TODO: Your api key
API_KEY = ""
proxy = ""

# TODO: Your target site url:
page_url = ''



def call_capsolver():
    data = {
        "clientKey": API_KEY,
        "task": {
            "type": 'AntiCloudflareTask',
            "websiteURL": page_url,
            "proxy": proxy,
        }
    }
    uri = 'https://api.capsolver.com/createTask'
    res = requests.post(uri, json=data)
    resp = res.json()
    task_id = resp.get('taskId')
    if not task_id:
        print("failed to create task:", res.text)
        return
    print('created taskId:', task_id)

    while True:
        time.sleep(1)
        data = {
            "clientKey": API_KEY,
            "taskId": task_id
        }
        response = requests.post('https://api.capsolver.com/getTaskResult', json=data)
        resp = response.json()
        status = resp.get('status', '')
        if status == "ready":
            print("successfully => ", response.text)
            return resp.get('solution')
        if status == "failed" or resp.get("errorId"):
            print("failed! => ", response.text)
            return


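def request_site(solution):
    session = tls_client.Session(
        client_identifier="chrome_120",
        random_tls_extension_order=True
    )
    cookies = dict(solution.get('cookies') or {})
    # The token is the cf_clearance cookie value; add it if it is not already set.
    if solution.get('token') and 'cf_clearance' not in cookies:
        cookies['cf_clearance'] = solution['token']
    return session.get(
        page_url,
        headers=solution.get('headers'),
        cookies=cookies,
        proxy=proxy,
    )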


def main():
    solution = {
        "headers": {
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
            "upgrade-insecure-requests": "1",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "sec-fetch-site": "none",
            "sec-fetch-mode": "navigate",
            "sec-fetch-user": "?1",
            "sec-fetch-dest": "document",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "en-US,en;q=0.9",
        }
    }
    # first request (check your proxy):
    res = request_site(solution)
    print('1. response status code:', res.status_code)
    if res.status_code != 403:
        print("your proxy is good and didn't get the cloudflare challenge")
        return
    elif 'window._cf_chl_opt' not in res.text:
        print('==== proxy blocked ==== ')
        return

    # call capSolver:
    solution = call_capsolver()
    if not solution:
        return

    # second request (verify solution):
    res = request_site(solution)
    print('2. response status code:', res.status_code)


if __name__ == '__main__':
    main()

Cloudflare Turnstile Captcha

Cloudflare Turnstile Captcha comes in three variants:

  • Managed challenge

  • Non-interactive challenge

  • Invisible challenge
    Not visible on the page; check the network requests and loaded scripts to see whether Turnstile is used.
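
For the invisible variant, one way to check the loaded scripts is to scan the page HTML for the Turnstile script. This is a rough sketch: `uses_turnstile` is a hypothetical helper, and it assumes the script is served from challenges.cloudflare.com/turnstile and is present in the static HTML. Scripts injected at runtime will only show up in the browser's network tab.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the src attribute of every <script> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def uses_turnstile(html):
    """Return True if the page loads a script from the Turnstile endpoint."""
    collector = ScriptCollector()
    collector.feed(html)
    return any("challenges.cloudflare.com/turnstile" in src for src in collector.sources)
```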

Verify that you need to solve Cloudflare Turnstile Captcha and not the Cloudflare Challenge 5s. For the Cloudflare Challenge, see the Cloudflare Challenge 5s section above.

There are some requirements when solving this captcha with CapSolver:

  • Capsolver API Key

Submitting task information to Capsolver

POST https://api.capsolver.com/createTask
Host: api.capsolver.com
Content-Type: application/json

{
  "clientKey": "YOUR_API_KEY",
  "task": {
    "type": "AntiTurnstileTaskProxyLess",
    "websiteURL": "https://www.yourwebsite.com",
    "websiteKey": "0x4XXXXXXXXXXXXXXXXX",
    "metadata": {
       "action": "login",  //optional
       "cdata": "0000-1111-2222-3333-example-cdata"  //optional
    }
  }
}

"action" and "cdata" are optional; whether they are required depends on the website's configuration.
action is the value of the data-action attribute of the Turnstile element, if it exists.
cdata is the value of the data-cdata attribute of the Turnstile element, if it exists.
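
These attributes can be read from the page HTML and placed into the task object. The sketch below assumes the widget is rendered as an element with class cf-turnstile and a data-sitekey attribute, which is common but not universal, since sites that render the widget from JavaScript will not have it in the static HTML. `build_turnstile_task` is a hypothetical helper.

```python
from html.parser import HTMLParser


class TurnstileFinder(HTMLParser):
    """Remember the attributes of the first element that looks like a Turnstile widget."""

    def __init__(self):
        super().__init__()
        self.found = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.found is None and "cf-turnstile" in classes and attrs.get("data-sitekey"):
            self.found = attrs


def build_turnstile_task(html, page_url):
    """Build the createTask "task" object from a page's Turnstile element."""
    finder = TurnstileFinder()
    finder.feed(html)
    if finder.found is None:
        raise ValueError("no cf-turnstile element with data-sitekey found")
    task = {
        "type": "AntiTurnstileTaskProxyLess",
        "websiteURL": page_url,
        "websiteKey": finder.found["data-sitekey"],
    }
    # Only include metadata keys that the page actually defines.
    metadata = {}
    if finder.found.get("data-action"):
        metadata["action"] = finder.found["data-action"]
    if finder.found.get("data-cdata"):
        metadata["cdata"] = finder.found["data-cdata"]
    if metadata:
        task["metadata"] = metadata
    return task
```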

After a successful submission, the API returns a taskId:

{
    "errorId": 0,
    "taskId": "014fc55c-46c9-41c8-9de7-6cb35d984edc",
    "status": "idle"
}

Take this taskId and use it to retrieve the result with the getTaskResult method.

Retrieve the result

POST https://api.capsolver.com/getTaskResult
Host: api.capsolver.com
Content-Type: application/json

{
  "clientKey": "YOUR_API_KEY",
  "taskId": "taskId"
}

Depending on system load, results arrive within 1 to 20 seconds.

If you receive ERROR_CAPTCHA_SOLVE_FAILED in the response, there are several possible reasons:

  • Your proxy does not trigger the Cloudflare Challenge 5s. Some websites enable it only for bad proxies, bot-like actions, or anything else that suggests the request comes from a bot; others enable it all the time, depending on their configuration.
  • Your proxy is banned by Cloudflare and is stuck in a loop that cannot pass the challenge.
  • The website does not use Turnstile. Verify that it is Turnstile and not the Cloudflare Challenge.
  • The proxy is timing out, which is common with residential proxies.

A successful response looks like:

{
    "errorId": 0,
    "taskId": "d1e1487a-2cd8-4d4a-aa4d-4ba5b6c65484",
    "status": "ready",
    "solution": {
        "token": "0.cZJPqwnyDxL86HvAXSk4lUTQhjwfyXDcR3qpVwFofuzosoKr1otKj_A-utazXx_Tnp1B2V6womrltBpRw9HbY851ktpaF7sBN-gQwtoRUew4Wj5PO4-WLYPnNRpXxludXzyQ.1oHJhu7619fb8c07ab942bd1587bc76e0e3cef95c7aa75400c4f7d3",
        "type": "turnstile",
        "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    }
}

From this response, parse the value of token; this is the captcha solution that you submit to the website.
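
How the token is submitted depends on the site. Many sites that use the standard Turnstile widget expect it in a form field named cf-turnstile-response, but that is an assumption to verify against the real request in your browser's network tab. A minimal sketch with a hypothetical helper, `with_turnstile_token`:

```python
def with_turnstile_token(form_fields, token, field_name="cf-turnstile-response"):
    """Return a copy of the form fields with the solved Turnstile token added."""
    payload = dict(form_fields)
    # The field name is the widget's default; some sites use a custom name or a JSON body.
    payload[field_name] = token
    return payload
```

For example, `with_turnstile_token({"email": "user@example.com"}, solution["token"])` gives the payload for a login form POST.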

Example: solving Cloudflare Turnstile with Python

import time
from curl_cffi import requests

CAPSOLVER_API_KEY = "Your CAPSOLVER.COM API KEY"
PAGE_URL = ""
WEBSITE_KEY = ""

def solvecf(metadata_action=None, metadata_cdata=None):
    url = "https://api.capsolver.com/createTask"
    task = {
        "type": "AntiTurnstileTaskProxyLess",
        "websiteURL": PAGE_URL,
        "websiteKey": WEBSITE_KEY,
    }
    if metadata_action or metadata_cdata:
        task["metadata"] = {}
        if metadata_action:
            task["metadata"]["action"] = metadata_action
        if metadata_cdata:
            task["metadata"]["cdata"] = metadata_cdata
    data = {
        "clientKey": CAPSOLVER_API_KEY,
        "task": task
    }
    response_data = requests.post(url, json=data).json()
    print(response_data)
    return response_data['taskId']


def solutionGet(taskId):
    url = "https://api.capsolver.com/getTaskResult"
    while True:
        data = {"clientKey": CAPSOLVER_API_KEY, "taskId": taskId}
        response_data = requests.post(url, json=data).json()
        print(response_data)
        status = response_data.get('status', '')
        if status == "ready":
            return response_data['solution']
        # Stop polling if the task failed instead of looping forever.
        if status == "failed" or response_data.get("errorId"):
            return None

        time.sleep(2)


def main():
    start_time = time.time()

    taskId = solvecf()
    solution = solutionGet(taskId)
    if solution:
        # Only read the solution fields when the task actually succeeded.
        user_agent = solution['userAgent']
        token = solution['token']
        print("User_Agent:", user_agent)
        print("Solved Turnstile Captcha, token:", token)
    else:
        print("Failed to solve the Turnstile Captcha")

    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Time to solve the captcha: {elapsed_time} seconds")

if __name__ == "__main__":
    main()

Example: solving Cloudflare Turnstile with NodeJS

const axios = require('axios');

const CAPSOLVER_API_KEY = "";
const PAGE_URL = "";
const WEBSITE_KEY = "";

async function solvecf(metadata_action = null, metadata_cdata = null) {
    const url = "https://api.capsolver.com/createTask";
    const task = {
        type: "AntiTurnstileTaskProxyLess",
        websiteURL: PAGE_URL,
        websiteKey: WEBSITE_KEY,
    };
    if (metadata_action || metadata_cdata) {
        task.metadata = {};
        if (metadata_action) {
            task.metadata.action = metadata_action;
        }
        if (metadata_cdata) {
            task.metadata.cdata = metadata_cdata;
        }
    }
    const data = {
        clientKey: CAPSOLVER_API_KEY,
        task: task
    };
    const response = await axios.post(url, data);
    console.log(response.data);
    return response.data.taskId;
}

async function solutionGet(taskId) {
    const url = "https://api.capsolver.com/getTaskResult";
    let status = "";
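    while (status !== "ready") {
        const data = { clientKey: CAPSOLVER_API_KEY, taskId: taskId };
        const response = await axios.post(url, data);
        console.log(response.data);
        status = response.data.status;
        console.log(status);
        if (status === "ready") {
            return response.data.solution;
        }
        // Stop polling if the task failed instead of looping forever.
        if (status === "failed" || response.data.errorId) {
            throw new Error(`Task failed: ${JSON.stringify(response.data)}`);
        }
        await new Promise(resolve => setTimeout(resolve, 2000));
    }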
}

async function main() {
    const start_time = Date.now();
    
    const taskId = await solvecf();
    const solution = await solutionGet(taskId);
    if (solution) {
        const user_agent = solution.userAgent;
        const token = solution.token;

        console.log("User_Agent:", user_agent);
        console.log("Solved Turnstile Captcha, token:", token);
    }

    const end_time = Date.now();
    const elapsed_time = (end_time - start_time) / 1000;
    console.log(`Time to solve the captcha: ${elapsed_time} seconds`);
}

main().catch(console.error);
