How to Solve Cloudflare when Web Scraping in 2024 | Step by Step Guide
Common Cloudflare Status Codes when doing WebScraping
Error 1020
Cloudflare Error 1020 indicates that access has been denied. This error is triggered when a rule of the website's firewall, which is protected by Cloudflare, is violated. Various actions, such as making excessive requests to the site, can lead to this violation.
Common ways to fix this problem:
- Employ a rotating proxy to mask your IP address.
- Alter and rotate User-Agent headers.
- Ensure your HTTP client supports TLS if utilizing the requests library.
- Utilize browser automation tools like Puppeteer, Playwright, or Selenium.
Error 1015 / 429 rate limit
Cloudflare Error 1015 happens when your IP address is flagged and banned by Cloudflare for exceeding a website's rate limit during scraping activities. This can lead to encountering this error.
Common ways to fix this problem:
- Use rotating proxies or a proxy pool that is big
- Check that the website can't track by your headers, some headers could make them track you and rate limit your requests
- Be sure that you are not getting fingerprinted from tls fingerprint, tcp, or any other way of fingerprint.
Error 403
A 403 status code is a Forbidden response status, issued by a server when it acknowledges a request as valid but refuses to fulfill it. This may occur due to missing necessary headers in your request, such as CORS, JWT, or authentication headers that the server expects.
If the website is generally accessible and adding the correct headers does not resolve the issue, it is possible that the server is detecting your requests as automated.
Common ways to fix this problem:
- You are not sending the correct headers, request information.
- Proxy has been banned
- Need to solve a javascript challenge, check the blog about Cloudflare Challenge 5s to understand how to solve this
- Website doesn't allow any traffic
Identify Cloudflare Challenge 5s and Cloudflare Turnstile Captcha
Cloudflare Challenge 5s
Cloudflare challenge looks like:
Sometimes, this page could have turnstile
Verify that you need to solve Cloudflare Challenge 5s and not just turnstile, for just turnstile, please keep reading this blog.
There are some requeriments when solving this challenge using Capsolver.
- Proxy
- Capsolver API Key
Submitting task information to Capsolver
POST https://api.capsolver.com/createTask
Host: api.capsolver.com
Content-Type: application/json
{
"clientKey": "YOUR_API_KEY",
"task": {
"type": "AntiCloudflareTask",
"websiteURL": "https://www.yourwebsite.com",
"proxy": "158.120.100.23:334:user:pass"
}
}
After submit correctly, API will return a taskId
{
"errorId": 0,
"taskId": "014fc55c-46c9-41c8-9de7-6cb35d984edc",
"status": "idle"
}
Obtain this taskId
value and use for retrieve the result using the getTaskResult
method
Retrieve the result
POST https://api.capsolver.com/getTaskResult
Host: api.capsolver.com
Content-Type: application/json
{
"clientKey": "YOUR_API_KEY",
"taskId": "taskId"
}
Depending on the system load, you will get the results within the interval of 1s
to 20s
If you receive ERROR_CAPTCHA_SOLVE_FAILED
in the response, could be several reasons:
- Your proxy don't need to solve cloudflare challenge 5s (Some websites just enable for bad proxies, bots actions or anything that could trigger that the request is made by a bot). Other times is enabled everytime, depends on the configuration.
- Your proxy is banned by Cloudflare and it's in a loop that can't pass the challenge
- Website don't use cloudflare challenge, verify that it's challenge and not turnstile, check the examples images.
- Proxy is giving timeouts, this is common when using Residentials Proxy
If you receive a success response, will look like:
{
"errorId": 0,
"taskId": "d8d3a8b4-30cc-4b09-802a-a476ca17fa54",
"status": "ready",
"solution": {
"accept-language": "en-US,en;q=0.9",
"cookies": {
},
"headers": {
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"accept-encoding": "gzip, deflate, br",
"accept-language": "en-US,en;q=0.9",
"cache-control": "max-age=0",
"content-type": "application/x-www-form-urlencoded",
"origin": "",
"referer": "",
"sec-ch-ua": "\"Not_A Brand\";v=\"8\", \"Chromium\";v=\"120\", \"Google Chrome\";v=\"120\"",
"sec-ch-ua-arch": "\"arm\"",
"sec-ch-ua-bitness": "\"64\"",
"sec-ch-ua-full-version": "\"120.0.6099.71\"",
"sec-ch-ua-full-version-list": "\"Not_A Brand\";v=\"8.0.0.0\", \"Chromium\";v=\"120.0.6099.71\", \"Google Chrome\";v=\"120.0.6099.71\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-model": "\"\"",
"sec-ch-ua-platform": "\"macOS\"",
"sec-ch-ua-platform-version": "\"10.14.6\"",
"sec-fetch-dest": "document",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "same-origin",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
},
"proxy": "your proxy",
"token": "cf clearance token",
"type": "challenge",
"userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
}
From this response, you will need to parse the values of cookies
, headers
, token
.
Your request will need to look like:
- Headers of the request must be the same as the ones that we returned you in the response
- Cookies of the request must be the same as the ones that we returned you in the response
- Request client must support TLS settings, in this case use TLS Chrome 120
- Use the same proxy for interact on the website and for use the cf_clearance cookie
- Token value will be the
cf_clearance
cookie value that you will need to create
Example for solve Cloudflare Challenge with Python
# -*- coding: utf-8 -*-
import requests
import time
import tls_client
# TODO: Your api key
API_KEY = ""
proxy = ""
# TODO: Your target site url:
page_url = ''
def call_capsolver():
data = {
"clientKey": API_KEY,
"task": {
"type": 'AntiCloudflareTask',
"websiteURL": page_url,
"proxy": proxy,
}
}
uri = 'https://api.capsolver.com/createTask'
res = requests.post(uri, json=data)
resp = res.json()
task_id = resp.get('taskId')
if not task_id:
print("no get taskId:", res.text)
return
print('created taskId:', task_id)
while True:
time.sleep(1)
data = {
"clientKey": API_KEY,
"taskId": task_id
}
response = requests.post('https://api.capsolver.com/getTaskResult', json=data)
resp = response.json()
status = resp.get('status', '')
if status == "ready":
print("successfully => ", response.text)
return resp.get('solution')
if status == "failed" or resp.get("errorId"):
print("failed! => ", response.text)
return
def request_site(solution):
session = tls_client.Session(
client_identifier="chrome_120",
random_tls_extension_order=True
)
return session.get(
page_url,
headers=solution.get('headers'),
cookies=solution.get('cookies'),
proxy=proxy,
)
def main():
solution = {
"headers": {
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"sec-fetch-site": "none",
"sec-fetch-mode": "navigate",
"sec-fetch-user": "?1",
"sec-fetch-dest": "document",
"accept-encoding": "gzip, deflate, br",
"accept-language": "en-US,en;q=0.9",
}
}
# first request (check your proxy):
res = request_site(solution)
print('1. response status code:', res.status_code)
if res.status_code != 403:
print("your proxy is good and didn't get the cloudflare challenge")
return
elif 'window._cf_chl_opt' not in res.text:
print('==== proxy blocked ==== ')
return
# call capSolver:
solution = call_capsolver()
if not solution:
return
# second request (verify solution):
res = request_site(solution)
print('2. response status code:', res.status_code)
if __name__ == '__main__':
main()
Cloudflare Turnstile Captcha
Cloudflare Turnstile Captcha looks like:
-
Managed challenge
-
Non-interactive challenge
-
Invisible challenge
not visible, you can check on the network / scripts loaded and see if turnstile is used
Verify that you need to solve Cloudflare Turnstile Captcha and not Cloudflare Challenge 5s, for just Cloudflare Challenge, please keep reading this blog.
There are some requeriments when solving this challenge using Capsolver.
- Capsolver API Key
Submitting task information to Capsolver
POST https://api.capsolver.com/createTask
Host: api.capsolver.com
Content-Type: application/json
{
"clientKey": "YOUR_API_KEY",
"task": {
"type": "AntiTurnstileTaskProxyLess",
"websiteURL": "https://www.yourwebsite.com",
"websiteKey": "0x4XXXXXXXXXXXXXXXXX",
"metadata": {
"action": "login", //optional
"cdata": "0000-1111-2222-3333-example-cdata" //optional
}
}
}
"action" and "cdata" is optional, sometimes will be required and sometimes not.
Depends on the configuration of the website.
action is the value of the data-action attribute of the Turnstile element if it exists.
cdata is the value of the data-cdata attribute of the Turnstile element if it exists.
After submit correctly, API will return a taskId
{
"errorId": 0,
"taskId": "014fc55c-46c9-41c8-9de7-6cb35d984edc",
"status": "idle"
}
Obtain this taskId
value and use for retrieve the result using the getTaskResult
method
Retrieve the result
POST https://api.capsolver.com/getTaskResult
Host: api.capsolver.com
Content-Type: application/json
{
"clientKey": "YOUR_API_KEY",
"taskId": "taskId"
}
Depending on the system load, you will get the results within the interval of 1s
to 20s
If you receive ERROR_CAPTCHA_SOLVE_FAILED
in the response, could be several reasons:
- Your proxy don't need to solve cloudflare challenge 5s (Some websites just enable for bad proxies, bots actions or anything that could trigger that the request is made by a bot). Other times is enabled everytime, depends on the configuration.
- Your proxy is banned by Cloudflare and it's in a loop that can't pass the challenge
- Website don't use cloudflare challenge, verify that it's challenge and not turnstile, check the examples images.
- Proxy is giving timeouts, this is common when using Residentials Proxy
If you receive a success response, will look like:
{
"errorId": 0,
"taskId": "d1e1487a-2cd8-4d4a-aa4d-4ba5b6c65484",
"status": "ready",
"solution": {
"token": "0.cZJPqwnyDxL86HvAXSk4lUTQhjwfyXDcR3qpVwFofuzosoKr1otKj_A-utazXx_Tnp1B2V6womrltBpRw9HbY851ktpaF7sBN-gQwtoRUew4Wj5PO4-WLYPnNRpXxludXzyQ.1oHJhu7619fb8c07ab942bd1587bc76e0e3cef95c7aa75400c4f7d3",
"type": "turnstile",
"userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
}
From this response, you will need to parse the values of token
and this will be the captcha solution that you will need to submit to the website.
Example for solve Cloudflare Challenge with Python
import time
from curl_cffi import requests
CAPSOLVER_API_KEY = "Your CAPSOLVER.COM API KEY"
PAGE_URL = ""
WEBSITE_KEY = ""
def solvecf(metadata_action=None, metadata_cdata=None):
url = "https://api.capsolver.com/createTask"
task = {
"type": "AntiTurnstileTaskProxyLess",
"websiteURL": PAGE_URL,
"websiteKey": WEBSITE_KEY,
}
if metadata_action or metadata_cdata:
task["metadata"] = {}
if metadata_action:
task["metadata"]["action"] = metadata_action
if metadata_cdata:
task["metadata"]["cdata"] = metadata_cdata
data = {
"clientKey": CAPSOLVER_API_KEY,
"task": task
}
response_data = requests.post(url, json=data).json()
print(response_data)
return response_data['taskId']
def solutionGet(taskId):
url = "https://api.capsolver.com/getTaskResult"
status = ""
while status != "ready":
data = {"clientKey": CAPSOLVER_API_KEY, "taskId": taskId}
response_data = requests.post(url, json=data).json()
print(response_data)
status = response_data.get('status', '')
print(status)
if status == "ready":
return response_data['solution']
time.sleep(2)
def main():
start_time = time.time()
taskId = solvecf()
solution = solutionGet(taskId)
if solution:
user_agent = solution['userAgent']
token = solution['token']
print("User_Agent:", user_agent)
print("Solved Turnstile Captcha, token:", token)
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Time to solve the captcha: {elapsed_time} seconds")
if __name__ == "__main__":
main()
Example for solve Cloudflare Challenge with NodeJS
const axios = require('axios');
const CAPSOLVER_API_KEY = "";
const PAGE_URL = "";
const WEBSITE_KEY = "";
async function solvecf(metadata_action = null, metadata_cdata = null) {
const url = "https://api.capsolver.com/createTask";
const task = {
type: "AntiTurnstileTaskProxyLess",
websiteURL: PAGE_URL,
websiteKey: WEBSITE_KEY,
};
if (metadata_action || metadata_cdata) {
task.metadata = {};
if (metadata_action) {
task.metadata.action = metadata_action;
}
if (metadata_cdata) {
task.metadata.cdata = metadata_cdata;
}
}
const data = {
clientKey: CAPSOLVER_API_KEY,
task: task
};
const response = await axios.post(url, data);
console.log(response.data);
return response.data.taskId;
}
async function solutionGet(taskId) {
const url = "https://api.capsolver.com/getTaskResult";
let status = "";
while (status !== "ready") {
const data = { clientKey: CAPSOLVER_API_KEY, taskId: taskId };
const response = await axios.post(url, data);
console.log(response.data);
status = response.data.status;
console.log(status);
if (status === "ready") {
return response.data.solution;
}
await new Promise(resolve => setTimeout(resolve, 2000));
}
}
async function main() {
const start_time = Date.now();
const taskId = await solvecf();
const solution = await solutionGet(taskId);
if (solution) {
const user_agent = solution.userAgent;
const token = solution.token;
console.log("User_Agent:", user_agent);
console.log("Solved Turnstile Captcha, token:", token);
}
const end_time = Date.now();
const elapsed_time = (end_time - start_time) / 1000;
console.log(`Time to solve the captcha: ${elapsed_time} seconds`);
}
main().catch(console.error);