How to Solve Datadome Anti Scraping in 2024

CapSolver Blogger

How to use capsolver

23-May-2024

Scraping data from websites protected by advanced technologies like DataDome is increasingly challenging. DataDome uses AI and machine learning to detect and block malicious web scrapers, employing techniques such as HTTP/2 communication, IP detection, mouse movement tracking, and browser fingerprinting. This article explores practical ways to handle these defenses and introduces CapSolver, an AI-based captcha-solving service that can quickly overcome DataDome's security measures.

Table of Contents

  1. What is DataDome
  2. Identifying Websites with DataDome Anti-Scraping
  3. Resolving DataDome Anti-Scraping
  4. Using CapSolver to Solve DataDome Challenges

1. What is DataDome

1.1 Understanding DataDome

DataDome serves as a comprehensive web security solution designed to combat a range of cyber threats, including DDoS attacks, SQL injections, XSS vulnerabilities, and fraud related to payments and credit cards. It also focuses on thwarting web scraping activities that attempt to extract data from websites.

The platform is widely deployed across sectors such as e-commerce, news media, music streaming, and real estate to safeguard websites from malicious activity. Encountering DataDome, or similar anti-bot technologies such as Cloudflare, is therefore common when web scraping. DataDome's security measures are sophisticated and difficult to defeat, so understanding how they work helps you get past them.

DataDome Protection Process:

1.2 DataDome Verification Types

  • Device Verification: When first accessing a website protected by DataDome, the SDK collects extensive client environment data to determine if the access is malicious.
  • Slider Verification: High-risk sites may add slider verification, where the SDK collects user behavior data in addition to the client environment data.

2. Identifying Websites with DataDome Anti-Scraping

To identify if a website is using DataDome for anti-scraping:

  • Open your browser's developer tools and inspect the network requests.
  • If the application returns a 403 status code and sets a cookie datadome=xxx, it indicates the presence of DataDome protection.
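The two checks above can be sketched as a small helper. This is only a heuristic under stated assumptions: the function name looks_like_datadome is hypothetical, and it tests nothing beyond the 403 status and the datadome cookie described in this section.

```python
import re
from typing import Mapping

# Matches "datadome=" at the start of a cookie, including when several
# Set-Cookie values have been joined into one comma-separated header string.
_DATADOME_COOKIE = re.compile(r"(?:^|[;,\s])datadome=", re.IGNORECASE)


def looks_like_datadome(status_code: int, headers: Mapping[str, str]) -> bool:
    """Return True when a response shows both signs listed above:
    a 403 status code and a Set-Cookie header that sets datadome=..."""
    set_cookie = " ".join(
        value for name, value in headers.items() if name.lower() == "set-cookie"
    )
    return status_code == 403 and bool(_DATADOME_COOKIE.search(set_cookie))
```

With any HTTP client that exposes a status code and a header mapping, the call would look like looks_like_datadome(resp.status_code, resp.headers).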

3. Resolving DataDome Anti-Scraping

3.1 HTTP Requests

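DataDome communicates over HTTP/2 and checks TLS fingerprints such as JA3. Python's requests library speaks only HTTP/1.1 and presents a TLS fingerprint that does not match a real browser, so it generally fails DataDome's TLS fingerprint validation. Consider libraries such as httpx (which supports HTTP/2), pycurl, or curl_cffi (which can impersonate browser TLS fingerprints) for making requests.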
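As a rough sketch of the curl_cffi option mentioned above, the snippet below sends a GET request with a browser-like TLS fingerprint. The function names are hypothetical, the third-party curl_cffi package must be installed separately, and the "chrome" impersonation target is an assumption: older curl_cffi releases only accept versioned names such as "chrome110". A matching TLS fingerprint alone is not guaranteed to pass DataDome's other checks.

```python
from typing import Dict, Optional


def build_headers(cookie_value: Optional[str] = None) -> Dict[str, str]:
    """Build extra request headers; the datadome cookie is added only if known."""
    headers = {"accept-language": "en-US,en;q=0.9"}
    if cookie_value:
        headers["cookie"] = "datadome=" + cookie_value
    return headers


def fetch_with_browser_tls(url: str, cookie_value: Optional[str] = None,
                           impersonate: str = "chrome"):
    """GET a URL through curl_cffi so the TLS handshake resembles a browser's."""
    # Imported lazily so this module still loads where curl_cffi is missing.
    from curl_cffi import requests as curl_requests

    return curl_requests.get(
        url,
        headers=build_headers(cookie_value),
        impersonate=impersonate,  # assumed target name; depends on the version
        timeout=30,
    )
```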

3.2 IP Detection

DataDome also checks the IP address of incoming requests. Be aware that IPs from certain regions might be blocked. Rapid and large-scale requests can also lead to IP bans. Using high-quality proxy IPs can bypass IP detection. Prefer residential proxies and those from the same region as the target website.
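One way to act on this advice is to rotate through a pool of proxies and pace requests with a random delay. The sketch below is illustrative only: the ProxyPool class, its delay defaults, and the "user:pass@host:port" proxy format are assumptions, not part of any DataDome or CapSolver API.

```python
import itertools
import random
from typing import Dict, Sequence


class ProxyPool:
    """Round-robin proxy rotation with a randomized pause between requests."""

    def __init__(self, proxies: Sequence[str],
                 min_delay: float = 1.0, max_delay: float = 3.0):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self.min_delay = min_delay
        self.max_delay = max_delay

    def next_proxy(self) -> Dict[str, str]:
        """Return the next proxy as the mapping that 'requests' expects."""
        px = next(self._cycle)
        return {"http": f"http://{px}", "https": f"http://{px}"}

    def next_delay(self) -> float:
        """Pick a pause in seconds so requests are not sent at a fixed rate."""
        return random.uniform(self.min_delay, self.max_delay)
```

A caller would sleep for pool.next_delay() seconds before each request and pass pool.next_proxy() as the proxies argument.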

3.3 Mouse Movements

When DataDome presents a slider challenge, the slider records the user's mouse movement track, and the server uses machine learning to analyze the track and distinguish humans from bots.

Use Bezier curves to simulate a human sliding track. Bezier curves are smooth curves defined by a series of control points and are commonly used in computer graphics and design. Depending on the position and number of control points, they can describe anything from a simple straight line to a complex curve.
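A minimal sketch of the idea, assuming a horizontal slider and a cubic Bezier curve with two randomly placed control points. The function names, the jitter range, and the smoothstep easing are illustrative choices, and nothing here is known to satisfy DataDome's behavioral model.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def bezier_point(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    x = u**3 * p0[0] + 3 * u**2 * t * p1[0] + 3 * u * t**2 * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u**2 * t * p1[1] + 3 * u * t**2 * p2[1] + t**3 * p3[1]
    return (x, y)


def slider_track(distance: float, steps: int = 30, jitter: float = 5.0,
                 rng: Optional[random.Random] = None) -> List[Point]:
    """Generate (x, y) offsets for dragging a slider `distance` pixels right."""
    rng = rng or random.Random()
    start, end = (0.0, 0.0), (float(distance), 0.0)
    # Random control points add a slight vertical wobble to the path.
    c1 = (distance * rng.uniform(0.2, 0.4), rng.uniform(-jitter, jitter))
    c2 = (distance * rng.uniform(0.6, 0.8), rng.uniform(-jitter, jitter))
    points = []
    for i in range(steps + 1):
        t = i / steps
        eased = t * t * (3 - 2 * t)  # smoothstep: slow start, fast middle, slow end
        points.append(bezier_point(start, c1, c2, end, eased))
    return points
```

The returned offsets could then be replayed as mouse-move events by whatever browser automation tool is in use, ideally with uneven time intervals between points.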

3.4 Browser Fingerprints and Code Obfuscation

DataDome encrypts important request data and browser fingerprints in its requests, and its JavaScript is heavily obfuscated with techniques such as string concealing, control flow obfuscation, and function obfuscation.

If you need to reverse engineer it, you can use AST-based deobfuscation to partially restore the code and dynamic debugging to locate the encryption points and the original data. DataDome performs hundreds of fingerprint checks, so a full reverse-engineering effort is time-consuming and labor-intensive.
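As a much simpler stand-in for full AST work, the sketch below undoes one common form of string concealing: \xNN hex escapes inside JavaScript source. It is a regex preprocessing step, not AST deobfuscation, and the function name is hypothetical. Whether DataDome's script uses this particular encoding is an assumption; control flow and function obfuscation need a real JavaScript parser.

```python
import re

_HEX_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")


def decode_hex_escapes(js_source: str) -> str:
    """Replace \\xNN escapes with the characters they encode.

    Limitation: a literal backslash followed by 'x41' (written \\\\x41 in
    JavaScript) is also rewritten, so review the output before relying on it.
    """
    def replace(match):
        ch = chr(int(match.group(1), 16))
        # Keep quotes, backslashes, and control characters escaped so the
        # surrounding string literal stays syntactically valid.
        if ch in "\"'\\`" or not ch.isprintable():
            return match.group(0)
        return ch

    return _HEX_ESCAPE.sub(replace, js_source)
```

For example, the JavaScript literal "\x64\x61\x74\x61" is rewritten to "data", which makes property names and keys searchable again before moving on to heavier tooling.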

3.5 Headless Browser Detection

DataDome detects headless browsers and automation tools such as Headless Chrome, Puppeteer, and Selenium, flagging them as abnormal environments. If you need headless browser functionality, use a stealth extension such as puppeteer-extra-plugin-stealth.

4. Using CapSolver to Solve DataDome Challenges

CapSolver is a machine learning-based captcha recognition solution that can solve DataDome challenges within 3-5 seconds!

The following Python sample code uses CapSolver to solve a DataDome captcha:

# -*- coding: utf-8 -*-
import requests
import time


api_key = "your api key"
page_url = "your website url"
proxy = "host:port:name:password"  # format: host:port:username:password

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"


def get_token(captcha_url):
    print("call capsolver...")
    data = {
        "clientKey": api_key,
        "task": {
            "type": 'DatadomeSliderTask',
            "websiteURL": page_url,
            "captchaUrl": captcha_url,
            "userAgent": user_agent,
            "proxy": proxy,
        },
    }
    uri = 'https://api.capsolver.com/createTask'
    res = requests.post(uri, json=data)
    resp = res.json()
    task_id = resp.get('taskId')
    if not task_id:
        print("create task error:", res.text)
        return

    while True:
        time.sleep(1)
        data = {
            "taskId": task_id
        }
        res = requests.post('https://api.capsolver.com/getTaskResult', json=data)
        # print(res.text)
        resp = res.json()
        status = resp.get('status', '')
        if status == "ready":
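            cookie = resp['solution']['cookie']
            # Keep only the value of the first "name=value" pair; maxsplit=1
            # preserves any '=' characters inside the cookie value itself.
            cookie = cookie.split(';')[0].split('=', 1)[1]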
            print("successfully got cookie:", cookie)
            return cookie
        if status == "failed" or resp.get("errorId"):
            print("failed to get cookie:", res.text)
            return
        print('solve datadome status:', status)



def format_proxy(px: str):
    if '@' not in px:
        sp = px.split(':')
        if len(sp) == 4:
            px = f'{sp[2]}:{sp[3]}@{sp[0]}:{sp[1]}'
    return {"http": f"http://{px}", "https": f"http://{px}"}


def request_site(cookie):
    headers = {
        'content-type': 'application/json',
        'user-agent': user_agent,
        'accept': 'application/json, text/plain, */*',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': page_url,
        'accept-encoding': 'gzip, deflate, br, zstd',
        'accept-language': 'en-US,en;q=0.9',
    }
    if cookie:
        headers['cookie'] = "datadome=" + cookie

    print("request url:", page_url)
    # This example assumes the target site does not enforce TLS fingerprinting,
    # so 'requests' is sufficient here
    response = requests.get(
        page_url, headers=headers, proxies=format_proxy(proxy), allow_redirects=True
    )
    print("response status_code:", response.status_code)
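    if response.status_code == 403:
        # This script expects the 403 body to be JSON with a 'url' field
        # pointing at the captcha page; any other body is reported and skipped.
        try:
            captcha_url = response.json()['url']
        except (ValueError, KeyError):
            print("unexpected 403 body:", response.text[:200])
            return
        print("captcha url: ", captcha_url)
        return captcha_url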
    else:
        # print("response:", response.text)
        print('cookie is good!')
        return


def main():
    url = request_site("")
    if not url:
        return
    if 't=bv' in url:
        print("blocked captcha url is not supported")
        return
    cookie = get_token(url)
    if not cookie:
        return
    request_site(cookie)


if __name__ == '__main__':
    main()
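The polling loop in get_token above runs until the API reports "ready" or "failed", so a stalled task would block forever. A bounded variant might look like the sketch below. The poll_task helper, its timeout and interval defaults, and the injectable sleep and clock parameters are assumptions added for illustration; the status strings and the errorId check mirror the sample code.

```python
import time
from typing import Callable, Optional


def poll_task(fetch_result: Callable[[], dict], timeout: float = 60.0,
              interval: float = 1.0,
              sleep: Callable[[float], None] = time.sleep,
              clock: Callable[[], float] = time.monotonic) -> Optional[dict]:
    """Call fetch_result() until it reports 'ready', fails, or time runs out.

    Returns the 'solution' dict on success and None on failure or timeout.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        resp = fetch_result()
        status = resp.get("status", "")
        if status == "ready":
            return resp.get("solution")
        if status == "failed" or resp.get("errorId"):
            return None
        sleep(interval)  # wait before asking again
    return None
```

Inside get_token, fetch_result could be a lambda that posts {"taskId": task_id} to the getTaskResult endpoint and returns res.json(), which keeps the network call separate from the retry logic.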

Execution results:

Conclusion

This guide outlines methods for overcoming DataDome's anti-scraping measures. Whether the obstacle is HTTP/2 communication, IP detection, mouse movement analysis, or browser fingerprinting, CapSolver offers a quick and effective way through these challenges, so your data collection needs are met efficiently.