
How to Solve DataDome Anti-Scraping in 2024


CapSolver Blogger

How to use CapSolver

23-May-2024

Scraping data from websites protected by advanced technologies like DataDome is increasingly challenging. DataDome uses AI and machine learning to detect and block web scrapers, relying on signals such as HTTP/2 communication, IP reputation, mouse movement tracking, and browser fingerprinting. This article explores practical ways to deal with these defenses and introduces CapSolver, an AI-based captcha-solving service that can quickly overcome DataDome's security measures.

Table of Contents

  1. What is DataDome
  2. Identifying Websites with DataDome Anti-Scraping
  3. Resolving DataDome Anti-Scraping
  4. Using CapSolver to Solve DataDome Challenges

1. What is DataDome

1.1 Understanding DataDome

DataDome serves as a comprehensive web security solution designed to combat a range of cyber threats, including DDoS attacks, SQL injections, XSS vulnerabilities, and fraud related to payments and credit cards. It also focuses on thwarting web scraping activities that attempt to extract data from websites.

This platform is widely deployed across sectors such as e-commerce, news media, music streaming, and real estate, safeguarding websites from malicious activity. Encountering DataDome and similar anti-bot technologies such as Cloudflare is therefore common when web scraping. DataDome's security measures are sophisticated and hard to defeat, so understanding how they work is the first step toward bypassing them.

DataDome Protection Process:

1.2 DataDome Verification Types

  • Device Verification: When first accessing a website protected by DataDome, the SDK collects extensive client environment data to determine if the access is malicious.
  • Slider Verification: High-risk sites may implement slider verification, where the SDK collects extensive client environment data and user behavior.

2. Identifying Websites with DataDome Anti-Scraping

To identify if a website is using DataDome for anti-scraping:

  • Open your browser's developer tools and inspect the network requests.
  • If the application returns a 403 status code and sets a cookie datadome=xxx, it indicates the presence of DataDome protection.
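
For example, a quick check in Python might look like this (a minimal sketch; the target URL is a placeholder):

# A minimal sketch for detecting DataDome protection (the URL is a placeholder).
import requests

response = requests.get(
    "https://www.example-protected-site.com",
    headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
# A 403 plus a datadome cookie strongly suggests DataDome is in front of the site.
if response.status_code == 403 and "datadome" in response.cookies:
    print("DataDome protection detected")
else:
    print("No DataDome cookie found")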

3. Resolving DataDome Anti-Scraping

3.1 HTTP Requests

DataDome uses HTTP/2 for communication and checks TLS fingerprints such as JA3. Python's requests library speaks HTTP/1.1 and exposes a non-browser TLS fingerprint, so it cannot pass DataDome's TLS fingerprint validation. Consider libraries such as httpx (with HTTP/2 enabled), pycurl, or curl_cffi instead, as in the sketch below.
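
For instance, curl_cffi can impersonate a real browser's TLS handshake (a minimal sketch; the URL is a placeholder and the available impersonation profiles depend on your curl_cffi version):

# A minimal sketch using curl_cffi to present a Chrome-like TLS/JA3 fingerprint.
from curl_cffi import requests as cffi_requests

response = cffi_requests.get(
    "https://www.example-protected-site.com",
    impersonate="chrome110",  # profile name varies by curl_cffi version
)
print(response.status_code)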

3.2 IP Detection

DataDome also checks the IP address of incoming requests. Be aware that IPs from certain regions might be blocked. Rapid and large-scale requests can also lead to IP bans. Using high-quality proxy IPs can bypass IP detection. Prefer residential proxies and those from the same region as the target website.
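
A minimal sketch of routing traffic through a residential proxy with requests (the proxy endpoint and credentials are placeholders):

import requests

# Placeholder residential proxy credentials.
proxy_url = "http://username:password@residential-proxy.example.com:8000"
proxies = {"http": proxy_url, "https": proxy_url}

response = requests.get(
    "https://www.example-protected-site.com",
    proxies=proxies,
)
print(response.status_code)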

3.3 Mouse Movements

When DataDome presents slider verification, the slider captures the user's mouse movement trajectory, and the server uses machine learning to analyze that trajectory and distinguish humans from bots.

You can use Bezier curves to simulate a human-like sliding trajectory. A Bezier curve is a smooth curve defined by a set of control points and is widely used in computer graphics and design; depending on the number and position of the control points, it can describe anything from a straight line to a highly complex curve, as in the sketch below.
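
A minimal sketch of generating a slider trajectory from a cubic Bezier curve (the control points are arbitrary illustrative values):

import random

def cubic_bezier(p0, p1, p2, p3, steps=50):
    # Evaluate the cubic Bezier polynomial at evenly spaced values of t.
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = ((1 - t) ** 3 * p0[0] + 3 * (1 - t) ** 2 * t * p1[0]
             + 3 * (1 - t) * t ** 2 * p2[0] + t ** 3 * p3[0])
        y = ((1 - t) ** 3 * p0[1] + 3 * (1 - t) ** 2 * t * p1[1]
             + 3 * (1 - t) * t ** 2 * p2[1] + t ** 3 * p3[1])
        points.append((round(x, 2), round(y, 2)))
    return points

# Start at the handle, end at the gap; jittered middle points add human-like curvature.
track = cubic_bezier(
    (0, 0),
    (80, random.uniform(-8, 8)),
    (180, random.uniform(-8, 8)),
    (260, 0),
)
print(track[:5])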

3.4 Browser Fingerprints and Code Obfuscation

DataDome encrypts important request data and browser fingerprints, and its JavaScript is heavily obfuscated with techniques such as string concealing, control-flow obfuscation, and function obfuscation.

If you want to reverse engineer it, AST-based deobfuscation can partially restore the code, and dynamic debugging can locate the encryption points and the original data (see the sketch below). Bear in mind that DataDome performs hundreds of fingerprint checks, so a full reverse-engineering effort is time-consuming and labor-intensive.
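
Real DataDome scripts would be deobfuscated with a JavaScript AST toolchain such as Babel; the sketch below only illustrates the core idea, folding concealed string constants back together, using Python's own ast module:

import ast

class FoldStrings(ast.NodeTransformer):
    def visit_BinOp(self, node):
        # Fold children first so chained concatenations collapse too.
        self.generic_visit(node)
        if (isinstance(node.op, ast.Add)
                and isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and isinstance(node.left.value, str)
                and isinstance(node.right.value, str)):
            return ast.Constant(node.left.value + node.right.value)
        return node

# "String concealing" splits literals into fragments; constant folding restores them.
tree = ast.parse('key = "dat" + "a" + "dome"')
print(ast.unparse(FoldStrings().visit(tree)))  # key = 'datadome' (Python 3.9+)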

3.5 Headless Browser Detection

DataDome detects headless browsers such as Headless Chrome driven by Puppeteer or Selenium and flags them as abnormal environments. If you need headless automation, use stealth plugins such as puppeteer-extra-plugin-stealth, or a Python equivalent as sketched below.
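
On the Python side, undetected-chromedriver is a comparable option (a minimal sketch; the URL is a placeholder and the exact constructor options vary across versions):

# A minimal sketch using undetected-chromedriver to reduce automation fingerprints.
import undetected_chromedriver as uc

driver = uc.Chrome(headless=True)  # patches common webdriver tell-tales
driver.get("https://www.example-protected-site.com")
print(driver.title)
driver.quit()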

4. Using CapSolver to Solve DataDome Challenges

CapSolver is a machine learning-based captcha recognition solution that can solve DataDome challenges within 3-5 seconds!

To solve the captcha with CapSolver, use the following Python sample code:

# -*- coding: utf-8 -*-
import requests
import time


api_key = "your api key"
page_url = "your website url"
proxy = "host:port:username:password"  # or username:password@host:port

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"


def get_token(captcha_url):
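    # Create a DatadomeSliderTask via the CapSolver API, then poll
    # getTaskResult until the solved datadome cookie is ready.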
    print("call capsolver...")
    data = {
        "clientKey": api_key,
        "task": {
            "type": 'DatadomeSliderTask',
            "websiteURL": page_url,
            "captchaUrl": captcha_url,
            "userAgent": user_agent,
            "proxy": proxy,
        },
    }
    uri = 'https://api.capsolver.com/createTask'
    res = requests.post(uri, json=data)
    resp = res.json()
    task_id = resp.get('taskId')
    if not task_id:
        print("create task error:", res.text)
        return

    while True:
        time.sleep(1)
        data = {
            "taskId": task_id
        }
        res = requests.post('https://api.capsolver.com/getTaskResult', json=data)
        # print(res.text)
        resp = res.json()
        status = resp.get('status', '')
        if status == "ready":
            cookie = resp['solution']['cookie']
            cookie = cookie.split(';')[0].split('=')[1]
            print("successfully got cookie:", cookie)
            return cookie
        if status == "failed" or resp.get("errorId"):
            print("failed to get cookie:", res.text)
            return
        print('solve datadome status:', status)



def format_proxy(px: str):
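    # Convert "host:port:username:password" into requests' proxy mapping,
    # passing through values already in username:password@host:port form.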
    if '@' not in px:
        sp = px.split(':')
        if len(sp) == 4:
            px = f'{sp[2]}:{sp[3]}@{sp[0]}:{sp[1]}'
    return {"http": f"http://{px}", "https": f"http://{px}"}


def request_site(cookie):
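    # Request the target page; a 403 means DataDome blocked us, and its JSON
    # body carries the captcha URL that CapSolver needs to solve.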
    headers = {
        'content-type': 'application/json',
        'user-agent': user_agent,
        'accept': 'application/json, text/plain, */*',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': page_url,
        'accept-encoding': 'gzip, deflate, br, zstd',
        'accept-language': 'en-US,en;q=0.9',
    }
    if cookie:
        headers['cookie'] = "datadome=" + cookie

    print("request url:", page_url)
    # The primary site does not have TLS fingerprinting, and 'requests' can be used
    response = requests.get(
        page_url, headers=headers, proxies=format_proxy(proxy), allow_redirects=True
    )
    print("response status_code:", response.status_code)
    if response.status_code == 403:
        resp = response.json()
        print("captcha url: ", resp['url'])
        return resp['url']
    else:
        # print("response:", response.text)
        print('cookie is good!')
        return


def main():
    url = request_site("")
    if not url:
        return
    if 't=bv' in url:
        print("blocked captcha url is not supported")
        return
    cookie = get_token(url)
    if not cookie:
        return
    request_site(cookie)


if __name__ == '__main__':
    main()

Execution results:

Conclusion

This guide has outlined practical methods for overcoming DataDome's anti-scraping measures. Whether the obstacle is HTTP/2 communication, IP detection, mouse movement analysis, or browser fingerprinting, CapSolver offers a quick and effective solution, ensuring your data collection needs are met efficiently.
