How to Use ScrapeGraph AI for Web Scraping

Lucas Mitchell

Automation Engineer

04-Sep-2024

What is ScrapeGraph AI?

ScrapeGraph AI is a Python web scraping library that uses LLMs and graph-based logic to build scraping pipelines for websites and local documents (including XML, HTML, JSON, Markdown, and more). You describe the data you want to extract in a prompt, and the library handles fetching, parsing, and extraction.

The library provides several features:

Support for many LLM providers: GPT (OpenAI), Gemini, Groq, Azure, and Hugging Face.
Local models through Ollama.
Proxy support for routing requests through a proxy server.
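
As a rough illustration of the proxy feature, the sketch below splits a proxy URL into the server, username, and password fields that Playwright-style proxy settings use. The helper name proxy_url_to_settings is hypothetical, and placing the result under "loader_kwargs" is an assumption based on ScrapeGraph AI's documentation, so check it against the version you have installed.

```python
from urllib.parse import unquote, urlsplit


def proxy_url_to_settings(proxy_url: str) -> dict:
    """Split 'scheme://user:pass@host:port' into Playwright-style proxy settings."""
    parts = urlsplit(proxy_url)
    if not parts.hostname or parts.port is None:
        raise ValueError(f"proxy URL needs a host and a port: {proxy_url!r}")
    settings = {"server": f"{parts.scheme}://{parts.hostname}:{parts.port}"}
    # Credentials are optional; percent-encoded characters are decoded.
    if parts.username:
        settings["username"] = unquote(parts.username)
    if parts.password:
        settings["password"] = unquote(parts.password)
    return settings


# Assumption: proxy settings are passed to the page loader via "loader_kwargs".
graph_config = {
    "llm": {"api_key": "YOUR_OPENAI_APIKEY", "model": "openai/gpt-4o-mini"},
    "loader_kwargs": {
        "proxy": proxy_url_to_settings("http://username:password@127.0.0.1:8080"),
    },
}
print(graph_config["loader_kwargs"]["proxy"])
```

The resulting graph_config can be passed to a scraping graph in the same way as the configurations used in the rest of this article.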

Prerequisites

Before you dive into using ScrapeGraph AI, ensure you have the following installed:

pip install scrapegraphai capsolver

playwright install
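
To confirm that the installation worked before running any scraper, a quick check along these lines may help. It is a minimal sketch that assumes the import names match the pip package names above. It only checks that the Python packages can be found, not that the browser binaries downloaded by playwright install are present.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: import names are the same as the pip package names.
required = ["scrapegraphai", "capsolver", "playwright"]
missing = missing_modules(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```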

Getting Started with ScrapeGraph AI

Here's a basic example of how to use ScrapeGraph AI with OpenAI to scrape a webpage:

import json
from scrapegraphai.graphs import SmartScraperGraph

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_APIKEY",
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the quotes with their description",
    source="https://quotes.toscrape.com/",
    config=graph_config
)

# Run the pipeline
result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))
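
The example above hardcodes the API key for brevity. A variation that reads it from an environment variable might look like the sketch below. The helper build_openai_config and the variable name OPENAI_API_KEY are illustrative choices, not part of ScrapeGraph AI.

```python
import os


def build_openai_config(env=os.environ, model="openai/gpt-4o-mini"):
    """Build a graph_config dict, reading the API key from OPENAI_API_KEY."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set OPENAI_API_KEY before running the scraper.")
    return {
        "llm": {"api_key": api_key, "model": model},
        "verbose": True,
        "headless": False,  # same setting as the example above
    }


# Demonstration with an explicit mapping instead of the real environment.
config = build_openai_config({"OPENAI_API_KEY": "sk-example"})
print(config["llm"]["model"])
```

In a real script you would call build_openai_config() with no arguments and pass the result as config= to SmartScraperGraph.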

Here's the same example using a local LLM served by Ollama:

import json
from scrapegraphai.graphs import SmartScraperGraph

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "model": "ollama/llama3.1",
        "temperature": 0,
        "format": "json",  # Ollama needs the output format set explicitly
        # "base_url": "http://localhost:11434",  # optional: override the Ollama server URL
    },
    "verbose": True,
    "headless": False,
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the quotes with their description",
    source="https://quotes.toscrape.com/",
    config=graph_config
)

# Run the pipeline
result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))
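
The Ollama example only works if an Ollama server is running and the model has already been pulled (for example with ollama pull llama3.1). The sketch below is one way to check that something answers at the server URL before starting the pipeline. The function name ollama_is_reachable is hypothetical, and the default URL is the one shown in the commented base_url line above.

```python
import urllib.error
import urllib.request


def ollama_is_reachable(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers with 200 at base_url, else False."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if not ollama_is_reachable():
    print("No Ollama server found; start it before running the scraper.")
```

This confirms only that a server responds at that address, not that the llama3.1 model is available on it.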

Handling Captchas with CapSolver and ScrapeGraph AI

In this section, we'll explore how to integrate CapSolver with ScrapeGraph AI to get past captchas. CapSolver is an external service that solves various types of captchas, including reCAPTCHA v2, which is commonly used on websites.

We will demonstrate solving a reCAPTCHA v2 challenge with CapSolver and then scraping the content of a page that requires solving the captcha first.

Bonus Code

Claim your CapSolver bonus code: scrape. After redeeming it, you get an extra 5% bonus on every recharge, with no limit.

Example: Solving reCAPTCHA v2 with CapSolver and ScrapeGraph AI

import capsolver
import os
import json
from scrapegraphai.graphs import SmartScraperGraph

# Consider using environment variables for sensitive information
PROXY = os.getenv("PROXY", "http://username:password@host:port")
capsolver.api_key = os.getenv("CAPSOLVER_API_KEY", "Your Capsolver API Key")
PAGE_URL = os.getenv("PAGE_URL", "PAGE_URL")
PAGE_KEY = os.getenv("PAGE_SITE_KEY", "PAGE_SITE_KEY")

def solve_recaptcha_v2(url, key):
    solution = capsolver.solve({
        "type": "ReCaptchaV2Task",
        "websiteURL": url,
        "websiteKey": key,
        "proxy": PROXY
    })
    return solution['solution']['gRecaptchaResponse']

def main():
    print("Solving reCaptcha v2")
    solution = solve_recaptcha_v2(PAGE_URL, PAGE_KEY)
    print("Solution: ", solution)
    # Note: this example only obtains the token. Submitting it to the
    # target page is site-specific and is not shown here.

    # Define the configuration for the scraping pipeline
    graph_config = {
        "llm": {
            "api_key": "YOUR_OPENAI_APIKEY",
            "model": "openai/gpt-4o-mini",
        },
        "verbose": True,
        "headless": False,
    }

    # Create the SmartScraperGraph instance
    smart_scraper_graph = SmartScraperGraph(
        prompt="Find the description of each quote.",
        source="https://quotes.toscrape.com/",
        config=graph_config
    )

    # Run the pipeline
    result = smart_scraper_graph.run()
    print(json.dumps(result, indent=4))


if __name__ == "__main__":
    main()
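
Captcha solving can fail transiently, for instance because of a network error or a bad proxy, so it may be worth wrapping the solver call in a retry loop. The sketch below is one possible shape. solve_with_retries is a hypothetical helper, and flaky_solver is a stand-in that simulates failures so the block runs without an API key.

```python
import time


def solve_with_retries(solve, attempts=3, delay_seconds=2.0, sleep=time.sleep):
    """Call solve() until it returns a non-empty token or attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            token = solve()
            if token:
                return token
            last_error = RuntimeError("solver returned an empty token")
        except Exception as exc:  # broad on purpose: solver error types are not assumed
            last_error = exc
        if attempt < attempts:
            sleep(delay_seconds)
    raise RuntimeError(f"captcha not solved after {attempts} attempts") from last_error


# Stand-in solver that fails twice and then succeeds.
calls = {"count": 0}


def flaky_solver():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "example-token"


print(solve_with_retries(flaky_solver, delay_seconds=0))
```

With the example above, the call would become solve_with_retries(lambda: solve_recaptcha_v2(PAGE_URL, PAGE_KEY)).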

Conclusion

With ScrapeGraph AI, you can scrape websites by describing the data you want, and route requests through proxies when needed. Combining it with CapSolver lets you solve reCAPTCHA v2 challenges from the same script, giving you access to content that would otherwise be difficult to scrape.

Feel free to extend this script to suit your scraping needs and experiment with additional features offered by ScrapeGraph AI. Always ensure that your scraping activities respect website terms of service and legal guidelines.

Happy scraping!

Compliance Disclaimer: The information provided on this blog is for informational purposes only. CapSolver is committed to compliance with all applicable laws and regulations. The use of the CapSolver network for illegal, fraudulent, or abusive activities is strictly prohibited and will be investigated. Our captcha-solving solutions enhance user experience while ensuring 100% compliance in helping solve captcha difficulties during public data crawling. We encourage responsible use of our services. For more information, please visit our Terms of Service and Privacy Policy.
