Web Scraping with SeleniumBase and Python in 2024
Lucas Mitchell
Automation Engineer
05-Nov-2024
Web scraping is a powerful tool for data extraction, market research, and automation. However, CAPTCHAs can hinder automated scraping efforts. In this guide, we'll explore how to use SeleniumBase for web scraping and integrate CapSolver to solve CAPTCHAs efficiently, using quotes.toscrape.com as our example website.
Introduction to SeleniumBase
SeleniumBase is a Python framework that simplifies web automation and testing. It extends Selenium WebDriver's capabilities with a more user-friendly API, advanced selectors, automatic waits, and additional testing tools.
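For instance, SeleniumBase's automatic waiting means you rarely need explicit sleeps: action methods wait for the page and the target element before proceeding. Here is a minimal sketch of that API, using the same demo site as the rest of this guide:

```python
# seleniumbase_intro.py -- a minimal sketch of the SeleniumBase API
from seleniumbase import BaseCase

class IntroExample(BaseCase):
    def test_intro(self):
        # open() navigates to the URL and waits for the page to load
        self.open("https://quotes.toscrape.com/")
        # assert_element() waits for the element before asserting its presence
        self.assert_element("div.quote")
        # get_text() also waits automatically before reading the text
        print(self.get_text("small.author"))

if __name__ == "__main__":
    BaseCase.main(__name__, __file__)
```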
Setting Up SeleniumBase
Before we begin, ensure you have Python 3 installed on your system. Follow these steps to set up SeleniumBase:
1. Install SeleniumBase:

```bash
pip install seleniumbase
```

2. Verify the Installation:

```bash
sbase --help
```
Basic Scraper with SeleniumBase
Let's start by creating a simple script that navigates to quotes.toscrape.com and extracts quotes and authors.
Example: Scrape quotes and their authors from the homepage.
```python
# scrape_quotes.py
from selenium.webdriver.common.by import By
from seleniumbase import BaseCase

class QuotesScraper(BaseCase):
    def test_scrape_quotes(self):
        self.open("https://quotes.toscrape.com/")
        quotes = self.find_elements("div.quote")
        for quote in quotes:
            text = quote.find_element(By.CSS_SELECTOR, "span.text").text
            author = quote.find_element(By.CSS_SELECTOR, "small.author").text
            print(f"\"{text}\" - {author}")

if __name__ == "__main__":
    BaseCase.main(__name__, __file__)
```
Run the script:

```bash
python scrape_quotes.py
```

Output:

```
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking." - Albert Einstein
...
```
More Advanced Web Scraping Examples
To enhance your web scraping skills, let's explore more advanced examples using SeleniumBase.
Scraping Multiple Pages (Pagination)
Many websites display content across multiple pages. Let's modify our script to navigate through all pages and scrape quotes.
```python
# scrape_quotes_pagination.py
from selenium.webdriver.common.by import By
from seleniumbase import BaseCase

class QuotesPaginationScraper(BaseCase):
    def test_scrape_all_quotes(self):
        self.open("https://quotes.toscrape.com/")
        while True:
            quotes = self.find_elements("div.quote")
            for quote in quotes:
                text = quote.find_element(By.CSS_SELECTOR, "span.text").text
                author = quote.find_element(By.CSS_SELECTOR, "small.author").text
                print(f"\"{text}\" - {author}")
            # Check if there is a next page
            if self.is_element_visible('li.next > a'):
                self.click('li.next > a')
            else:
                break

if __name__ == "__main__":
    BaseCase.main(__name__, __file__)
```
Explanation:

- We loop through pages, scraping each one before checking whether the "Next" button is available.
- We use `is_element_visible` to check for the "Next" button.
- We click the "Next" button to navigate to the next page, and break out of the loop once it no longer appears (see the sketch below for persisting the results instead of printing them).
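In practice you will usually want to persist the scraped data rather than print it. Here is a minimal sketch that collects all quotes into a list and writes them to disk (the `quotes.json` file name and the list-of-dicts layout are our own choices for illustration):

```python
# scrape_quotes_to_json.py -- a sketch building on the pagination example
import json

from selenium.webdriver.common.by import By
from seleniumbase import BaseCase

class QuotesToJson(BaseCase):
    def test_scrape_to_json(self):
        self.open("https://quotes.toscrape.com/")
        results = []
        while True:
            for quote in self.find_elements("div.quote"):
                results.append({
                    "text": quote.find_element(By.CSS_SELECTOR, "span.text").text,
                    "author": quote.find_element(By.CSS_SELECTOR, "small.author").text,
                })
            if self.is_element_visible("li.next > a"):
                self.click("li.next > a")
            else:
                break
        # Write everything collected to a JSON file
        with open("quotes.json", "w", encoding="utf-8") as f:
            json.dump(results, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    BaseCase.main(__name__, __file__)
```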
Handling Dynamic Content with AJAX
Some websites load content dynamically using AJAX. SeleniumBase can handle such scenarios by waiting for elements to load.
Example: Scrape tags from the website, which load dynamically.
```python
# scrape_dynamic_content.py
from seleniumbase import BaseCase

class TagsScraper(BaseCase):
    def test_scrape_tags(self):
        self.open("https://quotes.toscrape.com/")
        # Click on the 'Top Ten tags' link to load tags dynamically
        self.click('a[href="/tag/"]')
        self.wait_for_element("div.tags-box")
        tags = self.find_elements("span.tag-item > a")
        for tag in tags:
            tag_name = tag.text
            print(f"Tag: {tag_name}")

if __name__ == "__main__":
    BaseCase.main(__name__, __file__)
```
Explanation:

- We wait for the `div.tags-box` element to ensure the dynamic content has loaded.
- `wait_for_element` blocks until the element is available (it also accepts an explicit `timeout` in seconds), so the script doesn't proceed before the content is ready.
Submitting Forms and Logging In
Sometimes, you need to log in to a website before scraping content. Here's how you can handle form submission.
Example: Log in to the website and scrape quotes from the authenticated user page.
```python
# scrape_with_login.py
from selenium.webdriver.common.by import By
from seleniumbase import BaseCase

class LoginScraper(BaseCase):
    def test_login_and_scrape(self):
        self.open("https://quotes.toscrape.com/login")
        # Fill in the login form
        self.type("input#username", "testuser")
        self.type("input#password", "testpass")
        self.click("input[type='submit']")
        # Verify login by checking for a logout link
        if self.is_element_visible('a[href="/logout"]'):
            print("Logged in successfully!")
            # Now scrape the quotes
            self.open("https://quotes.toscrape.com/")
            quotes = self.find_elements("div.quote")
            for quote in quotes:
                text = quote.find_element(By.CSS_SELECTOR, "span.text").text
                author = quote.find_element(By.CSS_SELECTOR, "small.author").text
                print(f"\"{text}\" - {author}")
        else:
            print("Login failed.")

if __name__ == "__main__":
    BaseCase.main(__name__, __file__)
```
Explanation:

- We navigate to the login page and fill in the credentials.
- After submitting the form, we verify the login by checking for the presence of a logout link.
- Then we proceed to scrape the content available to logged-in users.

Note: Since quotes.toscrape.com accepts any username and password for demonstration purposes, dummy credentials work here.
Extracting Data from Tables
Websites often present data in tables. Here's how to extract table data.
Example: Scrape data from a table (hypothetical example as the website doesn't have tables).
```python
# scrape_table.py
from selenium.webdriver.common.by import By
from seleniumbase import BaseCase

class TableScraper(BaseCase):
    def test_scrape_table(self):
        self.open("https://www.example.com/table-page")
        # Wait for the table to load
        self.wait_for_element("table#data-table")
        rows = self.find_elements("table#data-table > tbody > tr")
        for row in rows:
            cells = row.find_elements(By.TAG_NAME, "td")
            row_data = [cell.text for cell in cells]
            print(row_data)

if __name__ == "__main__":
    BaseCase.main(__name__, __file__)
```
Explanation:

- We locate the table by its ID or class.
- We iterate over each row, then over each cell, to extract the data.
- Since quotes.toscrape.com doesn't have tables, replace the URL with a real page that contains one (a CSV-export sketch follows this list).
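Printed rows are rarely the end goal; writing them to a CSV file preserves the tabular structure. Here is a minimal sketch, assuming the same hypothetical URL and `table#data-table` markup as above (the `table.csv` output name and the `<thead>` header assumption are ours):

```python
# scrape_table_to_csv.py -- a sketch; the URL and selectors are hypothetical
import csv

from selenium.webdriver.common.by import By
from seleniumbase import BaseCase

class TableToCsv(BaseCase):
    def test_scrape_table_to_csv(self):
        self.open("https://www.example.com/table-page")
        self.wait_for_element("table#data-table")
        with open("table.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            # Header row, assuming the table has a <thead> with <th> cells
            header = [th.text for th in self.find_elements("table#data-table thead th")]
            if header:
                writer.writerow(header)
            for row in self.find_elements("table#data-table > tbody > tr"):
                writer.writerow([td.text for td in row.find_elements(By.TAG_NAME, "td")])

if __name__ == "__main__":
    BaseCase.main(__name__, __file__)
```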
Integrating CapSolver into SeleniumBase
While quotes.toscrape.com does not have CAPTCHAs, many real-world websites do. To prepare for such cases, we'll demonstrate how to integrate CapSolver into our SeleniumBase script using the CapSolver browser extension.
How to Solve CAPTCHAs with SeleniumBase Using CapSolver
1. Download the CapSolver Extension:
   - Visit the CapSolver GitHub releases page.
   - Download the latest version of the CapSolver browser extension.
   - Unzip the extension into a directory at the root of your project, e.g., `./capsolver_extension`.
Configuring the CapSolver Extension

1. Locate the Configuration File:
   - Find the `config.json` file in the `capsolver_extension/assets` directory.

2. Update the Configuration:
   - Set `enabledForcaptcha` and/or `enabledForRecaptchaV2` to `true`, depending on the CAPTCHA types you want to solve.
   - Set `captchaMode` or `reCaptchaV2Mode` to `"token"` for automatic solving.
Example `config.json`:

```json
{
  "apiKey": "YOUR_CAPSOLVER_API_KEY",
  "enabledForcaptcha": true,
  "captchaMode": "token",
  "enabledForRecaptchaV2": true,
  "reCaptchaV2Mode": "token",
  "solveInvisibleRecaptcha": true,
  "verbose": false
}
```

Replace `"YOUR_CAPSOLVER_API_KEY"` with your actual CapSolver API key.
Loading the CapSolver Extension in SeleniumBase
To use the CapSolver extension in SeleniumBase, we need to configure the browser to load the extension when it starts.
1. Modify Your SeleniumBase Script:
   - Import `ChromeOptions` from `selenium.webdriver.chrome.options`.
   - Set up the options to load the CapSolver extension.

   Example:

```python
from seleniumbase import BaseCase
from selenium.webdriver.chrome.options import Options as ChromeOptions
import os

class QuotesScraper(BaseCase):
    def setUp(self):
        super().setUp()
        # Path to the CapSolver extension
        extension_path = os.path.abspath('capsolver_extension')
        # Configure Chrome options
        options = ChromeOptions()
        options.add_argument(f"--load-extension={extension_path}")
        options.add_argument("--disable-gpu")
        options.add_argument("--no-sandbox")
        # Update the driver with the new options
        self.driver.quit()
        self.driver = self.get_new_driver(browser_name="chrome", options=options)
```

2. Ensure the Extension Path Is Correct:
   - Make sure `extension_path` points to the directory where you unzipped the CapSolver extension.
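Before launching the browser, it can help to fail fast when the path is wrong. A small sketch (this assumes the unzipped extension keeps its `manifest.json` at the top level, which is standard for Chrome extensions):

```python
import os

extension_path = os.path.abspath("capsolver_extension")
# A Chrome extension directory must contain a manifest.json at its root
manifest = os.path.join(extension_path, "manifest.json")
if not os.path.isfile(manifest):
    raise FileNotFoundError(
        f"CapSolver extension not found at {extension_path}; "
        "unzip the release into ./capsolver_extension first."
    )
```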
Example Script with CapSolver Integration
Here's a complete script that integrates CapSolver into SeleniumBase to solve CAPTCHAs automatically. Since quotes.toscrape.com has no CAPTCHAs, we'll use https://recaptcha-demo.appspot.com/recaptcha-v2-checkbox.php, a demo page with a real reCAPTCHA v2 checkbox, as our example site.
```python
# scrape_quotes_with_capsolver.py
from seleniumbase import BaseCase
from selenium.webdriver.chrome.options import Options as ChromeOptions
import os

class QuotesScraper(BaseCase):
    def setUp(self):
        super().setUp()
        # Path to the CapSolver extension folder
        # Ensure this path points to the CapSolver Chrome extension folder
        extension_path = os.path.abspath('capsolver_extension')
        # Configure Chrome options
        options = ChromeOptions()
        options.add_argument(f"--load-extension={extension_path}")
        options.add_argument("--disable-gpu")
        options.add_argument("--no-sandbox")
        # Update the driver with the new options
        self.driver.quit()  # Close any existing driver instance
        self.driver = self.get_new_driver(browser_name="chrome", options=options)

    def test_scrape_quotes(self):
        # Navigate to the target site with reCAPTCHA
        self.open("https://recaptcha-demo.appspot.com/recaptcha-v2-checkbox.php")
        # Check for CAPTCHA presence and solve if needed
        if self.is_element_visible("iframe[src*='recaptcha']"):
            # The CapSolver extension should handle the CAPTCHA automatically
            print("CAPTCHA detected, waiting for CapSolver extension to solve it...")
            # Wait for the CAPTCHA to be solved
            self.sleep(10)  # Adjust time based on average solving time
        # Proceed with scraping actions after the CAPTCHA is solved
        # Example action: clicking a button or extracting text
        self.assert_text("reCAPTCHA demo", "h1")  # Confirm page content

    def tearDown(self):
        # Clean up and close the browser after the test
        self.driver.quit()
        super().tearDown()

if __name__ == "__main__":
    BaseCase.main(__name__, __file__)
```
Explanation:

- setUp Method:
  - We override `setUp` to configure Chrome with the CapSolver extension before each test.
  - We specify the path to the CapSolver extension and add it to the Chrome options.
  - We quit the existing driver and create a new one with the updated options.
- test_scrape_quotes Method:
  - We navigate to the target website.
  - The CapSolver extension automatically detects and solves any CAPTCHA that appears.
  - We perform the scraping tasks as usual (see the sketch after this list for a more robust wait than a fixed sleep).
- tearDown Method:
  - We ensure the browser is closed after the test to free up resources.
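The fixed `self.sleep(10)` above is fragile, since solving can take more or less time than that. For reCAPTCHA v2 specifically, a common alternative is to poll the hidden `g-recaptcha-response` textarea that the page populates once a token is available. Here is a minimal sketch under that assumption (the helper name and the 60-second cap are our own choices, not part of CapSolver's API):

```python
# Polls the standard reCAPTCHA v2 response field instead of sleeping blindly.
# Assumes reCAPTCHA v2 markup; pass your BaseCase instance (self) as `sb`.
def wait_for_recaptcha_token(sb, max_seconds=60):
    for _ in range(max_seconds):
        token = sb.execute_script(
            "var el = document.getElementById('g-recaptcha-response');"
            "return el ? el.value : '';"
        )
        if token:
            return token  # The extension has injected a solved token
        sb.sleep(1)
    raise TimeoutError("CAPTCHA was not solved within the allotted time")
```

With this helper, the `self.sleep(10)` line in `test_scrape_quotes` could be replaced by `wait_for_recaptcha_token(self)`.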
Running the Script:

```bash
python scrape_quotes_with_capsolver.py
```
Note: Even though quotes.toscrape.com doesn't have CAPTCHAs, integrating CapSolver prepares your scraper for sites that do.
Bonus Code
Claim your Bonus Code for top captcha solutions at CapSolver: scrape. After redeeming it, you will get an extra 5% bonus on every recharge, with no limit on the number of redemptions.
Conclusion
In this guide, we've explored how to perform web scraping using SeleniumBase, covering basic scraping techniques and more advanced examples like handling pagination, dynamic content, and form submissions. We've also demonstrated how to integrate CapSolver into your SeleniumBase scripts to automatically solve CAPTCHAs, ensuring uninterrupted scraping sessions.
Compliance Disclaimer: The information provided on this blog is for informational purposes only. CapSolver is committed to compliance with all applicable laws and regulations. The use of the CapSolver network for illegal, fraudulent, or abusive activities is strictly prohibited and will be investigated. Our captcha-solving solutions enhance user experience while ensuring 100% compliance in helping solve captcha difficulties during public data crawling. We encourage responsible use of our services. For more information, please visit our Terms of Service and Privacy Policy.