CAPSOLVER
Blog
Web Scraping in Golang with Colly

Web Scraping in Golang with Colly

Logo of CapSolver

Sora Fujimoto

AI Solutions Architect

04-Jul-2024

Web scraping is a method used to extract data from websites. In Golang, the Colly library is a popular tool for web scraping due to its simplicity and powerful features. This guide will take you through setting up a Golang project with Colly, building a basic scraper, handling complex data extraction scenarios, and optimizing your scrapers with concurrent requests.

Setting Up Your Golang Project

Before you begin, ensure you have Go installed on your system. Initialize your project and fetch the Colly package with these commands:

go mod init my_scraper
go get -u github.com/gocolly/colly

This sets up your project directory and installs the Colly package.

Struggling with the repeated failure to completely solve the irritating captcha?

Discover seamless automatic captcha solving with Capsolver AI-powered Auto Web Unblock technology!

Claim Your Bonus Code for top captcha solutions; CapSolver: WEBS. After redeeming it, you will get an extra 5% bonus after each recharge, Unlimited

Building a Basic Scraper

Let's create a basic scraper to extract all links from a specific Wikipedia page.

  1. Create a new file main.go and add the following code:
package main

import (
    "fmt"
    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("en.wikipedia.org"),
    )

    c.OnHTML(".mw-parser-output", func(e *colly.HTMLElement) {
        links := e.ChildAttrs("a", "href")
        fmt.Println(links)
    })

    c.Visit("https://en.wikipedia.org/wiki/Web_scraping")
}

This code initializes a new Colly collector restricted to en.wikipedia.org, then sets up a callback to find and print all links within the .mw-parser-output div of the page.

Scraping Table Data

For more complex tasks like scraping table data and writing it to a CSV file, you can use the encoding/csv package in Go:

  1. Extend main.go with the following code to scrape table data:
package main

import (
    "encoding/csv"
    "log"
    "os"
    "github.com/gocolly/colly"
)

func main() {
    fName := "data.csv"
    file, err := os.Create(fName)
    if err != nil {
        log.Fatalf("Could not create file, err: %q", err)
        return
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()

    c := colly.NewCollector()

    c.OnHTML("table.wikitable", func(e *colly.HTMLElement) {
        e.ForEach("tr", func(_ int, row *colly.HTMLElement) {
            rowData := []string{}
            row.ForEach("td", func(_ int, cell *colly.HTMLElement) {
                rowData = append(rowData, cell.Text)
            })
            writer.Write(rowData)
        })
    })

    c.Visit("https://en.wikipedia.org/wiki/List_of_programming_languages")
}

This script scrapes table data from a Wikipedia page and writes it to data.csv.

Making Concurrent Requests

To speed up scraping, you can make concurrent requests using Go's goroutines. Here's how you can scrape multiple pages concurrently:

package main

import (
    "fmt"
    "github.com/gocolly/colly"
    "sync"
)

func scrape(url string, wg *sync.WaitGroup) {
    defer wg.Done()
    
    c := colly.NewCollector()
    
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Title found:", e.Text)
    })
    
    c.Visit(url)
}

func main() {
    var wg sync.WaitGroup
    urls := []string{
        "https://en.wikipedia.org/wiki/Web_scraping",
        "https://en.wikipedia.org/wiki/Data_mining",
        "https://en.wikipedia.org/wiki/Screen_scraping",
    }

    for _, url := range urls {
        wg.Add(1)
        go scrape(url, &wg)
    }

    wg.Wait()
}

In this example, we define a scrape function that takes a URL and a wait group as arguments. The function initializes a Colly collector, sets up a callback to print the title of the page, and visits the URL. The main function creates a wait group, iterates over a list of URLs, and starts a goroutine for each URL to scrape concurrently.

By following these steps, you can build robust web scrapers in Golang using Colly, handle various scraping scenarios, and optimize performance with concurrent requests. For more detailed tutorials and advanced usage, check out resources on web scraping with Go and Colly.

Other Web Scraping Libraries for Go

In addition to Colly, there are several other excellent libraries for web scraping in Golang:

  • GoQuery: This library offers a syntax and feature set similar to jQuery, allowing you to perform web scraping operations with ease, much like you would in jQuery.
  • Ferret: A portable, extensible, and fast web scraping system designed to simplify data extraction from the web. Ferret focuses on data extraction using a unique declarative language.
  • Selenium: Known for its headless browser capabilities, Selenium is ideal for scraping dynamic content. While it doesn't have official support for Go, there is a port available that allows its use in Golang projects.

Conclusion

Web scraping is a powerful and essential skill for efficiently extracting data from websites. Using Golang and the Colly library, you can build robust scrapers that handle various data extraction scenarios, from collecting simple links to extracting complex table data and optimizing performance with concurrent requests.

In this guide, you learned how to:

  1. Set up a Golang project with the Colly library.
  2. Build a basic scraper to extract links from a webpage.
  3. Handle more complex data extraction, such as scraping table data and writing it to a CSV file.
  4. Optimize your scrapers by making concurrent requests.

By following these steps, you can create effective and efficient web scrapers in Golang, leveraging the simplicity and powerful features of Colly. For more advanced usage and detailed tutorials, explore additional resources on web scraping with Go and Colly.

Compliance Disclaimer: The information provided on this blog is for informational purposes only. CapSolver is committed to compliance with all applicable laws and regulations. The use of the CapSolver network for illegal, fraudulent, or abusive activities is strictly prohibited and will be investigated. Our captcha-solving solutions enhance user experience while ensuring 100% compliance in helping solve captcha difficulties during public data crawling. We encourage responsible use of our services. For more information, please visit our Terms of Service and Privacy Policy.

More

How to Solve CAPTCHA with Selenium and Node.js when Scraping
How to Solve CAPTCHA with Selenium and Node.js when Scraping

If you’re facing continuous CAPTCHA issues in your scraping efforts, consider using some tools and their advanced technology to ensure you have a reliable solution

The other captcha
Logo of CapSolver

Lucas Mitchell

15-Oct-2024

Solving 403 Forbidden Errors When Crawling Websites with Python
Solving 403 Forbidden Errors When Crawling Websites with Python

Learn how to overcome 403 Forbidden errors when crawling websites with Python. This guide covers IP rotation, user-agent spoofing, request throttling, authentication handling, and using headless browsers to bypass access restrictions and continue web scraping successfully.

The other captcha
Logo of CapSolver

Sora Fujimoto

01-Aug-2024

How to Use Selenium Driverless for Efficient Web Scraping
How to Use Selenium Driverless for Efficient Web Scraping

Learn how to use Selenium Driverless for efficient web scraping. This guide provides step-by-step instructions on setting up your environment, writing your first Selenium Driverless script, and handling dynamic content. Streamline your web scraping tasks by avoiding the complexities of traditional WebDriver management, making your data extraction process simpler, faster, and more portable.

The other captcha
Logo of CapSolver

Lucas Mitchell

01-Aug-2024

Scrapy vs. Selenium
Scrapy vs. Selenium: What's Best for Your Web Scraping Project

Discover the strengths and differences between Scrapy and Selenium for web scraping. Learn which tool suits your project best and how to handle challenges like CAPTCHAs.

The other captcha
Logo of CapSolver

Ethan Collins

24-Jul-2024

API vs Scraping
API vs Scraping : the best way to obtain the data

Understand the differences, pros, and cons of Web Scraping and API Scraping to choose the best data collection method. Explore CapSolver for bot challenge solutions.

The other captcha
Logo of CapSolver

Ethan Collins

15-Jul-2024

How to solve CAPTCHA With Selenium C#
How to solve CAPTCHA With Selenium C#

At the end of this tutorial, you'll have a solid understanding of How to solve CAPTCHA With Selenium C#

The other captcha
Logo of CapSolver

Rajinder Singh

10-Jul-2024