## See it in action

![](https://github.com/Sieep-Coding/web-crawler/blob/main/gif.gif)

# Go Web Crawler

This is a concurrent web crawler implemented in Go. It allows you to crawl websites, extract links, and scrape specific data from the visited pages.


## Features

- Crawls web pages concurrently using goroutines (a minimal sketch of this pattern follows the list)
- Extracts links from the visited pages
- Scrapes data such as the page title, meta description, meta keywords, headings, paragraphs, image URLs, external links, and table data from the visited pages
- Supports a configurable crawling depth
- Handles relative and absolute URLs
- Tracks visited URLs to avoid duplicate crawling
- Provides timing information for the crawling process
- Saves the extracted data in a well-formatted CSV file
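The crawler's source is not reproduced in this README, so the sketch below only illustrates the general pattern the features above describe: one goroutine per page, a shared visited set guarded by a mutex, and a depth limit. The type and function names (`Crawler`, `Crawl`, `fetchLinks`) are placeholders, not the repository's actual API.

```go
// Minimal sketch of concurrent crawling with a shared visited-URL set.
// Names are illustrative, not the repository's actual API.
package main

import (
	"fmt"
	"sync"
)

type Crawler struct {
	mu      sync.Mutex
	visited map[string]bool
	wg      sync.WaitGroup
}

// markVisited atomically records the URL and reports whether it was new.
func (c *Crawler) markVisited(url string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.visited[url] {
		return false
	}
	c.visited[url] = true
	return true
}

// Crawl visits url, then crawls its links in new goroutines
// until the configured depth is exhausted.
func (c *Crawler) Crawl(url string, depth int) {
	defer c.wg.Done()
	if depth <= 0 || !c.markVisited(url) {
		return
	}
	for _, link := range fetchLinks(url) {
		c.wg.Add(1)
		go c.Crawl(link, depth-1)
	}
}

// fetchLinks stands in for the HTTP fetch and link extraction done by the real crawler.
func fetchLinks(url string) []string {
	fmt.Println("visiting", url)
	return nil
}

func main() {
	c := &Crawler{visited: make(map[string]bool)}
	c.wg.Add(1)
	go c.Crawl("https://example.com", 2)
	c.wg.Wait()
}
```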


## Installation

1. Make sure you have Go installed on your system. You can download and install Go from the official website: https://golang.org
2. Clone this repository to your local machine:

       git clone https://github.com/sieep-coding/web-crawler.git

3. Navigate to the project directory:

       cd web-crawler

4. Install the required dependencies:

       go mod download

## Usage

1. Open a terminal and navigate to the project directory.
2. Run the following command to start the web crawler, replacing `<url>` with the URL you want to crawl:

       go run main.go <url>

3. Wait for the crawling process to complete. The crawler displays progress and timing information in the terminal.
4. Once crawling is finished, the extracted data is saved in a CSV file named `crawl_results.csv` in the project directory (a sketch for reading it back follows this list).
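The column layout of `crawl_results.csv` is not documented here, so if you want to post-process the results in Go, a generic reader along these lines should work regardless of the exact columns. Only the file name is taken from the step above; everything else is a sketch.

```go
// Generic reader for the crawler's CSV output; column layout is not assumed.
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("crawl_results.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r := csv.NewReader(f)
	r.FieldsPerRecord = -1 // rows may have varying numbers of fields
	rows, err := r.ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	for i, row := range rows {
		fmt.Printf("row %d: %v\n", i, row)
	}
}
```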


## Customization

You can customize the web crawler according to your needs:

- Modify the `processPage` function in `crawler/page.go` to extract additional data from the visited pages using the `goquery` package (see the sketch after this list).
- Extend the `Crawler` struct in `crawler/crawler.go` to include more fields for storing extracted data.
- Customize the CSV file generation in `main.go` to match your desired format.
- Implement rate limiting to avoid overloading the target website.
- Add support for handling `robots.txt` and respecting crawling restrictions.
- Integrate the crawler with a database or file storage to persist the extracted data.
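As a starting point for the first item, here is a self-contained sketch of the kind of `goquery` extraction you might add. It fetches a page directly instead of hooking into the repository's `processPage` function, whose exact signature is not shown here, so treat the selectors and structure as illustrative only.

```go
// Standalone goquery extraction sketch; adapt the selectors to your needs.
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	resp, err := http.Get("https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Collect every image's URL and alt text.
	doc.Find("img").Each(func(_ int, s *goquery.Selection) {
		src, _ := s.Attr("src")
		alt, _ := s.Attr("alt")
		fmt.Printf("img: %s (alt: %q)\n", src, alt)
	})

	// Pull the text of every blockquote as an example of additional data.
	doc.Find("blockquote").Each(func(_ int, s *goquery.Selection) {
		fmt.Println("quote:", s.Text())
	})
}
```

For the rate-limiting item, the `golang.org/x/time/rate` package provides a token-bucket limiter that can be waited on before each request.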


## License

This project is licensed under the UNLICENSE.