Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sieep-coding/web-crawler
A simple web crawler implemented in Go.
- Host: GitHub
- URL: https://github.com/sieep-coding/web-crawler
- Owner: Sieep-Coding
- License: unlicense
- Created: 2024-04-28T20:20:21.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-04-30T19:55:05.000Z (6 months ago)
- Last Synced: 2024-04-30T22:08:03.494Z (6 months ago)
- Topics: crawler, go, golang, web-crawler
- Language: Go
- Homepage:
- Size: 10.1 MB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## See it in action
![](https://github.com/Sieep-Coding/web-crawler/blob/main/gif.gif)
Go Web Crawler
This is a concurrent web crawler implemented in Go.
It allows you to crawl websites, extract links, and scrape specific data from the visited pages.
Features
- Crawls web pages concurrently using goroutines (see the sketch after this list)
- Extracts links from the visited pages
- Scrapes data such as page title, meta description, meta keywords, headings, paragraphs, image URLs, external links, and table data from the visited pages
- Supports configurable crawling depth
- Handles relative and absolute URLs
- Tracks visited URLs to avoid duplicate crawling
- Provides timing information for the crawling process
- Saves the extracted data in a well-formatted CSV file
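As a rough sketch of how the concurrent crawling and visited-URL tracking described above can fit together, the snippet below uses goroutines plus a mutex-protected set. The type and function names (Crawler, Visit, extractLinks) are illustrative only; the repository's own implementation in crawler/crawler.go may be organized differently.

```go
package crawler

import (
	"net/http"
	"sync"
)

// Crawler is an illustrative type: a mutex-protected set of visited URLs
// plus a WaitGroup so the caller can wait for every goroutine to finish.
type Crawler struct {
	mu      sync.Mutex
	visited map[string]bool
	wg      sync.WaitGroup
}

// NewCrawler returns a Crawler with an initialized visited set.
func NewCrawler() *Crawler {
	return &Crawler{visited: make(map[string]bool)}
}

// Visit fetches a URL (unless it has already been seen), extracts its links,
// and crawls them in new goroutines until maxDepth is reached.
func (c *Crawler) Visit(url string, depth, maxDepth int) {
	if depth > maxDepth {
		return
	}

	// Mark the URL as visited; bail out if another goroutine got there first.
	c.mu.Lock()
	if c.visited[url] {
		c.mu.Unlock()
		return
	}
	c.visited[url] = true
	c.mu.Unlock()

	resp, err := http.Get(url)
	if err != nil {
		return
	}
	defer resp.Body.Close()

	// Fan out over the discovered links.
	for _, link := range extractLinks(resp) {
		c.wg.Add(1)
		go func(l string) {
			defer c.wg.Done()
			c.Visit(l, depth+1, maxDepth)
		}(link)
	}
}

// Wait blocks until every in-flight goroutine has finished.
func (c *Crawler) Wait() { c.wg.Wait() }

// extractLinks stands in for the project's real link extraction; it returns
// nothing here so the sketch compiles on its own.
func extractLinks(resp *http.Response) []string { return nil }
```

A typical caller would construct the crawler, call Visit on the start URL, and then Wait for all goroutines to finish before writing the collected results to CSV.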
Installation
- Make sure you have Go installed on your system. You can download and install Go from the official website: https://golang.org
- Clone this repository to your local machine:
git clone https://github.com/sieep-coding/web-crawler.git
- Navigate to the project directory:
cd web-crawler
- Install the required dependencies:
go mod download
Usage
- Open a terminal and navigate to the project directory.
- Run the following command to start the web crawler:
go run main.go <url>
Replace <url> with the URL you want to crawl (an example invocation follows below).
- Wait for the crawling process to complete. The crawler will display progress and timing information in the terminal.
- Once the crawling is finished, the extracted data will be saved in a CSV file named crawl_results.csv in the project directory.
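For example, to crawl a single site you might run (the URL here is just a placeholder):
go run main.go https://example.com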
Customization
You can customize the web crawler according to your needs:
- Modify the processPage function in crawler/page.go to extract additional data from the visited pages using the goquery package (a goquery sketch follows this list).
- Extend the Crawler struct in crawler/crawler.go to include more fields for storing extracted data.
- Customize the CSV file generation in main.go to match your desired format.
- Implement rate limiting to avoid overloading the target website (a minimal rate-limiting sketch also follows below).
- Add support for handling robots.txt and respecting crawling restrictions.
- Integrate the crawler with a database or file storage to persist the extracted data.
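To illustrate the first customization point, a goquery-based extraction step might look like the sketch below. The helper name, the PageData fields, and the selectors are assumptions for illustration; the actual processPage function in crawler/page.go may differ.

```go
package crawler

import (
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

// PageData holds a few example fields; the project's real structures may
// store different ones.
type PageData struct {
	Title       string
	Description string
	Headings    []string
}

// extractPageData is a hypothetical helper showing how goquery selectors can
// pull additional data out of a fetched page.
func extractPageData(resp *http.Response) (*PageData, error) {
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil, err
	}

	data := &PageData{Title: doc.Find("title").Text()}

	// Meta description, if the page declares one.
	if desc, ok := doc.Find(`meta[name="description"]`).Attr("content"); ok {
		data.Description = desc
	}

	// Collect every h1 and h2 heading.
	doc.Find("h1, h2").Each(func(_ int, s *goquery.Selection) {
		data.Headings = append(data.Headings, s.Text())
	})

	return data, nil
}
```

For the rate-limiting suggestion, one lightweight option is to gate every request on a shared time.Ticker, as in this minimal sketch (the interval is arbitrary):

```go
package crawler

import (
	"net/http"
	"time"
)

// limiter ticks roughly twice per second; every fetch waits for a tick, so
// concurrent workers collectively stay under that request rate.
var limiter = time.NewTicker(500 * time.Millisecond)

// rateLimitedGet blocks until the next tick before issuing the request.
func rateLimitedGet(url string) (*http.Response, error) {
	<-limiter.C
	return http.Get(url)
}
```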
License
This project is licensed under the UNLICENSE.