Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sieep-coding/web-crawler
A simple web crawler implemented in Go.
- Host: GitHub
- URL: https://github.com/sieep-coding/web-crawler
- Owner: Sieep-Coding
- License: unlicense
- Created: 2024-04-28T20:20:21.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-04-30T19:55:05.000Z (6 months ago)
- Last Synced: 2024-04-30T22:08:03.494Z (6 months ago)
- Topics: crawler, go, golang, web-crawler
- Language: Go
- Homepage:
- Size: 10.1 MB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## See it in action
![](https://github.com/Sieep-Coding/web-crawler/blob/main/gif.gif)
Go Web Crawler
This is a concurrent web crawler implemented in Go.
It allows you to crawl websites, extract links, and scrape specific data from the visited pages.
Features
- Crawls web pages concurrently using goroutines (see the sketch after this list)
- Extracts links from the visited pages
- Scrapes data such as page title, meta description, meta keywords, headings, paragraphs, image URLs, external links, and table data from the visited pages
- Supports configurable crawling depth
- Handles relative and absolute URLs
- Tracks visited URLs to avoid duplicate crawling
- Provides timing information for the crawling process
- Saves the extracted data in a well-formatted CSV file
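As a rough sketch of how the concurrent crawling and visited-URL tracking described above can fit together, the snippet below uses goroutines plus a mutex-protected set. The type and function names (Crawler, Visit, extractLinks) are illustrative only; the repository's own implementation in crawler/crawler.go may be organized differently.

```go
package crawler

import (
	"net/http"
	"sync"
)

// Crawler is an illustrative type: a mutex-protected set of visited URLs
// plus a WaitGroup so the caller can wait for every goroutine to finish.
type Crawler struct {
	mu      sync.Mutex
	visited map[string]bool
	wg      sync.WaitGroup
}

// NewCrawler returns a Crawler with an initialized visited set.
func NewCrawler() *Crawler {
	return &Crawler{visited: make(map[string]bool)}
}

// Visit fetches a URL (unless it has already been seen), extracts its links,
// and crawls them in new goroutines until maxDepth is reached.
func (c *Crawler) Visit(url string, depth, maxDepth int) {
	if depth > maxDepth {
		return
	}

	// Mark the URL as visited; bail out if another goroutine got there first.
	c.mu.Lock()
	if c.visited[url] {
		c.mu.Unlock()
		return
	}
	c.visited[url] = true
	c.mu.Unlock()

	resp, err := http.Get(url)
	if err != nil {
		return
	}
	defer resp.Body.Close()

	// Fan out over the discovered links.
	for _, link := range extractLinks(resp) {
		c.wg.Add(1)
		go func(l string) {
			defer c.wg.Done()
			c.Visit(l, depth+1, maxDepth)
		}(link)
	}
}

// Wait blocks until every in-flight goroutine has finished.
func (c *Crawler) Wait() { c.wg.Wait() }

// extractLinks stands in for the project's real link extraction; it returns
// nothing here so the sketch compiles on its own.
func extractLinks(resp *http.Response) []string { return nil }
```

A typical caller would construct the crawler, call Visit on the start URL, and then Wait for all goroutines to finish before writing the collected results to CSV.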
Installation
- Make sure you have Go installed on your system. You can download and install Go from the official website: https://golang.org
- Clone this repository to your local machine:
git clone https://github.com/sieep-coding/web-crawler.git
- Navigate to the project directory:
cd web-crawler
- Install the required dependencies:
go mod download
Usage
- Open a terminal and navigate to the project directory.
- Run the following command to start the web crawler:
go run main.go <url>
Replace <url> with the URL you want to crawl (an example invocation follows below).
- Wait for the crawling process to complete. The crawler will display progress and timing information in the terminal.
- Once the crawling is finished, the extracted data will be saved in a CSV file named crawl_results.csv in the project directory.
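For example, to crawl a single site you might run (the URL here is just a placeholder):
go run main.go https://example.com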
Customization
You can customize the web crawler according to your needs:
- Modify the processPage function in crawler/page.go to extract additional data from the visited pages using the goquery package (a goquery sketch follows this list).
- Extend the Crawler struct in crawler/crawler.go to include more fields for storing extracted data.
- Customize the CSV file generation in main.go to match your desired format.
- Implement rate limiting to avoid overloading the target website (a minimal rate-limiting sketch also follows below).
- Add support for handling robots.txt and respecting crawling restrictions.
- Integrate the crawler with a database or file storage to persist the extracted data.
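To illustrate the first customization point, a goquery-based extraction step might look like the sketch below. The helper name, the PageData fields, and the selectors are assumptions for illustration; the actual processPage function in crawler/page.go may differ.

```go
package crawler

import (
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

// PageData holds a few example fields; the project's real structures may
// store different ones.
type PageData struct {
	Title       string
	Description string
	Headings    []string
}

// extractPageData is a hypothetical helper showing how goquery selectors can
// pull additional data out of a fetched page.
func extractPageData(resp *http.Response) (*PageData, error) {
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil, err
	}

	data := &PageData{Title: doc.Find("title").Text()}

	// Meta description, if the page declares one.
	if desc, ok := doc.Find(`meta[name="description"]`).Attr("content"); ok {
		data.Description = desc
	}

	// Collect every h1 and h2 heading.
	doc.Find("h1, h2").Each(func(_ int, s *goquery.Selection) {
		data.Headings = append(data.Headings, s.Text())
	})

	return data, nil
}
```

For the rate-limiting suggestion, one lightweight option is to gate every request on a shared time.Ticker, as in this minimal sketch (the interval is arbitrary):

```go
package crawler

import (
	"net/http"
	"time"
)

// limiter ticks roughly twice per second; every fetch waits for a tick, so
// concurrent workers collectively stay under that request rate.
var limiter = time.NewTicker(500 * time.Millisecond)

// rateLimitedGet blocks until the next tick before issuing the request.
func rateLimitedGet(url string) (*http.Response, error) {
	<-limiter.C
	return http.Get(url)
}
```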
License
This project is licensed under the UNLICENSE.