Website Crawler to extract all urls
https://github.com/stangirard/crawlycolly

# Fast Crawler

The purpose of this crawler is to get all the pages of a website very quickly.

It uses a website's sitemaps to discover its pages.
The drawback is that pages that aren't listed in a sitemap won't be discovered.
However, it is a very fast and efficient way to get thousands of pages in seconds.
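
A minimal sketch of this sitemap-based approach, using the gocolly/colly library (which the project's tags suggest it builds on). This is not the repository's actual code; the colly v2 import path, the `.xml` suffix check, and the `https://example.com/sitemap.xml` entry point are assumptions made for illustration:

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Every <loc> element in a sitemap or sitemap index holds a URL.
	c.OnXML("//loc", func(e *colly.XMLElement) {
		loc := e.Text
		if strings.HasSuffix(loc, ".xml") {
			// A nested sitemap: follow it to discover more pages.
			e.Request.Visit(loc)
			return
		}
		// A regular page URL discovered via the sitemap.
		fmt.Println(loc)
	})

	// Assumed entry point: the site's sitemap index.
	if err := c.Visit("https://example.com/sitemap.xml"); err != nil {
		log.Fatal(err)
	}
}
```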

## Installation

- Install Golang

## Use the crawler

- Create the folder `urls` with the command `mkdir urls`
- List the websites you want to crawl in a file such as `urls_test`, one URL per line
- Compile with `go build *.go`
- Run with `cat urls_test | ./crawl`, or without compiling, `cat urls_test | go run *.go`
- The websites' URLs will be written to `urls_.csv` in the `urls` folder (a rough sketch of the input/output flow follows this list)
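
As a rough sketch of the plumbing described above, and not the repository's code: read one site URL per line from standard input and write the URLs discovered for each site to its own CSV file under `urls/`. The `discover` helper and the `urls_<host>.csv` naming are assumptions for illustration.

```go
package main

import (
	"bufio"
	"encoding/csv"
	"fmt"
	"log"
	"net/url"
	"os"
	"path/filepath"
)

// discover stands in for the sitemap crawl and would return every URL found for site.
func discover(site string) []string {
	return []string{site} // placeholder
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		site := scanner.Text()
		if site == "" {
			continue
		}
		parsed, err := url.Parse(site)
		if err != nil {
			log.Printf("skipping %q: %v", site, err)
			continue
		}

		// One CSV per website, written into the urls/ folder (e.g. urls/urls_example.com.csv).
		out, err := os.Create(filepath.Join("urls", fmt.Sprintf("urls_%s.csv", parsed.Host)))
		if err != nil {
			log.Fatal(err)
		}
		w := csv.NewWriter(out)
		w.Write([]string{"url"}) // header row, dropped later when merging
		for _, link := range discover(site) {
			w.Write([]string{link})
		}
		w.Flush()
		out.Close()
	}
}
```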

At the end of your crawl, if you want to merge all the files into one, run the following in the `urls` folder: `for filename in *.csv; do sed 1d "$filename" >> ../final.csv; done` (`sed 1d` drops the first line of each file).

## Disclaimer

Please be advised that even though this crawler doesn't visit every page of a website, it can still be very intensive for large websites.
Feel free to open a pull request to improve the crawler.