Website Crawler to extract all urls
https://github.com/stangirard/crawlycolly

# Fast Crawler

The purpose of this crawler is to get all the pages of a website very quickly.

It uses a website's sitemaps to discover its pages.
The drawback is that pages that aren't listed in a sitemap won't be discovered.
However, it is a very fast and efficient way to get thousands of pages in seconds.
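
A minimal sketch of this sitemap-based approach, using the gocolly/colly library (which the project's tags suggest it builds on). This is not the repository's actual code; the colly v2 import path, the `.xml` suffix check, and the `https://example.com/sitemap.xml` entry point are assumptions made for illustration:

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Every <loc> element in a sitemap or sitemap index holds a URL.
	c.OnXML("//loc", func(e *colly.XMLElement) {
		loc := e.Text
		if strings.HasSuffix(loc, ".xml") {
			// A nested sitemap: follow it to discover more pages.
			e.Request.Visit(loc)
			return
		}
		// A regular page URL discovered via the sitemap.
		fmt.Println(loc)
	})

	// Assumed entry point: the site's sitemap index.
	if err := c.Visit("https://example.com/sitemap.xml"); err != nil {
		log.Fatal(err)
	}
}
```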

## Installation

- Install Golang

## Use the crawler

- Create the folder `urls` with the command `mkdir urls`
- List the websites you want to crawl in a file such as `urls_test`, one URL per line
- Compile with `go build *.go`
- Run with `cat urls_test | ./crawl`, or without compiling, `cat urls_test | go run *.go`
- The websites' URLs will be written to `urls_.csv` in the `urls` folder (a rough sketch of the input/output flow follows this list)
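
As a rough sketch of the plumbing described above, and not the repository's code: read one site URL per line from standard input and write the URLs discovered for each site to its own CSV file under `urls/`. The `discover` helper and the `urls_<host>.csv` naming are assumptions for illustration.

```go
package main

import (
	"bufio"
	"encoding/csv"
	"fmt"
	"log"
	"net/url"
	"os"
	"path/filepath"
)

// discover stands in for the sitemap crawl and would return every URL found for site.
func discover(site string) []string {
	return []string{site} // placeholder
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		site := scanner.Text()
		if site == "" {
			continue
		}
		parsed, err := url.Parse(site)
		if err != nil {
			log.Printf("skipping %q: %v", site, err)
			continue
		}

		// One CSV per website, written into the urls/ folder (e.g. urls/urls_example.com.csv).
		out, err := os.Create(filepath.Join("urls", fmt.Sprintf("urls_%s.csv", parsed.Host)))
		if err != nil {
			log.Fatal(err)
		}
		w := csv.NewWriter(out)
		w.Write([]string{"url"}) // header row, dropped later when merging
		for _, link := range discover(site) {
			w.Write([]string{link})
		}
		w.Flush()
		out.Close()
	}
}
```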

At the end of your crawl, if you want to merge all the files into one, run the following in the `urls` folder: `for filename in *.csv; do sed 1d "$filename" >> ../final.csv; done` (`sed 1d` drops the first line of each file).

## Disclaimer

Please be advised that even though this crawler doesn't visit every page of a website, it can still be very intensive for large websites.
Feel free to open a pull request to improve the crawler.