Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Website Crawler to extract all urls
https://github.com/stangirard/crawlycolly
- Host: GitHub
- URL: https://github.com/stangirard/crawlycolly
- Owner: StanGirard
- License: mit
- Created: 2019-12-06T23:31:48.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2020-04-20T20:59:58.000Z (over 4 years ago)
- Last Synced: 2024-10-12T19:46:48.378Z (3 months ago)
- Topics: colly, crawler, discover, golang, sitemap
- Language: Go
- Homepage: https://primates.dev
- Size: 15.6 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README
# Fast Crawler
The purpose of this crawler is to fetch all the pages of a website very quickly.
It uses a website's sitemaps to discover its pages.
The drawback is that pages not listed in the sitemap won't be discovered; however, this is a very fast and efficient way to collect thousands of URLs in seconds.
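The repository doesn't show its code here, but given the `colly` and `sitemap` topics, the discovery step presumably resembles the minimal sketch below. The import path, XPath expressions, and entry URL are assumptions for illustration, not the repository's actual implementation.

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly" // assumed dependency; the repo's topics mention colly
)

func main() {
	c := colly.NewCollector()

	// Each <loc> entry in a sitemap is a page URL; collect it instead of
	// visiting the page itself, which is what keeps the crawl fast.
	c.OnXML("//urlset/url/loc", func(e *colly.XMLElement) {
		fmt.Println(e.Text)
	})

	// Sitemap index files point at further sitemaps; follow those too.
	c.OnXML("//sitemapindex/sitemap/loc", func(e *colly.XMLElement) {
		c.Visit(e.Text)
	})

	// Hypothetical entry point; the real tool takes its targets from stdin.
	c.Visit("https://example.com/sitemap.xml")
}
```

Reading only the sitemap XML means one or two fetches per site instead of one fetch per page, which is why thousands of URLs can be collected in seconds.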
## Installation
- Install Golang
## Use the crawler
- Create the folder `urls` with the command `mkdir urls`
- List the websites you want to crawl in a file such as `urls_test`, one URL per line
- Compile with `go build *.go`
- Run with `cat urls_test | ./crawl`, or, without compiling, `cat urls_test | go run *.go`
- The websites' URLs will be written to `urls_.csv`. At the end of the crawl, if you want to merge all the files, run the following inside the `urls` folder: `for filename in $(ls *.csv); do sed 1d $filename >> ../final.csv; done` (see the sketch after this list)
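For orientation only: the pipe usage above implies the binary reads one site per line from stdin and writes one CSV per site into `urls/`. The sketch below shows that I/O scaffolding under those assumptions; `crawlSitemap`, the file-name pattern, and the header row are all hypothetical, not taken from the repository.

```go
package main

import (
	"bufio"
	"encoding/csv"
	"fmt"
	"net/url"
	"os"
)

func main() {
	// One target site per line on stdin, as in: cat urls_test | ./crawl
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		site := scanner.Text()
		if site == "" {
			continue
		}

		u, err := url.Parse(site)
		if err != nil {
			fmt.Fprintln(os.Stderr, "skipping invalid url:", site)
			continue
		}

		// Hypothetical output path; the README refers to CSV files in the urls folder.
		f, err := os.Create("urls/urls_" + u.Hostname() + ".csv")
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}

		w := csv.NewWriter(f)
		w.Write([]string{"url"}) // header row, later stripped by `sed 1d` when merging

		// crawlSitemap stands in for the colly-based discovery sketched earlier.
		for _, page := range crawlSitemap(site) {
			w.Write([]string{page})
		}
		w.Flush()
		f.Close()
	}
}

// crawlSitemap is a stub standing in for the sitemap crawl.
func crawlSitemap(site string) []string {
	return nil
}
```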
## Disclaimer
Please be advised that even though this crawler doesn't visit every page of a website, it can still be quite intensive for large websites.
Feel free to open pull requests to improve the crawler.