https://github.com/schollz/linkcrawler
Cross-platform persistent and distributed web crawler :link:
- Host: GitHub
- URL: https://github.com/schollz/linkcrawler
- Owner: schollz
- License: mit
- Created: 2017-03-08T17:57:34.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2017-09-09T14:56:30.000Z (over 8 years ago)
- Last Synced: 2025-04-11T19:06:42.382Z (about 1 year ago)
- Topics: crawler, hyperlinks, web
- Language: Go
- Homepage:
- Size: 59.6 KB
- Stars: 111
- Watchers: 8
- Forks: 9
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Cross-platform persistent and distributed web crawler
*linkcrawler* is persistent because the queue is stored in a remote database and is automatically re-initialized if interrupted. *linkcrawler* is distributed because multiple instances of *linkcrawler* work on the same remotely stored queue, so you can start as many crawlers as you want on separate machines to speed up the process. *linkcrawler* is also fast because it is threaded and uses connection pools.
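The queue-sharing idea can be sketched in Go. Below, an in-memory queue stands in for the remote boltdb-backed one, and a pool of goroutines plays the role of separate crawler instances; the names `queue` and `crawlAll` are illustrative, not linkcrawler's actual API:

```go
package main

import (
	"fmt"
	"sync"
)

// queue is an in-memory stand-in for linkcrawler's remote, boltdb-backed
// queue; in the real tool every crawler instance pops from the same server.
type queue struct {
	mu    sync.Mutex
	links []string
}

// pop removes and returns the next link, or reports the queue is empty.
func (q *queue) pop() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.links) == 0 {
		return "", false
	}
	link := q.links[0]
	q.links = q.links[1:]
	return link, true
}

// crawlAll drains the shared queue with a pool of workers and returns how
// many links were processed (hypothetical helper, for illustration only).
func crawlAll(links []string, workers int) int {
	q := &queue{links: links}
	var wg sync.WaitGroup
	var mu sync.Mutex
	crawled := 0
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				link, ok := q.pop()
				if !ok {
					return
				}
				// a real worker would fetch link here and push any
				// newly discovered links back onto the remote queue
				_ = link
				mu.Lock()
				crawled++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return crawled
}

func main() {
	n := crawlAll([]string{"http://rpiai.com/a", "http://rpiai.com/b", "http://rpiai.com/c"}, 4)
	fmt.Println("crawled", n, "links") // crawled 3 links
}
```

Because every worker pops from the same queue, adding workers (or machines) divides the work without any coordination beyond the queue itself.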
Crawl responsibly.
# This repo has been superseded by [schollz/goredis-crawler](https://github.com/schollz/goredis-crawler)
Getting Started
===============
## Install
If you have Go installed, just do
```sh
go get github.com/schollz/linkcrawler/...
go get github.com/schollz/boltdb-server/...
```
Otherwise, use the releases and [download linkcrawler](https://github.com/schollz/linkcrawler/releases/latest) and then [download the boltdb-server](https://github.com/schollz/boltdb-server/releases/latest).
## Run
### Crawl a site
First run the database server which will create a LAN hub:
```sh
$ ./boltdb-server
boltdb-server running on http://X.Y.Z.W:8050
```
Then, to capture all the links on a website:
```sh
$ linkcrawler --server http://X.Y.Z.W:8050 crawl http://rpiai.com
```
Make sure to replace `http://X.Y.Z.W:8050` with the address printed by the boltdb-server.
You can run this last command on as many different machines as you want; each instance will crawl the same website and add the links it collects to the shared queue on the server.
The current state of the crawler is saved. If the crawler is interrupted, you can simply run the command again and it will restart from the last state.
See the help (`-help`) if you'd like to see more options, such as exclusions/inclusions and modifying the worker pool and connection pools.
### Download a site
You can also use *linkcrawler* to download webpages from a newline-delimited list of websites. As before, first start up a boltdb-server. Then you can run:
```sh
$ linkcrawler --server http://X.Y.Z.W:8050 download links.txt
```
Downloads are saved into a folder `downloaded`, with the URL of each link encoded in Base32 and the content compressed using gzip.
### Dump the current list of links
To dump the current database, just use
```sh
$ linkcrawler --server http://X.Y.Z.W:8050 dump http://rpiai.com
Wrote 32 links to NB2HI4B2F4XXE4DJMFUS4Y3PNU======.txt
```
## License
MIT