https://github.com/schollz/linkcrawler
Cross-platform persistent and distributed web crawler :link:
- Host: GitHub
- URL: https://github.com/schollz/linkcrawler
- Owner: schollz
- License: mit
- Created: 2017-03-08T17:57:34.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2017-09-09T14:56:30.000Z (over 8 years ago)
- Last Synced: 2025-04-11T19:06:42.382Z (about 1 year ago)
- Topics: crawler, hyperlinks, web
- Language: Go
- Homepage:
- Size: 59.6 KB
- Stars: 111
- Watchers: 8
- Forks: 9
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Cross-platform persistent and distributed web crawler
*linkcrawler* is persistent because the queue is stored in a remote database and is automatically re-initialized if interrupted. *linkcrawler* is distributed because multiple instances of *linkcrawler* work on the same remotely stored queue, so you can start as many crawlers as you want on separate machines to speed up the process. *linkcrawler* is also fast because it is threaded and uses connection pools.
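The queue-sharing idea can be sketched in Go. Below, an in-memory queue stands in for the remote boltdb-backed one, and a pool of goroutines plays the role of separate crawler instances; the names `queue` and `crawlAll` are illustrative, not linkcrawler's actual API:

```go
package main

import (
	"fmt"
	"sync"
)

// queue is an in-memory stand-in for linkcrawler's remote, boltdb-backed
// queue; in the real tool every crawler instance pops from the same server.
type queue struct {
	mu    sync.Mutex
	links []string
}

// pop removes and returns the next link, or reports the queue is empty.
func (q *queue) pop() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.links) == 0 {
		return "", false
	}
	link := q.links[0]
	q.links = q.links[1:]
	return link, true
}

// crawlAll drains the shared queue with a pool of workers and returns how
// many links were processed (hypothetical helper, for illustration only).
func crawlAll(links []string, workers int) int {
	q := &queue{links: links}
	var wg sync.WaitGroup
	var mu sync.Mutex
	crawled := 0
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				link, ok := q.pop()
				if !ok {
					return
				}
				// a real worker would fetch link here and push any
				// newly discovered links back onto the remote queue
				_ = link
				mu.Lock()
				crawled++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return crawled
}

func main() {
	n := crawlAll([]string{"http://rpiai.com/a", "http://rpiai.com/b", "http://rpiai.com/c"}, 4)
	fmt.Println("crawled", n, "links") // crawled 3 links
}
```

Because every worker pops from the same queue, adding workers (or machines) divides the work without any coordination beyond the queue itself.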
Crawl responsibly.
# This repo has been superseded by [schollz/goredis-crawler](https://github.com/schollz/goredis-crawler)
Getting Started
===============
## Install
If you have Go installed, just do
```sh
go get github.com/schollz/linkcrawler/...
go get github.com/schollz/boltdb-server/...
```
Otherwise, use the releases and [download linkcrawler](https://github.com/schollz/linkcrawler/releases/latest) and then [download the boltdb-server](https://github.com/schollz/boltdb-server/releases/latest).
## Run
### Crawl a site
First run the database server which will create a LAN hub:
```sh
$ ./boltdb-server
boltdb-server running on http://X.Y.Z.W:8050
```
Then, to capture all the links on a website:
```sh
$ linkcrawler --server http://X.Y.Z.W:8050 crawl http://rpiai.com
```
Make sure to replace `http://X.Y.Z.W:8050` with the address printed by the boltdb-server.
You can run this last command on as many different machines as you want; each instance will crawl the same website and add the links it collects to the shared queue on the server.
The current state of the crawler is saved. If the crawler is interrupted, you can simply run the command again and it will restart from the last state.
See the help (`-help`) if you'd like to see more options, such as exclusions/inclusions and modifying the worker pool and connection pools.
### Download a site
You can also use *linkcrawler* to download webpages from a newline-delimited list of websites. As before, first start up a boltdb-server. Then you can run:
```sh
$ linkcrawler --server http://X.Y.Z.W:8050 download links.txt
```
Downloads are saved into a folder `downloaded`, with the URL of each link encoded in Base32 and the content compressed using gzip.
### Dump the current list of links
To dump the current database, just use
```sh
$ linkcrawler --server http://X.Y.Z.W:8050 dump http://rpiai.com
Wrote 32 links to NB2HI4B2F4XXE4DJMFUS4Y3PNU======.txt
```
## License
MIT