https://github.com/mrrfv/webarchive
# webArchive

Crawls websites and saves found URLs to a file.

## Usage

Install Node.js and run `npm install` in `./crawler`.

There are two **required** CLI arguments:

- First argument: the domain to crawl
- Second argument: the path to the file where found URLs should be saved

And two **optional** CLI arguments:

- Third argument: connection count limit. Default is `15`.
- Fourth argument: redirect count limit. Default is `15`.

For example, if you want to crawl `example.com` and save found URLs to `./test.txt`, run the following command:

```bash
node ./index.js example.com test.txt
```
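The argument contract above can be sketched in Node.js as follows. This is a minimal illustration, not the project's actual code; `parseArgs` is a hypothetical helper name:

```javascript
// Hypothetical sketch of the CLI contract described above --
// not the project's actual implementation.
function parseArgs(argv) {
  const [domain, outFile, connections = "15", redirects = "15"] = argv;
  if (!domain || !outFile) {
    throw new Error(
      "Usage: node ./index.js <domain> <output-file> [connections] [redirects]"
    );
  }
  return {
    domain,                           // domain to crawl (required)
    outFile,                          // file where found URLs are saved (required)
    connections: Number(connections), // connection count limit, default 15
    redirects: Number(redirects),     // redirect count limit, default 15
  };
}
```

With only the two required arguments, the connection and redirect limits fall back to their defaults of `15`.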

## Download websites in WARC format after a crawl

Use Wget, replacing `CHANGE_THIS` with the path to the URL file produced by the crawl: `wget --input-file=CHANGE_THIS --warc-file="warc" --force-directories --tries=10`
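The two steps can be combined like this (a sketch; `urls.txt` is an assumed output filename for the example):

```bash
# Crawl example.com and write the discovered URLs to urls.txt
# (urls.txt is an assumed filename for this example).
node ./index.js example.com urls.txt

# Feed that URL list to Wget and record the responses in a WARC file.
wget --input-file=urls.txt --warc-file="warc" --force-directories --tries=10
```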