Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mrrfv/webarchive
Crawls websites and saves found URLs to a file.
- Host: GitHub
- URL: https://github.com/mrrfv/webarchive
- Owner: mrrfv
- License: gpl-3.0
- Created: 2022-01-21T15:17:45.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2024-02-05T17:16:25.000Z (11 months ago)
- Last Synced: 2024-10-11T02:28:03.372Z (3 months ago)
- Topics: archive, archiveteam, archiving, crawler, crawling, ia, internet-archive, scraper, web-archiving, web-scraping
- Language: JavaScript
- Homepage:
- Size: 18.6 KB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# webArchive
Crawls websites and saves found URLs to a file.
## Usage
Install Node.js and run `npm install` in `./crawler`.
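In shell terms, that one-time setup looks like this:
```bash
# One-time setup: fetch the crawler's dependencies
# (assumes Node.js and npm are already installed)
cd ./crawler
npm install
```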
There are two **required** CLI arguments:
- First argument: the domain to crawl
- Second argument: path to the file where the found URLs should be saved

There are also two **optional** CLI arguments:
- Third argument: connection count limit. Default is `15`.
- Fourth argument: redirect count limit. Default is `15`.

For example, if you want to crawl `example.com` and save found URLs to `./test.txt`, run the following command:
```bash
node ./index.js example.com test.txt
```
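The two optional arguments are simply appended in order. A minimal sketch, using illustrative limits of 10 connections and 5 redirects:
```bash
# domain, output file, connection limit, redirect limit
# (10 and 5 are illustrative values, not recommendations)
node ./index.js example.com test.txt 10 5
```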
## Download websites in WARC format after a crawl
Use Wget, replacing `CHANGE_THIS` with the path to the file of URLs produced by the crawl: `wget --input-file=CHANGE_THIS --warc-file="warc" --force-directories --tries=10`
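Continuing the example above, where the crawl saved its URL list to `test.txt`, the download step might look like this (`warc` is the filename prefix Wget uses for the WARC output):
```bash
# Fetch every URL listed in test.txt, mirroring the remote directory
# structure on disk, retrying each URL up to 10 times, and recording
# the whole session to a WARC file with the prefix "warc".
wget --input-file=test.txt --warc-file="warc" --force-directories --tries=10
```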