Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mrrfv/webarchive
Crawls websites and saves found URLs to a file.
- Host: GitHub
- URL: https://github.com/mrrfv/webarchive
- Owner: mrrfv
- License: gpl-3.0
- Created: 2022-01-21T15:17:45.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2024-02-05T17:16:25.000Z (11 months ago)
- Last Synced: 2024-10-11T02:28:03.372Z (3 months ago)
- Topics: archive, archiveteam, archiving, crawler, crawling, ia, internet-archive, scraper, web-archiving, web-scraping
- Language: JavaScript
- Homepage:
- Size: 18.6 KB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# webArchive
Crawls websites and saves found URLs to a file.
## Usage
Install Node.js and run `npm install` in `./crawler`.
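In shell terms, that one-time setup looks like this:
```bash
# One-time setup: fetch the crawler's dependencies
# (assumes Node.js and npm are already installed)
cd ./crawler
npm install
```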
There are two **required** CLI arguments:
- First argument: the domain to crawl
- Second argument: path to the file where the found URLs should be saved

There are also two **optional** CLI arguments:
- Third argument: connection count limit. Default is `15`.
- Fourth argument: redirect count limit. Default is `15`.

For example, if you want to crawl `example.com` and save found URLs to `./test.txt`, run the following command:
```bash
node ./index.js example.com test.txt
```
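The two optional arguments are simply appended in order. A minimal sketch, using illustrative limits of 10 connections and 5 redirects:
```bash
# domain, output file, connection limit, redirect limit
# (10 and 5 are illustrative values, not recommendations)
node ./index.js example.com test.txt 10 5
```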
## Download websites in WARC format after a crawl
Use Wget, replacing `CHANGE_THIS` with the path to the file of URLs produced by the crawl: `wget --input-file=CHANGE_THIS --warc-file="warc" --force-directories --tries=10`
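Continuing the example above, where the crawl saved its URL list to `test.txt`, the download step might look like this (`warc` is the filename prefix Wget uses for the WARC output):
```bash
# Fetch every URL listed in test.txt, mirroring the remote directory
# structure on disk, retrying each URL up to 10 times, and recording
# the whole session to a WARC file with the prefix "warc".
wget --input-file=test.txt --warc-file="warc" --force-directories --tries=10
```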