Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/alicewriteswrongs/website-backup-helper
A little node.js script that helps you mirror websites
- Host: GitHub
- URL: https://github.com/alicewriteswrongs/website-backup-helper
- Owner: alicewriteswrongs
- Created: 2017-08-25T13:22:43.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-09-21T14:48:05.000Z (over 7 years ago)
- Last Synced: 2024-10-09T01:42:04.078Z (4 months ago)
- Language: JavaScript
- Size: 22.5 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Website scraper and archiver
This is a little node.js script which uses `wget` to scrape a website and
back it up to S3 (although the backup part isn't implemented yet). It is
designed to be run unattended as a cron job (or similar), using wget's
ability to fetch only new content to save incremental backups.

## Usage
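As an illustration of the cron setup described above (the paths and entry-point filename are hypothetical, since the README doesn't name them), a crontab entry might look like:

```
# Mirror all configured sites every night at 02:00.
# cd first so the script finds backup-manifest.json in its working directory.
0 2 * * * cd /home/me/backup-workdir && node /home/me/website-backup-helper/index.js
```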
This script is meant to be run unattended, so it doesn't have a CLI.
Instead, it expects a `backup-manifest.json` file in the directory from
which it is executed. The manifest looks like this:

```json
{
"websites": [
{
"url": "en.wikipedia.org",
"dirname": "wikipedia"
}
],
"backup_dir": "~/my_huge_backup_directory"
}
```

Note: please don't actually try to scrape Wikipedia. The `websites` array
lists all the sites you'd like to back up, and `backup_dir` is an optional
location where the backups are saved; if you don't specify one, `~/backups`
is used instead.

Note that you need to have `wget` installed. I have only tested this with
very recent versions of node.js; if it's not working, get
[nvm](https://github.com/creationix/nvm) and install the most recent
version.