Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/alicewriteswrongs/website-backup-helper

A little node.js script that helps you mirror websites


README

# Website scraper and archiver

This is a little node.js script that uses `wget` to scrape a website and
back it up to S3 (although the S3 upload part isn't implemented yet). It
is designed to be run continuously as a cron job (or similar), using
`wget`'s ability to fetch only content newer than the local copy, so that
each run saves an incremental backup.
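
For example, a crontab entry for running the script nightly might look like
this (the paths and the `index.js` entry point are illustrative, not part of
this repository's documentation):

```
# Run the backup nightly at 02:30, from the directory
# that contains backup-manifest.json
30 2 * * * cd /home/me/backup-config && node /path/to/index.js
```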

## Usage

This script is meant to be run unattended, so it doesn't have a CLI.
Instead, it expects a `backup-manifest.json` file to be present in the
directory from which it is executed. The file looks like this:

```json
{
  "websites": [
    {
      "url": "en.wikipedia.org",
      "dirname": "wikipedia"
    }
  ],
  "backup_dir": "~/my_huge_backup_directory"
}
```

(Note: please don't actually try to scrape Wikipedia.) The `websites`
array lists all the sites you'd like to back up, and `backup_dir` is an
optional location where the backups are saved. If you don't specify one,
`~/backups` is used instead.

Note that you need to have `wget` installed. I have only tested this with
very recent versions of node.js; if it isn't working, get
[nvm](https://github.com/creationix/nvm) and install whatever the most
recent version is.