Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/alicewriteswrongs/website-backup-helper

A little node.js script that helps you mirror websites


README

# Website scraper and archiver

This is a little node.js script that uses `wget` to scrape a website and
back it up to S3 (although the S3 upload part isn't implemented yet). It
is designed to be run continuously as a cron job (or similar), using
`wget`'s ability to fetch only content newer than the local copy, so that
each run saves an incremental backup.
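
For example, a crontab entry for running the script nightly might look like
this (the paths and the `index.js` entry point are illustrative, not part of
this repository's documentation):

```
# Run the backup nightly at 02:30, from the directory
# that contains backup-manifest.json
30 2 * * * cd /home/me/backup-config && node /path/to/index.js
```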

## Usage

This script is meant to be run unattended, so it doesn't have a CLI.
Instead, it expects a `backup-manifest.json` file to be present in the
directory from which it is executed. The file looks like this:

```json
{
  "websites": [
    {
      "url": "en.wikipedia.org",
      "dirname": "wikipedia"
    }
  ],
  "backup_dir": "~/my_huge_backup_directory"
}
```

(Note: please don't actually try to scrape Wikipedia.) The `websites`
array lists all the sites you'd like to back up, and `backup_dir` is an
optional location where the backups are saved. If you don't specify one,
`~/backups` is used instead.

Note that you need to have `wget` installed. I have only tested this with
very recent versions of node.js; if it isn't working, get
[nvm](https://github.com/creationix/nvm) and install whatever the most
recent version is.