https://github.com/cvyl/cf-static-archive-worker

A serverless website archiving solution built with Cloudflare Workers. This tool crawls and archives static websites, storing all assets (HTML, CSS, JS, images, etc.) in Cloudflare R2 storage.

Host: GitHub
URL: https://github.com/cvyl/cf-static-archive-worker
Owner: cvyl
Created: 2024-12-14T13:35:32.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-02-05T22:21:09.000Z (about 1 year ago)
Last Synced: 2025-03-28T05:18:16.018Z (11 months ago)
Topics: archiver, cloudflare, cloudflare-r2, cloudflare-worker, cloudflare-workers, web-archiving
Language: TypeScript
Homepage: https://backup.cvyl.me/zombo.com/2024-12-14
Size: 32.2 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Website Archiver

A serverless website archiving solution built with Cloudflare Workers. This tool crawls and archives static websites, storing all assets (HTML, CSS, JS, images, etc.) in Cloudflare R2 storage.

## Features

- Archives entire websites including assets and internal pages
- Follows internal links and iframes
- Preserves directory structure
- Handles relative and absolute paths
- Configurable crawl depth
- Simple web interface for archiving and browsing snapshots
- REST API for programmatic access
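
As an illustration of how relative and absolute paths can be told apart from external resources, here is a minimal sketch built on the standard `URL` class. The helper name `resolveLink` and its return shape are hypothetical and not taken from this repository; it assumes that "internal" means "same origin as the page being archived".

```typescript
// Hypothetical helper (not from the repository): resolve an href found on
// `pageUrl` and report whether it stays on the same origin.
function resolveLink(
  pageUrl: string,
  href: string
): { url: string; internal: boolean } | null {
  try {
    // The URL constructor handles relative ("../a.css"), root-relative
    // ("/img/x.png"), and absolute ("https://...") references alike.
    const resolved = new URL(href, pageUrl);

    // Skip non-web schemes such as mailto: or javascript:.
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") {
      return null;
    }

    // Fragments point inside a page, so drop them before comparing URLs.
    resolved.hash = "";

    return {
      url: resolved.href,
      internal: resolved.origin === new URL(pageUrl).origin,
    };
  } catch {
    // Unparseable hrefs are ignored rather than aborting the crawl.
    return null;
  }
}
```

Under this assumption, links with `internal: true` would be candidates for crawling and storage, while the rest would be left pointing at their original hosts.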

## Usage

### Web Interface

Visit the root URL to access the web interface where you can:

- Submit websites for archiving
- Browse archived websites by domain
- View snapshots by date

### API

Archive a website:

```bash
curl -X POST https://your-worker.workers.dev/archive \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "archiveKey": "your-key-here"
  }'
```
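
The same request can be sent from TypeScript. This is a sketch only: the function names `buildArchiveCall` and `archiveSite` are hypothetical, and because the response format is not documented here, the body is returned as plain text rather than parsed.

```typescript
// JSON body fields as shown in the curl example above.
interface ArchivePayload {
  url: string;
  archiveKey: string;
}

// Hypothetical helper: build the endpoint and fetch options for POST /archive.
function buildArchiveCall(workerBase: string, payload: ArchivePayload) {
  return {
    endpoint: new URL("/archive", workerBase).toString(),
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    },
  };
}

// Hypothetical wrapper: send the request and return the raw response body.
async function archiveSite(
  workerBase: string,
  payload: ArchivePayload
): Promise<string> {
  const { endpoint, init } = buildArchiveCall(workerBase, payload);
  const response = await fetch(endpoint, init);
  if (!response.ok) {
    throw new Error(`Archive request failed with status ${response.status}`);
  }
  return response.text();
}
```

A call might look like `await archiveSite("https://your-worker.workers.dev", { url: "https://example.com", archiveKey: "your-key-here" })`.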

### Access Archives

- List snapshots: `https://your-worker.workers.dev/example.com`
- View specific snapshot: `https://your-worker.workers.dev/example.com/YYYY-MM-DD/index.html`
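
The URL patterns above can be assembled programmatically. The helpers below are a hypothetical sketch; they assume the `YYYY-MM-DD` segment is the snapshot date in UTC, which is not stated in this README.

```typescript
// Hypothetical helper: URL that lists all snapshots for a domain.
function snapshotListUrl(workerBase: string, domain: string): string {
  return new URL(`/${domain}`, workerBase).toString();
}

// Hypothetical helper: URL of one archived file within a dated snapshot.
function snapshotUrl(
  workerBase: string,
  domain: string,
  date: Date,
  path: string = "index.html"
): string {
  // toISOString() is always UTC; the first 10 characters are YYYY-MM-DD.
  const day = date.toISOString().slice(0, 10);
  // Strip leading slashes so the path joins cleanly.
  const cleanPath = path.replace(/^\/+/, "");
  return new URL(`/${domain}/${day}/${cleanPath}`, workerBase).toString();
}
```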

## Limitations

- Only works with static websites (HTML, CSS, JS)
- Cannot archive dynamic content (PHP, server-side rendering)
- Does not bypass security measures such as:
  - Cloudflare bot protection
  - CAPTCHA
  - IP-based blocking
- Limited by Cloudflare Workers execution time and memory limits
- External resources (CDN, APIs) remain linked to original sources

## Configuration

Key settings:

- `maxDepth`: Maximum crawl depth for internal links (default: 5)
- `ARCHIVER_KEY`: Authentication key for the API
- `STATIC_URL`: Base URL for archived content
- Worker and R2 bucket configuration in `wrangler.toml`
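
To make the effect of `maxDepth` concrete, here is a minimal sketch of a depth-limited breadth-first crawl. It is not the repository's implementation: `crawlOrder` and `getLinks` are hypothetical names, and the link lookup is synchronous here for clarity, whereas a Worker would fetch and parse each page asynchronously. The sketch assumes the start page is depth 0 and that pages at depth `maxDepth` are visited but not expanded.

```typescript
// Hypothetical sketch: visit pages breadth-first, never following links
// from a page that already sits at maxDepth.
function crawlOrder(
  start: string,
  getLinks: (url: string) => string[],
  maxDepth: number = 5
): string[] {
  const seen = new Set<string>([start]);
  const queue: Array<{ url: string; depth: number }> = [
    { url: start, depth: 0 },
  ];
  const order: string[] = [];

  while (queue.length > 0) {
    const { url, depth } = queue.shift()!;
    order.push(url);

    // Pages at the depth limit are recorded but their links are not followed.
    if (depth >= maxDepth) continue;

    for (const next of getLinks(url)) {
      // The seen set prevents revisiting pages and looping on link cycles.
      if (!seen.has(next)) {
        seen.add(next);
        queue.push({ url: next, depth: depth + 1 });
      }
    }
  }
  return order;
}
```

A lower `maxDepth` bounds the number of pages fetched, which matters given the Workers execution time and memory limits noted under Limitations.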

## Setup

1. Clone the repository
2. Install dependencies:

```bash
pnpm install
```

3. Configure your `wrangler.toml`:

```toml
name = "website-archiver"
workers_dev = true

[vars]
ARCHIVER_KEY = "your-secret-key"
STATIC_URL = "https://your-domain.com"

[[r2_buckets]]
binding = "ARCHIVE_BUCKET"
bucket_name = "your-bucket-name"
preview_bucket_name = "your-bucket-name-preview"
```

4. Deploy:

```bash
pnpm run deploy
```
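
For reference, the bindings declared in the `wrangler.toml` from step 3 would typically surface in Worker code as properties of an `env` object. The sketch below shows one way to type them and to check a submitted `archiveKey`. The `Env` interface and `isAuthorized` function are assumptions for illustration, not code from this repository; in a real project `ARCHIVE_BUCKET` would be typed as `R2Bucket` from `@cloudflare/workers-types`.

```typescript
// Hypothetical typing of the bindings from wrangler.toml.
interface Env {
  ARCHIVER_KEY: string;
  STATIC_URL: string;
  // Typed loosely here to keep the sketch self-contained.
  ARCHIVE_BUCKET: unknown;
}

// Hypothetical check: accept a request only when a key is configured
// and the submitted archiveKey matches it exactly.
function isAuthorized(env: Env, archiveKey: string | undefined): boolean {
  if (!env.ARCHIVER_KEY) return false;
  return archiveKey === env.ARCHIVER_KEY;
}
```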

## License

MIT
