https://github.com/cvyl/cf-static-archive-worker
A serverless website archiving solution built with Cloudflare Workers. This tool crawls and archives static websites, storing all assets (HTML, CSS, JS, images, etc.) in Cloudflare R2 storage.
- Host: GitHub
- URL: https://github.com/cvyl/cf-static-archive-worker
- Owner: cvyl
- Created: 2024-12-14T13:35:32.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-05T22:21:09.000Z (about 1 year ago)
- Last Synced: 2025-03-28T05:18:16.018Z (11 months ago)
- Topics: archiver, cloudflare, cloudflare-r2, cloudflare-worker, cloudflare-workers, web-archiving
- Language: TypeScript
- Homepage: https://backup.cvyl.me/zombo.com/2024-12-14
- Size: 32.2 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Website Archiver
A serverless website archiving solution built with Cloudflare Workers. This tool crawls and archives static websites, storing all assets (HTML, CSS, JS, images, etc.) in Cloudflare R2 storage.
## Features
- Archives entire websites including assets and internal pages
- Follows internal links and iframes
- Preserves directory structure
- Handles relative and absolute paths
- Configurable crawl depth
- Simple web interface for archiving and browsing snapshots
- REST API for programmatic access
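The depth-limited crawl of internal links described above can be modeled as a breadth-first traversal. The sketch below is illustrative only; the function names and the injected `fetchPage` callback are hypothetical, not the repository's actual code:

```typescript
// Simplified model of depth-limited crawling (hypothetical, not the repo's implementation).
// `fetchPage` is an injected callback that returns the internal links found on a page.
type FetchPage = (url: string) => string[];

function crawl(start: string, fetchPage: FetchPage, maxDepth = 5): string[] {
  const visited = new Set<string>([start]);
  let frontier = [start];

  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      for (const link of fetchPage(url)) {
        if (!visited.has(link)) {
          visited.add(link); // archive each page exactly once
          next.push(link);
        }
      }
    }
    frontier = next; // descend one link level per iteration
  }
  return [...visited];
}
```

Deduplicating via the `visited` set keeps the crawl from looping on pages that link back to each other, while the `maxDepth` bound (see Configuration below) caps how far the traversal descends.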
## Usage
### Web Interface
Visit the root URL to access the web interface where you can:
- Submit websites for archiving
- Browse archived websites by domain
- View snapshots by date
### API
Archive a website:
```bash
curl -X POST https://your-worker.workers.dev/archive \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "archiveKey": "your-key-here"
  }'
```
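The same call from TypeScript might look like the following. The payload shape mirrors the curl example above, but the `buildArchiveRequest` helper is hypothetical, not part of the repo's API:

```typescript
// Illustrative TypeScript equivalent of the curl example above.
// `buildArchiveRequest` is a hypothetical helper, not part of the repo's API.
interface ArchivePayload {
  url: string;       // site to archive
  archiveKey: string; // must match the worker's ARCHIVER_KEY
}

function buildArchiveRequest(workerBase: string, payload: ArchivePayload) {
  return {
    url: `${workerBase}/archive`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    },
  };
}

// Usage:
// const { url, init } = buildArchiveRequest("https://your-worker.workers.dev", {
//   url: "https://example.com",
//   archiveKey: "your-key-here",
// });
// const res = await fetch(url, init);
```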
### Access Archives
- List snapshots: `https://your-worker.workers.dev/example.com`
- View specific snapshot: `https://your-worker.workers.dev/example.com/YYYY-MM-DD/index.html`
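Snapshot URLs follow the `domain/date/path` pattern shown above; a small helper to compose them (illustrative, not part of the repo):

```typescript
// Compose a snapshot URL from the pattern shown above (illustrative helper).
function snapshotUrl(
  workerBase: string,
  domain: string,
  date: string,          // YYYY-MM-DD
  path = "index.html",
): string {
  return `${workerBase}/${domain}/${date}/${path}`;
}

// snapshotUrl("https://your-worker.workers.dev", "example.com", "2024-12-14")
//   → "https://your-worker.workers.dev/example.com/2024-12-14/index.html"
```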
## Limitations
- Only works with static websites (HTML, CSS, JS)
- Cannot archive dynamic content (PHP, server-side rendering)
- Does not bypass security measures like:
- Cloudflare bot protection
- CAPTCHA
- IP-based blocking
- Limited by Cloudflare Workers execution time and memory limits
- External resources (CDN, APIs) remain linked to original sources
## Configuration
Key settings:
- `maxDepth`: Maximum crawl depth for internal links (default: 5)
- `ARCHIVER_KEY`: Authentication key for the API
- `STATIC_URL`: Base URL for archived content
- Worker and R2 bucket configuration in `wrangler.toml`
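Inside the Worker, the `[vars]` and R2 bindings from `wrangler.toml` arrive on the `env` object Cloudflare passes to the fetch handler. A hedged sketch of how the key check might look (the `Env` field names match the settings above, but the handler logic is hypothetical):

```typescript
// Illustrative sketch of consuming the [vars] bindings inside a Worker.
// Field names match wrangler.toml; the check itself is hypothetical.
interface Env {
  ARCHIVER_KEY: string;
  STATIC_URL: string;
  // ARCHIVE_BUCKET: R2Bucket;  // R2 binding, available at runtime in Workers
}

function isAuthorized(providedKey: string | undefined, env: Env): boolean {
  return providedKey !== undefined && providedKey === env.ARCHIVER_KEY;
}
```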
## Setup
1. Clone the repository
2. Install dependencies:
```bash
pnpm install
```
3. Configure your `wrangler.toml`:
```toml
name = "website-archiver"
workers_dev = true
[vars]
ARCHIVER_KEY = "your-secret-key"
STATIC_URL = "https://your-domain.com"
[[r2_buckets]]
binding = "ARCHIVE_BUCKET"
bucket_name = "your-bucket-name"
preview_bucket_name = "your-bucket-name-preview"
```
4. Deploy:
```bash
pnpm run deploy
```
## License
MIT