{"id":37147713,"url":"https://github.com/airframesio/data-archiver","last_synced_at":"2026-04-15T22:31:58.506Z","repository":{"id":312004114,"uuid":"1045944200","full_name":"airframesio/data-archiver","owner":"airframesio","description":"A high-performance CLI tool for archiving PostgreSQL partitioned table data to S3-compatible object storage.","archived":false,"fork":false,"pushed_at":"2025-11-29T18:12:54.000Z","size":32062,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-11-30T15:19:18.354Z","etag":null,"topics":["database","database-backup","object-storage","postgres","postgresql","postgresql-database","s3","s3-bucket","s3-storage","tui","tui-app","webapp"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/airframesio.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2025-08-28T00:34:13.000Z","updated_at":"2025-11-29T18:12:57.000Z","dependencies_parsed_at":"2025-08-28T08:11:49.657Z","dependency_job_id":null,"html_url":"https://github.com/airframesio/data-archiver","commit_stats":null,"previous_names":["airframesio/postgresql-archiver","airframesio/data-archiver"],"tags_count":43,"template":false,"template_full_name":null,"purl":"pkg:github/airframesio/data-archiver","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airframesio%2Fdata-archiver","tags_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airframesio%2Fdata-archiver/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airframesio%2Fdata-archiver/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airframesio%2Fdata-archiver/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/airframesio","download_url":"https://codeload.github.com/airframesio/data-archiver/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airframesio%2Fdata-archiver/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28427511,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T16:38:47.836Z","status":"ssl_error","status_checked_at":"2026-01-14T16:34:59.695Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["database","database-backup","object-storage","postgres","postgresql","postgresql-database","s3","s3-bucket","s3-storage","tui","tui-app","webapp"],"created_at":"2026-01-14T17:25:39.575Z","updated_at":"2026-01-14T17:25:40.564Z","avatar_url":"https://github.com/airframesio.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data 
Archiver\n\n[![CI](https://github.com/airframesio/data-archiver/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/airframesio/data-archiver/actions/workflows/ci.yml)\n[![Go Report Card](https://goreportcard.com/badge/github.com/airframesio/data-archiver)](https://goreportcard.com/report/github.com/airframesio/data-archiver)\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)\n[![Go Version](https://img.shields.io/github/go-mod/go-version/airframesio/data-archiver)](go.mod)\n\nA high-performance CLI tool for archiving database data to S3-compatible object storage.\n\n**Currently supports:** PostgreSQL input (partitioned tables, plus non-partitioned tables when `--date-column`, `--start-date`, and `--end-date` are provided) and S3-compatible object storage output.\n\n## Screenshots\n\n### Terminal UI (TUI)\n![Data Archiver TUI Screenshot](screenshot-tui.png)\n\n### Web-based Cache Viewer\n![Data Archiver Web UI Screenshot](screenshot-web.png)\n\n## Features\n\n- 🚀 **Parallel Processing** - Archive multiple partitions concurrently with configurable workers\n- 📊 **Beautiful Progress UI** - Real-time progress tracking with dual progress bars\n- 🌐 **Embedded Cache Viewer** - Beautiful web interface with real-time updates:\n  - **WebSocket Live Updates** - Real-time data streaming without polling\n  - Interactive task monitoring showing current partition and operation\n  - Clickable partition names to jump directly to table row\n  - Shows archiver status (running/idle) with PID tracking\n  - Live statistics: total partitions, sizes, compression ratios\n  - Sortable table with S3 upload status indicators\n  - Smooth animations highlight data changes\n  - Error tracking with timestamps\n  - Auto-reconnecting WebSocket for reliability\n- 💾 **Intelligent Caching** - Advanced caching system for maximum efficiency:\n  - Caches row counts for 24 hours (refreshed daily)\n  - Caches file metadata permanently 
(size, MD5, compression ratio)\n  - Tracks errors with timestamps\n  - Skip extraction/compression entirely when cached metadata matches S3\n- 🔐 **Data Integrity** - Comprehensive file integrity verification:\n  - Size comparison (both compressed and uncompressed)\n  - MD5 hash verification for single-part uploads\n  - Multipart ETag verification for large files (≥100MB)\n  - Automatic multipart upload for files ≥100MB\n- ⚡ **Smart Compression** - Uses Zstandard compression with multi-core support\n- 🔄 **Intelligent Resume** - Three-level skip detection:\n  1. Fast skip using cached metadata (no extraction needed)\n  2. Skip if S3 file matches after local processing\n  3. Re-upload if size or hash differs\n- 🎯 **Flexible Partition Support** - Handles multiple partition naming formats:\n  - `table_YYYYMMDD` (e.g., `messages_20240315`)\n  - `table_pYYYYMMDD` (e.g., `messages_p20240315`)\n  - `table_YYYY_MM` (e.g., `messages_2024_03`)\n\n## 📋 Prerequisites\n\n- Go 1.22 or higher\n- PostgreSQL database with partitioned tables (formats: `tablename_YYYYMMDD`, `tablename_pYYYYMMDD`, or `tablename_YYYY_MM`) **or** non-partitioned tables when you supply `--date-column`, `--start-date`, and `--end-date` so the archiver can build synthetic windows\n- S3-compatible object storage (Hetzner, AWS S3, MinIO, etc.)\n\n## 🔧 Installation\n\n### Homebrew (macOS/Linux)\n\nThe easiest way to install on macOS or Linux:\n\n```bash\nbrew install airframesio/tap/data-archiver\n```\n\n### Pre-built Binaries\n\nDownload the latest release for your platform from the [releases page](https://github.com/airframesio/data-archiver/releases).\n\n### Go Install\n\n```bash\ngo install github.com/airframesio/data-archiver@latest\n```\n\n### Build from Source\n\n```bash\ngit clone https://github.com/airframesio/data-archiver.git\ncd data-archiver\ngo build -o data-archiver\n```\n\n## 🚀 Quick Start\n\n```bash\ndata-archiver \\\n  --db-user myuser \\\n  --db-password mypass \\\n  --db-name mydb \\\n  --table flights \\\n  --s3-endpoint 
https://fsn1.your-objectstorage.com \\\n  --s3-bucket my-archive-bucket \\\n  --s3-access-key YOUR_ACCESS_KEY \\\n  --s3-secret-key YOUR_SECRET_KEY \\\n  --path-template \"archives/{table}/{YYYY}/{MM}\" \\\n  --start-date 2024-01-01 \\\n  --end-date 2024-01-31\n```\n\n**Advanced Example with Custom Output:**\n```bash\ndata-archiver \\\n  --db-user myuser \\\n  --db-password mypass \\\n  --db-name mydb \\\n  --table flights \\\n  --s3-endpoint https://fsn1.your-objectstorage.com \\\n  --s3-bucket my-archive-bucket \\\n  --s3-access-key YOUR_ACCESS_KEY \\\n  --s3-secret-key YOUR_SECRET_KEY \\\n  --path-template \"data/{table}/year={YYYY}/month={MM}\" \\\n  --output-format parquet \\\n  --compression lz4 \\\n  --compression-level 5 \\\n  --output-duration daily \\\n  --start-date 2024-01-01 \\\n  --end-date 2024-01-31\n```\n\n## 🎯 Usage\n\n### Basic Command Structure\n\n```bash\n# Archive data to S3\ndata-archiver archive [flags]\n\n# Dump database using pg_dump to S3\ndata-archiver dump [flags]\n\n# Dump schema once, then emit date-sliced data dumps via pg_dump\ndata-archiver dump-hybrid [flags]\n\n# Restore data from S3\ndata-archiver restore [flags]\n```\n\n### Help Output\n\n```\nData Archiver\n\nA CLI tool to efficiently archive database data to object storage.\nSupports multiple output formats (JSONL/CSV/Parquet), compression types (Zstandard/LZ4/Gzip),\nand flexible path templates for S3-compatible storage.\n\nUsage:\n  data-archiver [flags]\n\nFlags:\n      --viewer                       start embedded cache viewer web server\n      --compression string           compression type: zstd, lz4, gzip, none (default \"zstd\")\n      --compression-level int        compression level (zstd: 1-22, lz4/gzip: 1-9, none: 0) (default 3)\n      --config string                config file (default is $HOME/.data-archiver.yaml)\n      --date-column string           timestamp column name for duration-based splitting (optional)\n      --db-host string               PostgreSQL 
host (default \"localhost\")\n      --db-name string               PostgreSQL database name\n      --db-password string           PostgreSQL password\n      --db-port int                  PostgreSQL port (default 5432)\n      --db-sslmode string            PostgreSQL SSL mode (disable, require, verify-ca, verify-full) (default \"disable\")\n      --db-user string               PostgreSQL user\n  -d, --debug                        enable debug output\n      --dry-run                      perform a dry run without uploading\n      --end-date string              end date (YYYY-MM-DD) (default \"2025-08-27\")\n  -h, --help                         help for data-archiver\n      --output-duration string       output file duration: hourly, daily, weekly, monthly, yearly (default \"daily\")\n      --output-format string         output format: jsonl, csv, parquet (default \"jsonl\")\n      --path-template string         S3 path template with placeholders: {table}, {YYYY}, {MM}, {DD}, {HH} (required)\n      --s3-access-key string         S3 access key\n      --s3-bucket string             S3 bucket name\n      --s3-endpoint string           S3-compatible endpoint URL\n      --s3-region string             S3 region (default \"auto\")\n      --s3-secret-key string         S3 secret key\n      --skip-count                   skip counting rows (faster startup, no progress bars)\n      --start-date string            start date (YYYY-MM-DD)\n      --table string                 base table name (required)\n      --viewer-port int              port for cache viewer web server (default 8080)\n      --workers int                  number of parallel workers (default 4)\n```\n\n### Required Flags\n\n- `--table` - Base table name (without date suffix)\n- `--path-template` - S3 path template with placeholders (e.g., `\"archives/{table}/{YYYY}/{MM}\"`)\n- `--db-user` - PostgreSQL username\n- `--db-name` - PostgreSQL database name\n- `--s3-endpoint` - S3-compatible endpoint URL\n- 
`--s3-bucket` - S3 bucket name\n- `--s3-access-key` - S3 access key\n- `--s3-secret-key` - S3 secret key\n\n### Output Configuration Flags\n\n- `--output-format` - Output file format: `jsonl` (default), `csv`, or `parquet`\n- `--compression` - Compression type: `zstd` (default), `lz4`, `gzip`, or `none`\n- `--compression-level` - Compression level (default: 3)\n  - Zstandard: 1-22 (higher = better compression, slower)\n  - LZ4/Gzip: 1-9 (higher = better compression, slower)\n- `--output-duration` - File duration: `hourly`, `daily` (default), `weekly`, `monthly`, or `yearly`\n- `--date-column` - Timestamp column for duration-based splitting. Required when archiving non-partitioned tables so the archiver can build synthetic windows.\n- `--chunk-size` - Number of rows to process per chunk (default: 10000, range: 100-1000000)\n  - Tune based on average row size for optimal memory usage\n  - Smaller chunks for large rows, larger chunks for small rows\n\n### Working with Non-Partitioned Tables\n\nThe archive command now supports base tables that aren't physically partitioned. 
Provide:\n\n- `--date-column` so rows can be filtered by time\n- `--start-date` and `--end-date` to define the overall range\n- `--output-duration` to control the slice size (daily, weekly, etc.)\n\nWhen no partitions are discovered, the archiver automatically slices the base table into synthetic windows covering the requested range and streams each window through the normal extraction/compression/upload pipeline.\n\n### Hybrid pg_dump workflow\n\nUse `data-archiver dump-hybrid` when you need a schema dump plus partitioned data files generated directly by `pg_dump`.\n\n- Step 1: Dump the parent table schema once (partitions are automatically excluded).\n- Step 2: Discover partitions that match the provided date range and upload grouped dumps using `--path-template` + `--output-duration`.\n- Any dates not covered by physical partitions are still exported by filtering the parent table via temporary staging tables, so mixed partition/non-partition layouts are supported.\n- Ideal for storing schema metadata next to date-windowed `pg_dump` archives without manual SQL.\n- Requires `--date-column` so non-partitioned tables can be filtered via `pg_dump --where`.\n\nExample: dump the `events` table schema plus daily data files for January 2024.\n\n```bash\ndata-archiver dump-hybrid \\\n  --db-host localhost \\\n  --db-port 5432 \\\n  --db-user myuser \\\n  --db-password mypass \\\n  --db-name analytics \\\n  --table events \\\n  --date-column created_at \\\n  --start-date 2024-01-01 \\\n  --end-date 2024-01-31 \\\n  --s3-endpoint https://s3.example.com \\\n  --s3-bucket pg-dumps \\\n  --s3-access-key YOUR_KEY \\\n  --s3-secret-key YOUR_SECRET \\\n  --path-template \"pgdumps/{table}/{YYYY}/{MM}\" \\\n  --output-duration daily\n```\n\nThe schema is uploaded once (e.g., `pgdumps/events/schema.dump`), then `pg_dump` writes compressed custom-format data for each daily/weekly/monthly grouping such as `pgdumps/events/2024/01/events-2024-01-05.dump`.\n\n## ⚙️ Configuration\n\nThe 
tool supports three configuration methods (in order of precedence):\n\n1. **Command-line flags** (highest priority)\n2. **Environment variables** (prefix: `ARCHIVE_`)\n3. **Configuration file** (lowest priority)\n\n### Environment Variables\n\n```bash\nexport ARCHIVE_DB_HOST=localhost\nexport ARCHIVE_DB_PORT=5432\nexport ARCHIVE_DB_USER=myuser\nexport ARCHIVE_DB_PASSWORD=mypass\nexport ARCHIVE_DB_NAME=mydb\nexport ARCHIVE_DB_SSLMODE=disable\nexport ARCHIVE_S3_ENDPOINT=https://fsn1.your-objectstorage.com\nexport ARCHIVE_S3_BUCKET=my-bucket\nexport ARCHIVE_S3_ACCESS_KEY=your_key\nexport ARCHIVE_S3_SECRET_KEY=your_secret\nexport ARCHIVE_S3_PATH_TEMPLATE=\"archives/{table}/{YYYY}/{MM}\"\nexport ARCHIVE_TABLE=flights\nexport ARCHIVE_OUTPUT_FORMAT=jsonl           # Options: jsonl, csv, parquet\nexport ARCHIVE_COMPRESSION=zstd              # Options: zstd, lz4, gzip, none\nexport ARCHIVE_COMPRESSION_LEVEL=3           # zstd: 1-22, lz4/gzip: 1-9\nexport ARCHIVE_OUTPUT_DURATION=daily         # Options: hourly, daily, weekly, monthly, yearly\nexport ARCHIVE_WORKERS=8\nexport ARCHIVE_CACHE_VIEWER=true\nexport ARCHIVE_VIEWER_PORT=8080\n```\n\n### Configuration File\n\nCreate `~/.data-archiver.yaml`:\n\n```yaml\ndb:\n  host: localhost\n  port: 5432\n  user: myuser\n  password: mypass\n  name: mydb\n  sslmode: disable  # Options: disable, require, verify-ca, verify-full\n\ns3:\n  endpoint: https://fsn1.your-objectstorage.com\n  bucket: my-archive-bucket\n  access_key: your_access_key\n  secret_key: your_secret_key\n  region: auto\n  path_template: \"archives/{table}/{YYYY}/{MM}\"  # S3 path template with placeholders\n\ntable: flights\noutput_format: jsonl          # Options: jsonl, csv, parquet\ncompression: zstd             # Options: zstd, lz4, gzip, none\ncompression_level: 3          # zstd: 1-22, lz4/gzip: 1-9\noutput_duration: daily        # Options: hourly, daily, weekly, monthly, yearly\nworkers: 8\nstart_date: \"2024-01-01\"\nend_date: \"2024-12-31\"\ncache_viewer: 
false           # Enable embedded cache viewer\nviewer_port: 8080             # Port for cache viewer web server\n```\n\n## 📁 Output Structure\n\nFiles are organized in S3 based on your configured `--path-template`. The tool supports flexible path templates with the following placeholders:\n\n- `{table}` - Table name\n- `{YYYY}` - 4-digit year\n- `{MM}` - 2-digit month\n- `{DD}` - 2-digit day\n- `{HH}` - 2-digit hour (for hourly duration)\n\n**Example with default settings** (`--path-template \"archives/{table}/{YYYY}/{MM}\" --output-format jsonl --compression zstd --output-duration daily`):\n\n```\nbucket/\n└── archives/\n    └── flights/\n        └── 2024/\n            └── 01/\n                ├── flights-2024-01-01.jsonl.zst\n                ├── flights-2024-01-02.jsonl.zst\n                └── flights-2024-01-03.jsonl.zst\n```\n\n**Example with Parquet and LZ4** (`--path-template \"data/{table}/year={YYYY}/month={MM}\" --output-format parquet --compression lz4`):\n\n```\nbucket/\n└── data/\n    └── flights/\n        └── year=2024/\n            └── month=01/\n                ├── flights-2024-01-01.parquet.lz4\n                ├── flights-2024-01-02.parquet.lz4\n                └── flights-2024-01-03.parquet.lz4\n```\n\n**Example with uncompressed CSV** (`--path-template \"{table}/{YYYY}\" --output-format csv --compression none --output-duration monthly`):\n\n```\nbucket/\n└── flights/\n    └── 2024/\n        ├── flights-2024-01.csv\n        ├── flights-2024-02.csv\n        └── flights-2024-03.csv\n```\n\n## 🎨 Features in Detail\n\n### Cache Viewer Web Interface\n\nThe archiver includes an embedded web server for monitoring cache and progress:\n\n```bash\n# Start archiver with embedded cache viewer\ndata-archiver --viewer --viewer-port 8080 [other options]\n\n# Or run standalone cache viewer\ndata-archiver viewer --port 8080\n```\n\nFeatures:\n- **WebSocket Real-time Updates**: Live data streaming with automatic reconnection\n- **Interactive Status Panel**:\n  - 
Shows current partition being processed with clickable link\n  - Displays specific operation (e.g., \"Checking if exists\", \"Extracting\", \"Compressing\", \"Uploading\")\n  - Progress bar with completion percentage and partition count\n  - Elapsed time tracking\n- **Visual Change Detection**: Smooth animations highlight updated cells and stats\n- **S3 Upload Status**: Shows which files are uploaded vs only processed locally\n- **Comprehensive Metrics**: Shows both compressed and uncompressed sizes\n- **Compression Ratios**: Visual display of space savings\n- **Error Tracking**: Displays last error and timestamp for failed partitions\n- **Smart Rendering**: No page flashing - only updates changed values\n- **Sortable Columns**: Click any column header to sort (default: partition name)\n- **File Counts**: Shows total partitions, processed, uploaded, and errors\n- **Process Monitoring**: Checks if archiver is currently running via PID\n- **Connection Status**: Visual indicator shows WebSocket connection state\n\nAccess the viewer at `http://localhost:8080` (or your configured port).\n\n#### Technical Details\n\nThe cache viewer uses modern web technologies for optimal performance:\n- **WebSocket Protocol**: Bi-directional communication for instant updates\n- **Automatic Reconnection**: Reconnects every 2 seconds if connection drops\n- **Event-Driven File Monitoring**: Uses fsnotify for instant file change detection\n- **Efficient Updates**: Only transmits and renders changed data\n- **No Polling Overhead**: WebSocket eliminates the need for HTTP polling\n\n### Interactive Progress Display\n\nThe tool features a beautiful terminal UI with:\n- **Per-partition progress bar**: Shows real-time progress for data extraction, compression, and upload\n- **Overall progress bar**: Tracks completion across all partitions\n- **Live statistics**: Displays elapsed time, estimated remaining time, and recent completions\n- **Row counter**: Shows progress through large tables during 
extraction\n\n### Partition Discovery\n\nThe tool automatically discovers partitions matching these naming patterns:\n\n1. **Daily partitions (standard)**: `{base_table}_YYYYMMDD`\n   - Example: `flights_20240101`, `flights_20240102`\n\n2. **Daily partitions (with prefix)**: `{base_table}_pYYYYMMDD`\n   - Example: `flights_p20240101`, `flights_p20240102`\n\n3. **Monthly partitions**: `{base_table}_YYYY_MM`\n   - Example: `flights_2024_01`, `flights_2024_02`\n   - Note: Monthly partitions are processed as the first day of the month\n\nFor example, if your base table is `flights`, the tool will find and process all of these:\n- `flights_20240101` (daily)\n- `flights_p20240102` (daily with prefix)\n- `flights_2024_01` (monthly)\n\n### JSONL Format\n\nEach row from the partition is exported as a single JSON object on its own line:\n\n```json\n{\"id\":1,\"flight_number\":\"AA123\",\"departure\":\"2024-01-01T10:00:00Z\"}\n{\"id\":2,\"flight_number\":\"UA456\",\"departure\":\"2024-01-01T11:00:00Z\"}\n```\n\n### Compression\n\nUses Facebook's Zstandard compression with:\n- Multi-core parallel compression\n- \"Better Compression\" preset for optimal size/speed balance\n- Typically achieves 5-10x compression ratios on JSON data\n\n### Skip Logic\n\nFiles are skipped if:\n- They already exist in S3 with the same path\n- The file size matches (prevents re-uploading identical data)\n- The MD5 hash (or multipart ETag for files ≥100MB) matches the S3 ETag\n\n## 🐛 Debugging\n\nEnable debug mode for detailed output:\n\n```bash\ndata-archiver --debug --table flights ...\n```\n\nDebug mode shows:\n- Database connection details\n- Discovered partitions and row counts\n- Extraction progress (every 10,000 rows)\n- Compression ratios\n- Upload destinations\n- Detailed error messages\n\n## 🏃 Dry Run Mode\n\nTest your configuration without uploading:\n\n```bash\ndata-archiver --dry-run --table flights ...\n```\n\nThis will:\n- Connect to the database\n- Discover partitions\n- Extract and compress data\n- Calculate file sizes and MD5 hashes\n- Skip the actual 
upload\n\n## 💾 Caching System\n\nThe archiver uses an intelligent two-tier caching system to maximize performance:\n\n### Row Count Cache\n- Caches partition row counts for 24 hours\n- Speeds up progress bar initialization\n- Always recounts today's partition for accuracy\n- Cache files live under `~/.data-archiver/cache/` and are namespaced by the subcommand plus the absolute S3 destination (for example `archive_events_a1b2c3d4_metadata.json`)\n\n### File Metadata Cache\n- Caches compressed/uncompressed sizes, MD5 hash, and S3 upload status\n- Tracks whether files have been successfully uploaded to S3\n- Enables fast skipping without extraction/compression on subsequent runs\n- Validates against S3 metadata before skipping\n- Preserves all metadata when updating row counts\n- Stores error messages with timestamps for failed uploads\n- File metadata is kept permanently (only row counts expire after 24 hours)\n- Applies to `archive`, `dump` (when using `--start-date/--end-date` and `--output-duration`), and the date-windowed `dump-hybrid` step so reruns skip windows that already exist in S3\n\n### Cache Efficiency\nOn subsequent runs with cached metadata:\n1. Check cached size/MD5 against S3 (milliseconds)\n2. Skip extraction and compression if match found\n3. 
Result: 100-1000x faster for already-processed partitions\n\n## 📊 Process Monitoring\n\nThe archiver provides real-time monitoring capabilities:\n\n### PID Tracking\n- Creates PID file at `~/.data-archiver/archiver.pid` when running\n- Allows external tools to check if archiver is active\n- Automatically cleaned up on exit\n\n### Task Progress File\n- Writes current task details to `~/.data-archiver/current_task.json`\n- Includes:\n  - Current operation (connecting, counting, extracting, uploading)\n  - Progress percentage\n  - Total and completed partitions\n  - Start time and last update time\n- Updated in real-time during processing\n\n### Web API Endpoints\nThe cache viewer provides REST API and WebSocket endpoints:\n- `/api/cache` - Returns all cached metadata (REST)\n- `/api/status` - Returns archiver running status and current task (REST)\n- `/ws` - WebSocket endpoint for real-time updates\n  - Sends cache updates when files change\n  - Streams status updates during archiving\n  - Automatic reconnection support\n\n## 🔐 Data Integrity Verification\n\nThe archiver ensures data integrity through multiple verification methods:\n\n### Single-Part Uploads (files \u003c100MB)\n- Calculates MD5 hash of compressed data\n- Compares with S3 ETag (which is MD5 for single-part uploads)\n- Only skips if both size and MD5 match exactly\n\n### Multipart Uploads (files ≥100MB)\n- Automatically uses multipart upload for large files\n- Calculates multipart ETag using S3's algorithm\n- Verifies size and multipart ETag match before skipping\n\n### Verification Process\n1. **First Run**: Extract → Compress → Calculate MD5 → Upload → Cache metadata\n2. **Subsequent Runs with Cache**: Check cache → Compare with S3 → Skip if match\n3. 
**Subsequent Runs without Cache**: Extract → Compress → Calculate MD5 → Compare with S3 → Skip or upload\n\n## 🔍 Examples\n\n### Archive Last 30 Days\n\n```bash\ndata-archiver \\\n  --table events \\\n  --start-date $(date -d '30 days ago' +%Y-%m-%d) \\\n  --config ~/.archive-config.yaml\n```\n\n### Archive Specific Month with Debug\n\n```bash\ndata-archiver \\\n  --table transactions \\\n  --start-date 2024-06-01 \\\n  --end-date 2024-06-30 \\\n  --debug \\\n  --workers 8\n```\n\n### Dry Run with Custom Config\n\n```bash\ndata-archiver \\\n  --config production.yaml \\\n  --table orders \\\n  --dry-run \\\n  --debug\n```\n\n## 💾 Dump Command\n\nThe `dump` subcommand uses PostgreSQL's `pg_dump` utility to create database dumps with custom format and heavy compression, streaming directly to S3.\n\n### Dump Basic Usage\n\n```bash\ndata-archiver dump \\\n  --db-user myuser \\\n  --db-password mypass \\\n  --db-name mydb \\\n  --s3-endpoint https://fsn1.your-objectstorage.com \\\n  --s3-bucket my-archive-bucket \\\n  --s3-access-key YOUR_ACCESS_KEY \\\n  --s3-secret-key YOUR_SECRET_KEY \\\n  --path-template \"dumps/{table}/{YYYY}/{MM}\"\n```\n\n### Dump with Schema Only\n\n```bash\ndata-archiver dump \\\n  --db-user myuser \\\n  --db-password mypass \\\n  --db-name mydb \\\n  --dump-mode schema-only \\\n  --s3-endpoint https://fsn1.your-objectstorage.com \\\n  --s3-bucket my-archive-bucket \\\n  --s3-access-key YOUR_ACCESS_KEY \\\n  --s3-secret-key YOUR_SECRET_KEY \\\n  --path-template \"dumps/{table}/{YYYY}/{MM}\"\n```\n\n### Dump with Data Only\n\n```bash\ndata-archiver dump \\\n  --db-user myuser \\\n  --db-password mypass \\\n  --db-name mydb \\\n  --dump-mode data-only \\\n  --workers 8 \\\n  --s3-endpoint https://fsn1.your-objectstorage.com \\\n  --s3-bucket my-archive-bucket \\\n  --s3-access-key YOUR_ACCESS_KEY \\\n  --s3-secret-key YOUR_SECRET_KEY \\\n  --path-template \"dumps/{table}/{YYYY}/{MM}\"\n```\n\n### Dump Specific Table\n\n```bash\ndata-archiver dump 
\\\n  --db-user myuser \\\n  --db-password mypass \\\n  --db-name mydb \\\n  --table flights \\\n  --workers 4 \\\n  --s3-endpoint https://fsn1.your-objectstorage.com \\\n  --s3-bucket my-archive-bucket \\\n  --s3-access-key YOUR_ACCESS_KEY \\\n  --s3-secret-key YOUR_SECRET_KEY \\\n  --path-template \"dumps/{table}/{YYYY}/{MM}\"\n```\n\n### Dump Flags\n\n- `--db-host` - PostgreSQL host (default: localhost)\n- `--db-port` - PostgreSQL port (default: 5432)\n- `--db-user` - PostgreSQL user (required)\n- `--db-password` - PostgreSQL password (required)\n- `--db-name` - PostgreSQL database name (required)\n- `--db-sslmode` - PostgreSQL SSL mode (disable, require, verify-ca, verify-full)\n- `--s3-endpoint` - S3-compatible endpoint URL (required)\n- `--s3-bucket` - S3 bucket name (required)\n- `--s3-access-key` - S3 access key (required)\n- `--s3-secret-key` - S3 secret key (required)\n- `--s3-region` - S3 region (default: auto)\n- `--path-template` - S3 path template with placeholders (required)\n- `--table` - Table name to dump (optional, dumps entire database if not specified)\n- `--workers` - Number of parallel jobs for pg_dump (default: 4)\n- `--dump-mode` - Dump mode: `schema-only`, `data-only`, or `schema-and-data` (default: schema-and-data)\n\n### Dump Features\n\n- **Custom Format**: Uses PostgreSQL's custom format (`-Fc`) which supports parallel dumps and compression\n- **Heavy Compression**: Uses maximum compression level (`-Z 9`) for optimal file size\n- **Parallel Processing**: Honors the `--workers` flag to run multiple parallel jobs\n- **Streaming Upload**: Streams output directly to S3 without creating intermediate files\n- **Flexible Modes**: Supports schema-only, data-only, or both schema and data\n- **Table-Specific**: Can dump individual tables or entire databases\n- **Schema-Only Optimization**: For schema-only dumps:\n  - Automatically discovers and dumps only top-level tables (excludes partitions)\n  - Partitions share the same schema as their 
parent table, so scanning them is unnecessary\n  - Use `--table` flag to dump a specific table's schema\n  - Without `--table`, dumps schemas for all top-level tables\n- **Automatic Naming**: Generates filenames with timestamp and mode suffix (e.g., `flights-schema-20240115-120000.dump`)\n\n### Dump Examples\n\n**Dump entire database:**\n```bash\ndata-archiver dump \\\n  --db-user myuser \\\n  --db-password mypass \\\n  --db-name mydb \\\n  --path-template \"dumps/{table}/{YYYY}/{MM}\"\n```\n\n**Dump specific table with parallel processing:**\n```bash\ndata-archiver dump \\\n  --db-user myuser \\\n  --db-password mypass \\\n  --db-name mydb \\\n  --table flights \\\n  --workers 8 \\\n  --path-template \"dumps/{table}/{YYYY}/{MM}\"\n```\n\n**Dry run dump (validate without uploading):**\n```bash\ndata-archiver dump \\\n  --db-user myuser \\\n  --db-password mypass \\\n  --db-name mydb \\\n  --dry-run \\\n  --path-template \"dumps/{table}/{YYYY}/{MM}\"\n```\n\n## 🔄 Restore Command\n\nThe `restore` subcommand reverses the archive process: downloads files from S3, decompresses them, parses formats (JSONL/CSV/Parquet), and inserts data into PostgreSQL tables with automatic table/partition creation.\n\n### Restore Basic Usage\n\n```bash\ndata-archiver restore \\\n  --table flights \\\n  --path-template \"archives/{table}/{YYYY}/{MM}\" \\\n  --start-date 2024-01-01 \\\n  --end-date 2024-01-31\n```\n\n### Restore with Partitioning\n\n```bash\ndata-archiver restore \\\n  --table flights \\\n  --path-template \"archives/{table}/{YYYY}/{MM}\" \\\n  --table-partition-range daily \\\n  --start-date 2024-01-01 \\\n  --end-date 2024-01-31\n```\n\n### Restore Flags\n\n- `--table` - Base table name (required)\n- `--path-template` - S3 path template matching archive configuration (required)\n- `--start-date` - Start date filter (YYYY-MM-DD, optional)\n- `--end-date` - End date filter (YYYY-MM-DD, optional)\n- `--table-partition-range` - Partition range: `hourly`, `daily`, `monthly`, 
`quarterly`, `yearly` (optional)\n- `--output-format` - Override format detection: `jsonl`, `csv`, `parquet` (optional, auto-detected from file extensions)\n- `--compression` - Override compression detection: `zstd`, `lz4`, `gzip`, `none` (optional, auto-detected from file extensions)\n\n### Restore Features\n\n- **Automatic Format Detection**: Detects format and compression from file extensions (`.jsonl.zst`, `.csv.lz4`, `.parquet.gz`, etc.)\n- **Automatic Table Creation**: Creates tables automatically if they don't exist, inferring schema from data\n- **Partition Support**: Automatically creates partitions based on `--table-partition-range`:\n  - `hourly`: Creates partitions like `table_2024010115`\n  - `daily`: Creates partitions like `table_20240101`\n  - `monthly`: Creates partitions like `table_202401`\n  - `quarterly`: Creates partitions like `table_2024Q1`\n  - `yearly`: Creates partitions like `table_2024`\n- **Conflict Handling**: Uses `ON CONFLICT DO NOTHING` to skip existing rows\n- **Date Range Filtering**: Only restores files matching the specified date range\n- **Sequential Processing**: Processes files one at a time (parallel support may be added later)\n\n### Restore Examples\n\n**Restore all files for a table:**\n```bash\ndata-archiver restore \\\n  --table flights \\\n  --path-template \"archives/{table}/{YYYY}/{MM}\"\n```\n\n**Restore specific date range with daily partitions:**\n```bash\ndata-archiver restore \\\n  --table flights \\\n  --path-template \"archives/{table}/{YYYY}/{MM}\" \\\n  --table-partition-range daily \\\n  --start-date 2024-01-01 \\\n  --end-date 2024-01-31\n```\n\n**Restore with format override:**\n```bash\ndata-archiver restore \\\n  --table flights \\\n  --path-template \"archives/{table}/{YYYY}/{MM}\" \\\n  --output-format parquet \\\n  --compression zstd\n```\n\n**Dry run restore (validate without inserting):**\n```bash\ndata-archiver restore \\\n  --table flights \\\n  --path-template \"archives/{table}/{YYYY}/{MM}\" 
\\\n  --dry-run\n```\n\n## 🚨 Error Handling\n\nThe tool provides detailed error messages for common issues:\n\n- **Database Connection**: Checks connectivity before processing\n- **Partition Discovery**: Reports invalid partition formats\n- **Data Extraction**: Handles large datasets with streaming\n- **Compression**: Reports compression failures and ratios\n- **S3 Upload**: Retries on transient failures\n- **Configuration**: Validates all required parameters\n\n## 📊 Performance Tips\n\n1. **Increase Workers**: Use `--workers` to process more partitions in parallel\n2. **Network**: Ensure good bandwidth to the S3 endpoint\n3. **Database**: Add indexes on date columns for faster queries\n4. **Memory Management**:\n   - The tool uses a streaming architecture with a constant ~150 MB memory footprint\n   - Memory usage is independent of partition size (no OOM on multi-GB partitions)\n   - Tune `--chunk-size` based on the average row size:\n     - Small rows (~1 KB): `--chunk-size 50000` (~50 MB)\n     - Medium rows (~10 KB): `--chunk-size 10000` (~100 MB, default)\n     - Large rows (~100 KB): `--chunk-size 1000` (~100 MB)\n     - Very large rows (1+ MB): `--chunk-size 100` (~100 MB)\n5. **Compression**: zstd compression is multi-threaded, so throughput scales with available CPU cores\n\n## 🧪 Testing\n\nThe project includes a comprehensive test suite covering:\n\n- **Cache Operations**: Row count and file metadata caching, TTL expiration, legacy migration\n- **Configuration Validation**: Required fields, default values, date formats\n- **Process Management**: PID file operations, task tracking, process status checks\n\nRun tests with:\n\n```bash\n# Run all tests\ngo test ./...\n\n# Run with verbose output\ngo test -v ./...\n\n# Run with coverage\ngo test -cover ./...\n\n# Run specific tests\ngo test -run TestPartitionCache ./cmd\n```\n\n## 🐳 Docker Support\n\nBuild and run with Docker:\n\n```bash\n# Build the Docker image\ndocker build -t data-archiver .\n\n# Run with environment variables\ndocker run --rm \\\n  -e ARCHIVE_DB_HOST=host.docker.internal \\\n  -e ARCHIVE_DB_USER=myuser \\\n  -e ARCHIVE_DB_PASSWORD=mypass \\\n  -e ARCHIVE_DB_NAME=mydb \\\n  -e ARCHIVE_S3_ENDPOINT=https://s3.example.com \\\n  -e ARCHIVE_S3_BUCKET=my-bucket \\\n  -e ARCHIVE_S3_ACCESS_KEY=key \\\n  -e ARCHIVE_S3_SECRET_KEY=secret \\\n  -e ARCHIVE_TABLE=events \\\n  data-archiver\n\n# Run with config file\ndocker run --rm \\\n  -v ~/.data-archiver.yaml:/root/.data-archiver.yaml \\\n  data-archiver\n```\n\n## 🔧 Development\n\n### Prerequisites\n\n- Go 1.21+\n- PostgreSQL database for testing\n- S3-compatible storage for testing\n\n### Building from Source\n\n```bash\n# Clone the repository\ngit clone https://github.com/airframesio/data-archiver.git\ncd data-archiver\n\n# Install Go dependencies\ngo mod download\n\n# Install Node.js dependencies (for web asset minification)\nnpm install\n\n# Minify web assets (CSS, JavaScript, HTML)\nnpm run minify\n\n# Build the binary (with minified assets embedded)\ngo build -o data-archiver\n\n# Or use the npm build script, which minifies and builds in one command\nnpm run build\n\n# Run tests\ngo test ./...\n\n# Build for different platforms\nGOOS=linux GOARCH=amd64 go build -o data-archiver-linux-amd64\nGOOS=darwin GOARCH=arm64 go build -o data-archiver-darwin-arm64\nGOOS=windows GOARCH=amd64 go build -o data-archiver.exe\n```\n\n### Web Asset Minification\n\nThe cache viewer web UI uses minified assets in production builds to reduce load times:\n\n- **Original size**: 98,389 bytes (HTML + CSS + JS + design system)\n- **Minified size**: 60,995 bytes (38% reduction)\n\nThe minification process:\n1. Uses `csso-cli` for CSS minification\n2. Uses `terser` for JavaScript minification with mangling and compression\n3. Uses `html-minifier-terser` for HTML minification\n4. Automatically runs in CI/CD before building binaries\n\nTo minify manually:\n```bash\n# Run the minification script\n./scripts/minify.sh\n\n# Or use npm\nnpm run minify\n```\n\nThe minified files are automatically embedded into the Go binary during the build.\n\n### CI/CD\n\nThe project uses GitHub Actions for continuous integration:\n\n- **Test Matrix**: Tests on Go 1.21.x and 1.22.x\n- **Platforms**: Linux, macOS, Windows\n- **Coverage**: Runs tests with coverage reporting\n- **Linting**: Ensures code quality with golangci-lint\n- **Binary Builds**: Creates binaries for multiple platforms\n\n## 🤝 Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n1. Fork the repository\n2. Create your feature branch (`git checkout -b feature/amazing-feature`)\n3. Commit your changes (`git commit -m 'Add some amazing feature'`)\n4. Push to the branch (`git push origin feature/amazing-feature`)\n5. Open a Pull Request\n\n## 📄 License\n\nMIT License - see the LICENSE file for details\n\n## 🙏 Acknowledgments\n\nBuilt with these awesome libraries:\n- [Charmbracelet](https://github.com/charmbracelet) - Beautiful CLI components\n- [Cobra](https://github.com/spf13/cobra) - CLI framework\n- [Viper](https://github.com/spf13/viper) - Configuration management\n- [klauspost/compress](https://github.com/klauspost/compress) - Fast zstd compression\n- [AWS SDK for Go](https://github.com/aws/aws-sdk-go) - S3 integration\n- [Gorilla WebSocket](https://github.com/gorilla/websocket) - WebSocket implementation\n- [fsnotify](https://github.com/fsnotify/fsnotify) - Cross-platform file system notifications\n