https://github.com/sehugg/cupaloy
personal archive tool for replication tracking / obsolescence reporting
- Host: GitHub
- URL: https://github.com/sehugg/cupaloy
- Owner: sehugg
- Created: 2018-09-17T21:21:58.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2019-01-03T20:53:00.000Z (over 7 years ago)
- Last Synced: 2025-03-05T04:17:32.469Z (over 1 year ago)
- Language: Python
- Size: 10.1 MB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.txt
README
GOALS
Cupaloy aids humans in the long-term preservation of their digital data by
targeting three areas:
- Replication
- Change tracking
- Data format analysis
Replication: Since most forms of storage are subject to unforeseen events
leading to data loss, it is best to have more than one copy stored in more
than one location. Cupaloy scans your files in each location and gives you
a simple readout of how many files are up-to-date across all locations. You
can configure thresholds to warn you when files are not sufficiently
replicated, or if too much time has passed between scans.
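The replication check described above can be sketched roughly as follows. This is a minimal illustration and not Cupaloy's actual code: the function names, the `min_replicas` threshold, and the shape of the `snapshots` dictionary are all assumptions made for the example.

```python
from collections import defaultdict

def replica_counts(snapshots):
    """Map each (path, size) pair to the set of locations that hold it.

    `snapshots` (hypothetical shape) maps a location name to a dict of
    {relative_path: size_in_bytes}.
    """
    seen = defaultdict(set)
    for location, files in snapshots.items():
        for path, size in files.items():
            seen[(path, size)].add(location)
    return seen

def under_replicated(snapshots, min_replicas=2):
    """Return sorted paths stored in fewer than `min_replicas` locations."""
    counts = replica_counts(snapshots)
    return sorted({path for (path, _size), locs in counts.items()
                   if len(locs) < min_replicas})

snapshots = {
    "laptop": {"photos/a.jpg": 1024, "notes.txt": 20},
    "usb-drive": {"photos/a.jpg": 1024},
}
print(under_replicated(snapshots))  # ['notes.txt']
```

Matching on path and size only is the cheapest comparison; a stricter check would also compare hashes.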
Change tracking: Since not every modification to your file collections is
intentional, Cupaloy helps you understand when files in your collections
have changed. It gives you a simple summary of what has changed and asks
you to confirm these changes. For example, you might configure a collection
so that additions are automatically accepted, but deletions and
modifications need to be confirmed.
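One way to picture the change summary and the confirmation policy is the sketch below. It is illustrative only: `diff_snapshots`, `needs_confirmation`, and the `auto_accept` parameter are hypothetical names, and each scan is assumed to be a plain dict of path to (size, mtime).

```python
def diff_snapshots(old, new):
    """Compare two scans, each a dict of {path: (size, mtime)}."""
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    modified = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
    return {"added": added, "deleted": deleted, "modified": modified}

def needs_confirmation(changes, auto_accept=("added",)):
    """Return only the non-empty change kinds that are not auto-accepted."""
    return {kind: paths for kind, paths in changes.items()
            if paths and kind not in auto_accept}

old = {"a.txt": (10, 100), "b.txt": (20, 100)}
new = {"a.txt": (12, 200), "c.txt": (5, 300)}
changes = diff_snapshots(old, new)
pending = needs_confirmation(changes)
print(pending)  # {'deleted': ['b.txt'], 'modified': ['a.txt']}
```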
Data format analysis: Data formats wax and wane over time, and sometimes are
more complicated than just a file extension -- video formats, for example,
are notoriously complicated. Cupaloy uses helper applications to analyze and
verify the format of files, and to peek inside of archives. The goal is to
predict when file formats start to become obsolete, and notify you so you
can migrate to a different format before it becomes difficult to do so.
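Cupaloy delegates format analysis to helper applications. As a much simpler stand-in, the sketch below identifies a few formats from their leading "magic" bytes. The signature table and the `sniff_format` name are assumptions for illustration, and the table is far from complete.

```python
# Leading byte signatures for a few common formats (illustrative subset).
SIGNATURES = {
    b"PK\x03\x04": "zip",          # also used by ZIP-based containers
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
}

def sniff_format(header):
    """Return a format name if `header` starts with a known signature."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None

print(sniff_format(b"%PDF-1.4 ..."))  # pdf
print(sniff_format(b"plain text"))    # None
```

In practice the header would be the first few bytes read from the file, and the result could be compared with the file extension to flag mismatches.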
Cupaloy is designed for end users, not professional curators. We try to use
terminology the average user will understand, and sensible defaults wherever
possible. Set-it-and-forget-it is the goal.
-----
cupaloy init ~/archive
cupaloy scan ~/archive
cupaloy list volumes
cupaloy list collections
cupaloy status
3 collections.
1 (Huggs Archive) is online, replicated twice locally and once remotely.
1 (PuzzlingPlans) is online, not replicated.
1 (AncientStuff) is offline, last seen 30 days ago.
NAME           Replicas  Where?        Coverage  Last check
-------------  --------  ------------  --------  ----------
Huggs Archive  2         local/remote  100%      2 days ago
DEFINITIONS
A Collection is a set of files found at one or more Locations, meant to be
backed up/mirrored. A Collection has a UUID, which is either randomly
assigned, directly assigned by user, or derived from a Location. It also
has a name.
A Location is a directory on a filesystem or a URL which refers to a set of
files. The URL can either contain the UUID for a mounted volume or a
network location. Paths are relative to the mounted volume or network
location root. A Location can also be assigned a name which can be used as
an alias.
A Snapshot is a scan of a Collection at a given Location. A Snapshot
contains details about the scan results, scan duration, and scan options.
It also records the last known names of the Collection and Location, and the
host name performing the scan.
A Snapshot can have real and virtual folders. Real files are contained
directly in the filesystem. Virtual files are contained indirectly, inside
archive files or other file containers.
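To illustrate the idea of virtual files, the sketch below lists the members of a ZIP archive as paths nested under the archive's own path. The `virtual_files` helper and the "archive/member" path convention are assumptions for the example, not necessarily how Cupaloy names such entries.

```python
import io
import zipfile

def virtual_files(archive_path, fileobj):
    """Yield (virtual_path, size) for each file stored inside a ZIP archive."""
    with zipfile.ZipFile(fileobj) as zf:
        for info in zf.infolist():
            if not info.is_dir():
                yield f"{archive_path}/{info.filename}", info.file_size

# Build a small archive in memory so the example needs no files on disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("docs/readme.txt", "hello")
    zf.writestr("data.bin", b"\x00" * 16)
buf.seek(0)

listing = list(virtual_files("backup.zip", buf))
print(listing)
# [('backup.zip/docs/readme.txt', 5), ('backup.zip/data.bin', 16)]
```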
Files are located in folders, and have a folder path and filename. Files
have a size and last known modification timestamp. Files may also have a
hash checksum, split into two parts (first 128 and last 384 bits of the
SHA512 digest).
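The two-part checksum can be computed as in the sketch below. The `split_sha512` name is hypothetical. The split itself follows the description above: the first 16 bytes (128 bits) and the remaining 48 bytes (384 bits) of the 64-byte digest.

```python
import hashlib

def split_sha512(data):
    """Return (first 128 bits, last 384 bits) of the SHA-512 digest of `data`."""
    digest = hashlib.sha512(data).digest()  # 64 bytes = 512 bits
    return digest[:16], digest[16:]

head, tail = split_sha512(b"example file contents")
print(len(head) * 8, len(tail) * 8)  # 128 384
```

A short leading part is cheap to store and compare, while the longer remainder is available when a full comparison is wanted.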
A Collection or Location can have one or more tags. Tags can be used to
select collections and/or locations.
STATUS
Check
- file/size match coverage
- hash match coverage
- timestamp match coverage
- last check for each replica
- replica count
- replica mixture
- # errors
- file access bits for online storage
- media type
- file format
Score/identifier for each check
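A rough sketch of how the coverage checks above might be scored is shown here. The `coverage` function and the per-file metadata dicts are assumptions for illustration. Each score is the fraction of reference files whose selected fields match in the replica.

```python
def coverage(reference, replica, fields):
    """Fraction of reference files whose `fields` all match in the replica.

    Both arguments map path -> dict with keys such as "size", "mtime", "hash".
    A field that is missing from the reference entry never counts as a match.
    """
    if not reference:
        return 1.0
    matched = sum(
        1 for path, meta in reference.items()
        if path in replica
        and all(f in meta and meta[f] == replica[path].get(f) for f in fields)
    )
    return matched / len(reference)

reference = {
    "a": {"size": 1, "mtime": 10, "hash": "x"},
    "b": {"size": 2, "mtime": 20, "hash": "y"},
}
replica = {
    "a": {"size": 1, "mtime": 11, "hash": "x"},
    "b": {"size": 2, "mtime": 20, "hash": "z"},
}
print(coverage(reference, replica, ["size"]))   # 1.0
print(coverage(reference, replica, ["mtime"]))  # 0.5
print(coverage(reference, replica, ["hash"]))   # 0.5
```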
NODES
FILES
$ARCHIVE/.cupaloy/collection.json
~/.cupaloy/$NODE/$UUID.db
~/.cupaloy/collections.db
VOLUMES
Identified by UUID
Friendly name
COLLECTIONS
read/write: UUID generated by init
read-only: UUID generated by metadata checksum/volume label/url
friendly name
location
- url
- node/path
TAGS
online/offline vs permanent/transient/removable
hd/ssd/tape/cloud/etc
dynamic/static
location tags? work/home/etc
should inform longevity, risk, etc
SCHEMA
files
folders
hashes
scans
locations
tags
(scannode,uuid,url,volume?)
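The table names above could map onto SQLite roughly as follows. Only the table names and the (scannode, uuid, url, volume) tuple come from these notes. Every other column, key, and type is a guess made for illustration.

```python
import sqlite3

# Hypothetical column layout; only the table names are taken from the notes.
SCHEMA = """
CREATE TABLE locations (id INTEGER PRIMARY KEY, scannode TEXT, uuid TEXT,
                        url TEXT, volume TEXT);
CREATE TABLE scans     (id INTEGER PRIMARY KEY,
                        location_id INTEGER REFERENCES locations(id),
                        started_utc TEXT, duration_s REAL);
CREATE TABLE folders   (id INTEGER PRIMARY KEY, path TEXT UNIQUE);
CREATE TABLE hashes    (id INTEGER PRIMARY KEY, hash_head BLOB, hash_tail BLOB);
CREATE TABLE files     (id INTEGER PRIMARY KEY,
                        scan_id INTEGER REFERENCES scans(id),
                        folder_id INTEGER REFERENCES folders(id),
                        name TEXT, size INTEGER, mtime INTEGER,
                        hash_id INTEGER REFERENCES hashes(id));
CREATE TABLE tags      (location_id INTEGER REFERENCES locations(id), tag TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
print(tables)
# ['files', 'folders', 'hashes', 'locations', 'scans', 'tags']
```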
PROTECTION
obfuscate path, name, size, etc?
- can still parse dir structure
archive metadata can itself be archived
USABILITY
look at file signatures/formats
parse files (archives for instance)
run tests (make tests for instance)
HIERARCHY
Can have workspace containing git repos, match many-to-many collections
BROADCAST
challenge/response
- quickly find relevant collections
- quickly find relevant files
- quickly check checksums
- quickly find differences
ISSUES
timestamp, uuid differ on exfat from osx/linux
need to identify node in location
- better short id for collectionlocation
what if computer name changes?
what if file path changes?
- when does config file have to be checked?
- what if removable drive and files are not at same path?
- what if mount changes?
how to retire a collection/location?
what char. set for archives?
https://docs.python.org/2/library/argparse.html#module-argparse
http://stackoverflow.com/questions/10410180/sqlite-view-across-multiple-databases
what if one location has real files and other in archives?
clear memory db after every op?
filter by collection and file name/metadata
reject dups that are on same media
warn/ignore if file has wrong extension (.zip)
estimate progress from last scan
real/virtual duplicate checking combinations
progress for scanning inside archives
4 GB zip files?
inline include/excludes
prettier/more accurate progress, integrate w/ logging, ansi/vt-aware
identify file characteristics/archive contents from hash code
can only rename directory-based collection
sentinel files for DropBox/other sync tools?
- note that sync tools can mess up across multiple computers...
- parse .dropbox file (same across shares)
filter 'dups' results?
better unicode solution (http://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python)
no UUID available for NTFS volumes (http://stackoverflow.com/questions/17612596/not-getting-uuid-from-diskutil-on-osx)
- special read-only file for disk UUID if not available?
"collection@location" syntax
better includes/excludes file vs dir vs archive
find .cupaloy directories on a backup drive
how to handle file path changes with --archives?
type -> site -> host -> drive -> volume hierarchy?
- save into volumes table
- tell when volumes are on the same drive
fail when no files scanned
have to confirm file modifications and/or deletions, ok if -> archive
rename rewrites site db
case sensitivity for uuids/urls
use collection root or .cupaloy dir for scan
time zone GMT
when file size is too big for SQLite INT
print path when multiple archives on same host
multiple archive names collide? (shouldn't need to --uuid when creating)
s3: read from Glacier
find /home/huggvey/.cupaloy/collections -name '*.db' -exec sqlite3 \{\} ".read upgrade2.sql" \;
WITNESSES
scan file/directory metadata quickly
use other directories? (backup, cloud, syncthing...)
hash metadata for file and directory
sample file data as "proof"