{"id":20418899,"url":"https://github.com/sehugg/cupaloy","last_synced_at":"2025-08-18T10:04:38.108Z","repository":{"id":139691768,"uuid":"149189630","full_name":"sehugg/cupaloy","owner":"sehugg","description":"personal archive tool for replication tracking / obsolescence reporting","archived":false,"fork":false,"pushed_at":"2019-01-03T20:53:00.000Z","size":10598,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-05T04:17:32.469Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sehugg.png","metadata":{"files":{"readme":"readme.txt","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-17T21:21:58.000Z","updated_at":"2020-10-05T09:36:01.000Z","dependencies_parsed_at":null,"dependency_job_id":"b4a422f6-071a-4c22-85b5-27360af6a52c","html_url":"https://github.com/sehugg/cupaloy","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sehugg/cupaloy","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sehugg%2Fcupaloy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sehugg%2Fcupaloy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sehugg%2Fcupaloy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sehugg%2Fcupaloy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sehugg","download_url":"https://codeload.github.com/sehugg/cupaloy/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sehugg%2Fcupaloy/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270975260,"owners_count":24678270,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-18T02:00:08.743Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-15T06:35:15.072Z","updated_at":"2025-08-18T10:04:38.080Z","avatar_url":"https://github.com/sehugg.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\nGOALS\n\nCupaloy aids humans in the long-term preservation of their digital data by\ntargeting three areas:\n\n- Replication\n- Change tracking\n- Data format analysis\n\nReplication: Since most forms of storage are subject to unforeseen events\nleading to data loss, it is best to have more than one copy stored in more\nthan one location.  Cupaloy scans your files in each location and gives you\na simple readout of how many files are up-to-date across all locations.  You\ncan configure thresholds to warn you when files are not sufficiently\nreplicated, or if too much time has passed between scans.\n\nChange tracking: Since all modifications to your file collections may not\nnecessarily be intentional, Cupaloy helps you understand when files in your\ncollections have changed.  It will give you a simple summary of what has\nchanged, and asks you to confirm these changes.  For example, you might\nconfigure a collection so that additions are automatically accepted, but\ndeletions and modifications need to be confirmed.\n\nData format analysis: Data formats wax and wane over time, and sometimes are\nmore complicated than just a file extension -- video formats, for example,\nare notoriously complicated. Cupaloy uses helper applications to analyze and\nverify the format of files, and to peek inside of archives. The goal is to\npredict when file formats start to become obsolete, and notify you so you\ncan migrate to a different format before it becomes difficult to do so.\n\nCupaloy is designed for end users, not professional curators.  We try to use\nterminology the average user will understand, and sensible defaults wherever\npossible.  Set-it-and-forget-it is the goal.\n\n\n-----\n\ncupaloy init ~/archive\ncupaloy scan ~/archive\ncupaloy list volumes\ncupaloy list collections\ncupaloy status\n\n3 collections.\n1 (Huggs Archive) is online, replicated twice locally and once remotely.\n1 (PuzzlingPlans) is online, not replicated.\n1 (AncientStuff) is offline, last seen 30 days ago.\n\n\nNAME\t\tReplicas\tWhere?\t\tCoverage\tLast check\n---\nHuggs Archive\t2\t\tlocal/remote\t100%\t\t2 days ago\n\n\nDEFINITIONS\n\nA Collection is a set of files found at one or more Locations, meant to be\nbacked up/mirrored.  A Collection has a UUID, which is either randomly\nassigned, directly assigned by user, or derived from a Location.  It also\nhas a name.\n\nA Location is a directory on a filesystem or a URL which refers to a set of\nfiles.  The URL can either contain the UUID for a mounted volume or a\nnetwork location.  Paths are relative to the mounted volume or network\nlocation root.  A Location can also be assigned a name which can be used as\nan alias.\n\nA Snapshot is a scan of a Collection at a given Location. A Snapshot\ncontains details about the scan results, scan duration, and scan options. \nIt also records the last known names of the Collection and Location, and the\nhost name performing the scan.\n\nA Snapshot can have real and virtual folders. Real files are directly\ncontained. Virtual files are indirectly contained in archive files or\nother file containers.\n\nFiles are located in folders, and have a folder path and filename. Files\nhave a size and last known modification timestamp.  Files may also have a\nhash checksum, split into two parts (first 128 and last 384 bits of the\nSHA512 digest)\n\nA Collection or Location can have one or more tags. Tags can be used to\nselect collections and/or locations.\n\n\nSTATUS\n\nCheck\n- file/size match coverage\n- hash match coverage\n- timestamp match coverage\n- last check for each replica\n- replica count\n- replica mixture\n- # errors\n- file access bits for online storage\n- media type\n- file format\n\nScore/identifier for each check\n\n\nNODES\n\n\nFILES\n\n$ARCHIVE/.cupaloy/collection.json\n~/.cupaloy/$NODE/$UUID.db\n~/.cupaloy/collections.db\n\n\nVOLUMES\n\nIdentified by UUID\nFriendly name\n\n\nCOLLECTIONS\n\nread/write: UUID generated by init\nread-only: UUID generated by metadata checksum/volume label/url\n\nfriendly name\n\nlocation\n- url\n- node/path\n\n\nTAGS\n\nonline/offline vs permanent/transient/removable\nhd/ssd/tape/cloud/etc\ndynamic/static\nlocation tags? work/home/etc\n\nshould inform longevity, risk, etc\n\n\nSCHEMA\n\nfiles\nfolders\nhashes\nscans\nlocations\ntags\n\n\n(scannode,uuid,url,volume?)\n\n\nPROTECTION\n\nobfuscate path, name, size, etc?\n- can still parse dir structure\narchive metadata can itself be archived\n\n\nUSABILITY\n\nlook at file signatures/formats\nparse files (archives for instance)\nrun tests (make tests for instance)\n\n\nHIERARCHY\n\nCan have workspace containing git repos, match many-to-many collections\n\n\nBROADCAST\n\nchallenge/response\n- quickly find relevant collections\n- quickly find relevant files\n- quickly check checksums\n- quickly find differences\n\n\nISSUES\n\ntimestamp, uuid differ on exfat from osx/linux\nneed to identify node in location\n- better short id for collectionlocation\nwhat if computer name changes?\nwhat if file path changes?\n- when does config file have to be checked?\n- what if removable drive and files are not at same path?\n- what if mount changes?\nhow to retire a collection/location?\nwhat char. set for archives?\nhttps://docs.python.org/2/library/argparse.html#module-argparse\nhttp://stackoverflow.com/questions/10410180/sqlite-view-across-multiple-databases\nwhat if one location has real files and other in archives?\nclear memory db after every op?\nfilter by collection and file name/metadata\nreject dups that are on same media\nwarn/ignore if file has wrong extension (.zip)\nestimate progress from last scan\nreal/virtual duplicate checking combinations\nprogress for scanning inside archives\n4 GB zip files?\ninline include/excludes\nprettier/more accurate progress, integrate w/ logging, ansi/vt-aware\nidentify file characteristics/archive contents from hash code\ncan only rename directory-based collection\nsentinel files for DropBox/other sync tools?\n- note that sync tools can mess up across multiple computers...\n- parse .dropbox file (same across shares)\nfilter 'dups' results?\nbetter unicode solution (http://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python)\nNTFS has no UUID for NTFS volumes (http://stackoverflow.com/questions/17612596/not-getting-uuid-from-diskutil-on-osx)\n- special read-only file for disk UUID if not available?\n\"collection@location\" syntax\nbetter includes/excludes file vs dir vs archive\nfind .cupaloy directories on a backup drive\nhow to handle file path changes with --archives?\ntype -\u003e site -\u003e host -\u003e drive -\u003e volume hierarchy?\n- save into volumes table\n- tell when volumes are on the same drive\nfail when no files scanned\nhave to confirm file modifications and/or deletions, ok if -\u003e archive\nrename rewrites site db \ncase sensitivity for uuids/urls\nuse collection root or .cupaloy dir for scan\ntime zone GMT\nwhen file size is too big for SQLite INT\nprint path when multiple archives on same host\nmultiple archive names collide? (shouldn't need to --uuid when creating)\ns3: read from Glacier\n\nfind /home/huggvey/.cupaloy/collections -name '*.db' -exec sqlite3 \\{\\} \".read upgrade2.sql\" \\;\n\n\nWITNESSES\n\nscan file/directory metadata quickly\nuse other directories? (backup, cloud, syncthing...)\nhash metadata for file and directory\nsample file data as \"proof\"\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsehugg%2Fcupaloy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsehugg%2Fcupaloy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsehugg%2Fcupaloy/lists"}