{"id":22625253,"url":"https://github.com/linas/archeo","last_synced_at":"2025-03-29T03:20:36.975Z","repository":{"id":267085041,"uuid":"900158922","full_name":"linas/archeo","owner":"linas","description":"File Recovery, Integrity and Archive Management","archived":false,"fork":false,"pushed_at":"2025-01-03T04:54:41.000Z","size":282,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-03T13:14:41.807Z","etag":null,"topics":["corruption","data","monitoring","recovery"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/linas.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-08T02:40:45.000Z","updated_at":"2025-01-03T04:54:45.000Z","dependencies_parsed_at":null,"dependency_job_id":"eeee50d0-4c4f-4867-9870-ebf486111f1f","html_url":"https://github.com/linas/archeo","commit_stats":null,"previous_names":["linas/archeo"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linas%2Farcheo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linas%2Farcheo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linas%2Farcheo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linas%2Farcheo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/linas","download_url":"https://codeload.github.com/linas/archeo/tar.gz/ref
s/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246131515,"owners_count":20728334,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["corruption","data","monitoring","recovery"],"created_at":"2024-12-09T00:20:15.681Z","updated_at":"2025-03-29T03:20:36.950Z","avatar_url":"https://github.com/linas.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\nArcheo -- Data Recovery\n=======================\nFinding and repairing lost, corrupted, damaged data. The Archivist's\nFriend.  Data Hoarders Welcome.  Data Archeology. Forensics. Longevity.\nA Unified View of File System Data.\n\nA Humble Start\n--------------\nFifteen years ago, I copied 3376 MP3 files from one computer to another.\nWhere they sat, untouched, all this time. Today, I noticed that 683 of\nthese files differ. I looked at a couple. One was 2146766 bytes long.\nThe other was 2147789 bytes long. A difference of a thousand-something\nbytes, but you know, identical files are supposed to be identical. Not\ndifferent. Maybe one is corrupted? But which one?\n\nThey both load just fine into `audacity`. They both play, just fine,\ntill the end. One had a tiny squeal, lasting a heartbeat. Barely. Both\nfiles have fairly long stretches of zero bytes: file data which is zero,\nfor a bunch of bytes in a row. Is this normal?\n\nThis ain't good. I have maybe 1.5 million files. Not sure, haven't\nfinished counting. Wife \u0026 kids have more. I panicked. My panic is\njustified. 
Out of those 1.5 million files, 1440 of them consist of\nnothing but zeros (they shouldn't; they're photos, tar files...)\n\nA quick search on the net indicates that ... what the heck, everybody\nand their kid brother have this problem. More quick searching indicates\nthat there are lots of mini-tools, dribs \u0026 drabs that repair this and\nthat, usually specialized, maybe command-line, maybe graphical, maybe\nobsolete. There are discussions on reddit, quora and stack exchange.\nThere's even AI-generated hallucinations. But there is no unified data\nrecovery tool. At least for Linux.  At least free, open-source, GPL'ed.\n\nAnd so now I am writing these paragraphs. And I'm thinking of creating\nsoftware to help me with my recovery efforts. And I'd like you to help\nme do this.\n\nGoals\n-----\nProjects must have goals. A scope. A vision. A motivating dream which\ninspires developers and sustains users. So here:\n* Must solve my personal data corruption problem.\n* Must be very easy to use, so I guess graphical, pull-down menu \u0026 all.\n* Must allow plugins and modules for custom repair. There are already\n  tools that fix MP3's, and other tools that fix photos. Use those.\n* Use copies, when available. Figure out if one of the copies is good,\n  and use that. But if there are two broken copies, maybe a single good\n  version can be created by splicing these together.\n* Search my old backups and archives for a good copy.\n* Consolidate all my old archives and copies. They are everywhere,\n  I don't even know what I have, or if it's any good, and it's all taking\n  up disk space. Where is it? What is it? Is it rotten? Is it good?\n* Provide \"some level of\" content integrity assurances.\n* Start small, for home users. Expand to archives, libraries. Support\n  databases and complex data. Allow data forensics and data recovery.\n  Do things that data archivists need. Handle medical data, business data,\n  science data. Scale to exabytes. 
Big-Ten accounting firms offer\n  asset tracking for large accounts. IBM offers information management\n  systems for large corporations. How do they deal with this?\n* Build the foundations for AI.\n\nDon't laugh at that last bullet. Yes, you and I both are tired of the\nAI hype and the rather underwhelming results. But I'm being serious, here.\nIf you'll let me, I want to write a short essay about AI and longevity.\n\nThe Process of Living\n---------------------\nLiving organisms heal themselves. A collection of data should be\nself-healing. Living organisms know things and remember things.\nA collection of data should know what's in it, what it's made out\nof. It should know when some blob of data was last seen online,\nor whether it's now in cold storage.\n\nThere should be ways of exploring, finding, searching, discovering.\nKnowing who you are, by knowing what you remember. Living organisms\nhave eyes for looking, for seeing. An AI/AGI needs sensory organs,\ntoo, but for \"seeing\" collections of data. For finding and exploring\ndata, for living in a world not of tigers hiding in grass and rocks\nfalling from cliffs, but archives of social media posts.\n\nIs this kind of software useful?\n--------------------------------\nEveryone is moving to the cloud anyway. Photos live on cell phones,\nand are automatically synced to Google's cloud. If you run out of\nstorage, you can buy more for $X/month. This is an easy choice for\nmost users: why futz with a desktop computer, or worse, a Linux desktop,\nwhen everything runs on the cloud? Of course, if you don't pay your\nmonthly fee, your data disappears.\n\nPlan B is to buy a NAS storage box, and keep your stuff there. These\nare expensive, but very very easy for most users: plug 'em in and go.\nIs there secret, silent data corruption on a NAS box? Who knows? Are\nyou keeping backup copies? 
\"Who needs backup\", you might think, \"I've\ngot RAID.\" This kind of thinking is fine, until \"operator error\" results\nin a deleted file that maybe you really should not have deleted.\n\nSo the idea here is for the FSF purist: someone who wants fine-grained\nfreedoms to control their data, rather than being beholden to some cloud\nprovider who will wipe your data as soon as you miss a monthly bill or two.\nSomeone who might be willing to use a proprietary NAS box, but would\nlike some way of double-checking.\n\nThis project is for the finicky and technically sophisticated user\n(hobbyist?) who runs a Linux desktop or three at home, and worries about\ntheir data. Perhaps one day, this project will be useful to archivists\nor librarians. Maybe housewives and pensioners with family photo albums\nand genealogy trees that they are safe-guarding. Perhaps even scientists\nprotecting their data, or business owners, who ... ???. But let's not\nget ahead of ourselves. Let's take a closer look.\n\n\nData types\n-----------\nThe issue, the meta-issue is that Google, and many other on-line\nsocial media and networking services (MySpace, anyone?) have a habit\nof shutting down services that are not profitable. If you are lucky,\nyou might get a copy of your data.\n\nMore generally, you can't. When Meta/Facebook ditches you, you do not\nhave the option of downloading all your old posts and photos. Those are\ngone forever. The insult is lack of due process: Facebook is judge, jury\nand hangman. The injury is loss of connection, loss of data.\n\nSome sources are just hard to back up. Chats on Discord. SMS and WhatsApp\nmessages (and photos, videos, sounds). Perhaps valueless for a younger\ncrowd. Perhaps more interesting if it's from grandma, or a deceased loved\none.  There's nothing wrong with building a digital shrine for a lost loved\none. This is what love and cherished memories are about.  Perhaps one\nday, the weight of the past will be too much. 
That is not today.\n\nVersion 0.0.9\n-------------\nBased on a few days of searching the net, I can't find anything even\nvaguely close to what I want. And so, perhaps stupidly, I've started\nwriting a system. I've done this because this is kind of a blocker for\nmy migrating data from here to there, and specifically, from off my RAID\narrays and onto Ceph.\n\nThe current system architecture is minimal, and the implementation\nwas started two weeks ago. A basic filesystem crawler/cataloger has been\nset up, and it works. A web UI has been prototyped.  See the *HOWTO*\nbelow.\n\nThere are two prototypes. The first was written in highly conventional\nSQL plus Python plus Flask for the Web UI. It was easy. It's not\ncomplicated. Any ordinary developer can read and understand this code,\nand hack on it.\n\nThe second is the same as the first, but replaces sqlite3 by the\n[AtomSpace](https://github.com/opencog/atomspace). This was forced by\nthe general systemic shortcomings of SQL: it's just not really the\nappropriate tool for this particular job. The AtomSpace is much faster,\nand much easier to use, and much more flexible. However, it took the\nfirst prototype to (re-)discover this. The design decisions that led\nto this are reviewed in the\n[Similarity README](src/similarity/README.md) file.\n\nThe use of the AtomSpace allows an AI meta-issue to be explored: how\ndoes an AI system \"understand\" what it's dealing with, and how does it\neven \"see\" the data that it's working with? The meta-issue is discussed\nin the [Sensory README](atoms/catalog/README-sensory.md).\n\nSystems Survey\n--------------\nThe [Systems Survey](Systems-Survey.md) is a lame attempt to list and\nreview related systems, or systems that could provide tools, or a\nframework, or otherwise be deployed. 
Anyone out there care to move\nthis page to the project wiki?\n\nQuestions and Ideas\n-------------------\nIdeally, the system envisioned here \"plays nice\" with existing systems.\nPerhaps it's a module on existing systems. Perhaps the implementation\ncan make use of existing frameworks. How would this work?\n\n* What do archivists and digital librarians do today? If they import\n  a new data set, do they scrub it? How do they track multiple copies\n  of what they have? Short answer: no, they do not. If they do, they\n  do not talk about it, and it is not mentioned on their project\n  websites. The assumption is, I assume, that they can trust their\n  storage systems to not screw up.\n\n* How do backup systems keep track of what's where? When the last backup\n  was made? If the backup is corrupted? What are the existing open source\n  backup systems?\n\n* Intrusion detection systems store hashes of files, and detect corruption\n  based on those hashes. If you don't have hashes of your old data, you are\n  SOL. Are there systems or frameworks for tracking file hashes and other\n  file metadata? Can these be used in data archival systems?\n  Examples include Tripwire and FIM (File Integrity Monitor).\n\n* File explorers and (graphical) file browsers... show files. Do any of\n  them provide a framework for tracking data health? A meta-system for\n  tracking backup copies? Some plugin framework?\n\n* Systems like splunk are designed for admins who need to track error logs\n  for hundreds of machines. Can the splunk dashboard and framework be\n  repurposed for tracking archive health?\n\n* Systems like wireshark can do low-level network packet inspection.\n  Wireshark includes a packet disassembly and formatting language. This\n  has been used to create thousands of packet disassemblers, since each\n  byte and bit in a packet can be named and labelled. Could this be used\n  to disassemble and repair MP3 files? Tar files? 
Corrupted git archives?\n\n* Disk drive forensics tools can pull apart corrupted disk images. Is\n  there any kind of generic framework that can be used?\n\n* How does one prevent damage, moving forward, into the future?\n  Clearly, off-the-shelf mdraid+ext4fs plus consumer-grade PC's, disk\n  drives and controllers are inadequate (because that's the setup I used\n  for the last few decades, and now I have data corruption.) Stacked\n  combinations of LVM, Btrfs, XFS are not obviously better. Ceph is a\n  distributed storage system. The very first time I used it, I found data\n  corruption errors. Perhaps Ceph is to blame, perhaps a disk controller\n  is to blame.  Maybe a cosmic ray hit the system during file copy. Ceph\n  is aimed at large clusters, not small users. Fully debugged if you have\n  1000 OSD's on 100 hosts. But not so much if you have 3 OSD's on two hosts.\n  There's no home-user Ceph community. There should be.\n\nDesign requirements\n-------------------\nIn my current modest setup, I need these things:\n\n* Log of which directories were copied from where to where, and when.\n* Directory metadata: how many files? How many bytes?\n* If the original version is still there, does the new copy agree\n  with the old one?\n* Can a compare of new and old be run on some periodic basis?\n  e.g. once a month? Twice a year? What was the result?\n* Checksums. Compute and store checksums. Compare file contents\n  by checksum.  Find files by checksum.\n* Limit checksum collection to specific file types, similar to how\n  locate, mlocate, plocate and updatedb work.\n* When were these last computed? What was the matching file\n  name? What was the file metadata at that time?\n* Allow file validation plugins. e.g. JHOVE, Apache Tika or DROID\n  can be used to determine if a file passes basic integrity checks.\n\nTech selection\n--------------\n* Should be possible for ordinary coders to modify and extend this\n  project. Thus, python seems like a reasonable choice. 
Java seems\n  overkill/awkward, and rust not popular enough. (And rust requires\n  compiling).\n* Data has to be kept somewhere. Ideally, configurable, in some\n  database. Postgres, MariaDB and SQLite all seem viable. The first\n  two feel like overkill. SQLite seems small, simple, easy for now.\n* High performance is not a requirement. High usability is.\n\nTech re-selection\n-----------------\nThe prototype version 0.0.6 was written in a very conventional stack\nof sqlite3 for the SQL db, python for the programming language, and\npython flask for the web ui. This stuff is widely used, widely\nunderstood, and quite easy for ordinary developers to get into and use.\n\nThe prototype consists of a crawler that creates a file catalog, and\na web UI that can walk directories and explore the locations of similar\nfiles. It works just fine.\n\nFor version 0.2, I want to add some rather sophisticated similarity\ndetection tools, described in the [similarity README](src/similarity/README.md).\nWhile designing that, I realized that my sqlite3+python+flask stack is\nnot going to cut it, and that I already have much better tools: namely,\nthe [AtomSpace](https://github.com/opencog/atomspace).\n\nThe problem is that almost no one has heard of the AtomSpace, almost\nno one uses it, and it's a strange weird beast for ordinary programmers.\nHowever, I've also realized that the number of ordinary programmers who\nare going to join this project is approximately zero. So why should I\ncater to them, these people who will never arrive and assist, anyway,\nwhen there is something way more fun and useful to work with? So I'm\nrestarting this project on the OpenCog AtomSpace.\n\nHOWTO (AtomSpace)\n-----------------\nThe current version AtomSpace+python+flask code is in the\n[`atoms`](atoms) directory.  
Refer to the README there for how to\nset up and operate.\n\nThe current implementation \"works\", but has many shortcomings.\nThat is, it \"works\", but is woefully incomplete.\n\nHOWTO (Prototype)\n-----------------\nThe earlier version 0.0.6 sqlite3+python+flask prototype is in the\n[`src`](src) directory. See the README there for HOWTO instructions.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinas%2Farcheo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flinas%2Farcheo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinas%2Farcheo/lists"}