{"id":20425257,"url":"https://github.com/catseye/yastasoti","last_synced_at":"2026-04-19T01:37:42.205Z","repository":{"id":142240032,"uuid":"148782879","full_name":"catseye/yastasoti","owner":"catseye","description":"MIRROR of https://codeberg.org/catseye/yastasoti : Yet another script to archive stuff off teh internets","archived":false,"fork":false,"pushed_at":"2024-05-21T05:30:16.000Z","size":67,"stargazers_count":6,"open_issues_count":2,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-12-09T13:03:38.101Z","etag":null,"topics":["archiver","archiving","backup-files","curation","downloading","no-more-404","save-the-internet"],"latest_commit_sha":null,"homepage":"https://catseye.tc/node/yastasoti","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/catseye.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-14T12:03:54.000Z","updated_at":"2025-07-28T02:02:51.000Z","dependencies_parsed_at":"2025-01-15T15:08:38.308Z","dependency_job_id":"bb31683a-3c61-46f8-8a69-2ac6c67332b4","html_url":"https://github.com/catseye/yastasoti","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/catseye/yastasoti","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/catseye%2Fyastasoti","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/catseye%2Fyastasoti/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/catseye%2Fyastasoti/releases","manifests_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/catseye%2Fyastasoti/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/catseye","download_url":"https://codeload.github.com/catseye/yastasoti/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/catseye%2Fyastasoti/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31991720,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T20:23:30.271Z","status":"ssl_error","status_checked_at":"2026-04-18T20:23:29.375Z","response_time":103,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archiver","archiving","backup-files","curation","downloading","no-more-404","save-the-internet"],"created_at":"2024-11-15T07:12:42.584Z","updated_at":"2026-04-19T01:37:42.158Z","avatar_url":"https://github.com/catseye.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"`yastasoti`\n===========\n\n_Version 0.4_\n| _Entry_ [@ catseye.tc](https://catseye.tc/node/yastasoti)\n| _See also:_ [ellsync](https://codeberg.org/catseye/ellsync#ellsync)\n∘ [tagfarm](https://codeberg.org/catseye/tagfarm#tagfarm)\n∘ [shelf](https://codeberg.org/catseye/shelf#shelf)\n\n- - - -\n\n\u003cimg align=\"right\" src=\"images/yastasoti-logo.png?raw=true\" /\u003e\n\nYet another script to archive stuff off teh internets.\n\nIt's not a 
spider that automatically crawls previously undiscovered webpages — it's intended\nto be run by a human to make backups of resources they have already seen and recorded the URLs of.\n\nIt was split off from [Feedmark][], which doesn't itself need to support this function.\n\n### Features ###\n\n*   input is a JSON list of objects containing links (such as those produced by Feedmark)\n*   output is a JSON list of objects that could not be retrieved, which can be fed back\n    into the script as input\n*   checks links with `HEAD` requests by default.  `--archive-to` causes each link to be\n    fetched with `GET` and saved to the specified directory.  `--archive-via` specifies an\n    _archive router_ which causes each link to be fetched, and saved to a directory\n    which is selected based on the URL of the link.\n*   tries to be idempotent and not create a new local file if the remote file hasn't changed\n*   handles links that are local files; checks if the file exists locally\n*   can log its actions verbosely to a specified logfile\n*   source code is a single, public-domain file with a single dependency (`requests`)\n\n### Examples ###\n\n#### Check all links in a set of Feedmark documents ####\n\n    feedmark --output-links article/*.md | yastasoti --extant-path=article/ - | tee results.json\n\nThis will make only `HEAD` requests to check that the resources exist.\nIt will not fetch them.  
The ones that could not be fetched will appear\nin `results.json`, and you can run yastasoti on that again to re-try:\n\n    yastasoti --extant-path=article/ results.json | tee results2.json\n\n#### Archive stuff off teh internets ####\n\n    cat \u003elinks.json \u003c\u003c EOF\n    [\n        {\n            \"url\": \"http://catseye.tc/\"\n        }\n    ]\n    EOF\n    yastasoti --archive-to=downloads links.json\n\n#### Override the filename the stuff is archived as ####\n\nBy default, the subdirectory and filename to which the stuff is archived are\nbased on the site's domain name and the stuff's path.  The filename, however,\ncan be overridden if the input JSON contains a `dest_filename` field.\n\n    cat \u003elinks.json \u003c\u003c EOF\n    [\n        {\n            \"url\": \"http://catseye.tc/\",\n            \"dest_filename\": \"home_page.html\"\n        }\n    ]\n    EOF\n    yastasoti --archive-to=downloads links.json\n\n#### Categorize archived materials with a router ####\n\nAn archive router (used with `--archive-via`) is a JSON file that looks like this:\n\n    {\n        \"http://catseye.tc/*\": \"/dev/null\",\n        \"https://footu.be/*\": \"footube/\",\n        \"*\": \"archive/\"\n    }\n\nIf a URL matches more than one pattern, the longest pattern will be selected.\nIf the destination is `/dev/null` it will be treated specially: the file will\nnot be retrieved at all.  If no pattern matches, an error will be raised.\n\nTo use an archive router once it has been written:\n\n    yastasoti --archive-via=router.json links.json\n\n### Requirements ###\n\nTested under Python 2.7.12.  Seems to work under Python 3.5.2 as well,\nbut this is not so official.\n\nRequires `requests` Python library to make network requests.  
Tested\nwith `requests` version 2.21.0.\n\nIf `tqdm` Python library is installed, will display a nice progress bar.\n\n(Or, if you would like to use Docker, you can pull a Docker image from\n[catseye/yastasoti on Docker Hub](https://hub.docker.com/r/catseye/yastasoti),\nfollowing the instructions given on that page.)\n\n### TODO ###\n\n*   Archive youtube links with youtube-dl.\n*   Handle failures (redirects, etc) better (detect 503 / \"connection refused\" better.)\n*   Allow use of an external tool like `wget` or `curl` to do fetching.\n\n[Feedmark]: http://catseye.tc/node/Feedmark\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcatseye%2Fyastasoti","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcatseye%2Fyastasoti","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcatseye%2Fyastasoti/lists"}