{"id":13875,"url":"https://github.com/simon987/awesome-datahoarding","name":"awesome-datahoarding","description":"List of data-hoarding related tools","projects_count":125,"last_synced_at":"2026-06-14T05:00:31.136Z","repository":{"id":39619851,"uuid":"153196527","full_name":"simon987/awesome-datahoarding","owner":"simon987","description":"List of data-hoarding related tools","archived":false,"fork":false,"pushed_at":"2023-09-14T08:32:01.000Z","size":122,"stargazers_count":1310,"open_issues_count":7,"forks_count":85,"subscribers_count":50,"default_branch":"master","last_synced_at":"2026-05-28T14:03:31.747Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/simon987.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2018-10-15T23:55:04.000Z","updated_at":"2026-05-26T19:38:14.000Z","dependencies_parsed_at":"2024-01-13T01:34:03.735Z","dependency_job_id":"192515ba-947f-4d74-802f-928e609f4c3d","html_url":"https://github.com/simon987/awesome-datahoarding","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/simon987/awesome-datahoarding","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simon987%2Fawesome-datahoarding","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simon987%2Fawesome-datahoarding/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simon987%2Fawesome-datahoarding/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simon987%2F
awesome-datahoarding/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/simon987","download_url":"https://codeload.github.com/simon987/awesome-datahoarding/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simon987%2Fawesome-datahoarding/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34309655,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-14T02:00:07.365Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"created_at":"2024-01-12T20:23:51.554Z","updated_at":"2026-06-14T05:00:31.136Z","primary_language":null,"list_of_lists":false,"displayable":true,"categories":["Data recovery","Local Media","Download utilities","Content sharing","Compression","Data curation","Backup","APIs \u0026 Online tools","Hardware / Monitoring","Network","File systems","Utility Scripts","Long-term data archiving","File conversion"],"sub_categories":["Download automation","General","Web Archiving","Application-specific"],"readme":"# Awesome-DataHoarding\n\n[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)\n\nNote: This is only a first draft/brainstorm. 
I will try to organize the list with more useful sections in the future    \nFeel free to contribute!\n\n* [Download utilities](#download-utilities)\n  * [Web Archiving](#web-archiving)\n  * [General](#general)\n  * [Application-specific](#application-specific)\n  * [Download automation](#download-automation)\n* [Backup](#backup)\n* [Compression](#compression)\n* [Network](#network)\n* [File systems](#file-systems)\n* [File conversion](#file-conversion)\n* [Utility Scripts](#utility-scripts)\n* [Content sharing](#content-sharing)\n* [Data curation](#data-curation)\n* [APIs \u0026 Online tools](#apis--online-tools)\n* [Hardware / Monitoring](#hardware--monitoring)\n* [Data recovery](#data-recovery)\n* [Local Media](#local-media)\n* [Long-term data archiving](#long-term-data-archiving)\n\n## Download utilities\n\n**[`^        back to top        ^`](#)**\n\n### Web Archiving\n* [ArchiveBox](https://github.com/pirate/ArchiveBox): The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...\n* [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler): Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container  \n* [Collect](https://github.com/xarantolus/Collect): A server to collect \u0026 archive websites that also supports video downloads\n* [grab-site](https://github.com/ludios/grab-site): The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns\n* [Heritrix](https://github.com/internetarchive/heritrix3): Extensible, web-scale, archival-quality web crawler\n* [HTTrack](https://www.httrack.com/): Download a website from the Internet to a local directory\n* [wail](https://github.com/machawk1/wail): Web Archiving Integration Layer: One-Click User Instigated Preservation\n* 
[webrecorder](https://github.com/webrecorder/webrecorder): An integrated platform for creating high-fidelity, ISO-compliant web archives in a user-friendly interface, providing access to archived content, and sharing collections\n* [wikiteam](https://github.com/WikiTeam/wikiteam): set of tools for archiving wikis\n\n### General\n* [annie](https://github.com/iawia002/annie): YouTube-DL alternative written in Golang\n* [aria2](https://aria2.github.io/): A lightweight multi-protocol \u0026 multi-source command-line download utility\n* [CrowLeer](https://github.com/ERap320/CrowLeer): Powerful C++ web crawler based on libcurl\n* [curl](https://github.com/curl/curl): Tool and library for transferring data with URL syntax, supporting many protocols\n* [Horahora](https://github.com/horahoradev/horahora): Video hosting website and video archival manager for Niconico, Bilibili, and YouTube\n* [httpie](https://httpie.org/): A tool similar to curl and wget but designed to be user-friendly; useful for web scraping with shell scripts, but be aware that you're adding a dependency by doing so.\n* [news-crawl](https://github.com/commoncrawl/news-crawl): Crawler for news feeds based on StormCrawler that produces WARC files\n* [Plowshare](https://github.com/mcrapet/plowshare): Command-line tool to manage file-sharing sites\n* [Rclone](https://github.com/ncw/rclone): A command line program to sync files and directories to and from various cloud storage providers\n* [rsync](https://rsync.samba.org/): An open source utility that provides fast incremental file transfer\n* [Suck-It](https://github.com/skallwar/suckit): Recursively visit and download a website's content to your disk (multi-threaded)\n* [wget](https://savannah.gnu.org/git/?group=wget): Utility for non-interactive download of files from the Web\n* [wget2](https://gitlab.com/gnuwget/wget2): Successor of GNU Wget, works multi-threaded\n* [wpull](https://github.com/ArchiveTeam/wpull): Wget-compatible web downloader and crawler\n* 
[you-get](https://github.com/soimort/you-get): Dumb downloader that scrapes the web\n* [ytdl-sub](https://github.com/jmbannon/ytdl-sub): Automate downloading and metadata generation with YouTubeDL\n* [yt-dlp](https://github.com/yt-dlp/yt-dlp): A fork of youtube-dl that behaves better\n\n### Application-specific\n* [BBCSoundDownloader](https://github.com/FThompson/BBCSoundDownloader): Bulk downloader for BBC's Sound Effects library http://bbcsfx.acropolis.org.uk/\n* [ChanThreadWatch](https://github.com/SuperGouge/ChanThreadWatch): Saves threads from \\*chan-style boards and checks for updates until the thread dies\n* [comics-downloader](https://github.com/Girbons/comics-downloader): Command-line tool to download comics and manga in pdf/epub/cbz/cbr from supported sites\n* [floatplane_ripper](https://gist.github.com/simon987/0756c378ca2dfb0003931e26ff7fe270): Script to rip all videos from https://floatplane.rip/\n* [gallery-dl](https://github.com/mikf/gallery-dl): Download image galleries and collections from pixiv, exhentai, danbooru and more\n* [Discord-Channel-Scraper](https://github.com/simon987/Discord-Channel-scraper): Discord server archival (JSON output, downloads attachments and emojis)\n* [dzi-dl](https://github.com/ryanfb/dzi-dl): Deep Zoom Image Downloader\n* [FanFicFare](https://github.com/JimmXinu/FanFicFare): Tool for making eBooks from stories on fanfiction and other web sites\n* ~~[FicSave](https://github.com/waylaidwanderer/FicSave): Online fanfiction downloader~~ Source code is available; the website, however, is now offline.\n* [flickr_download](https://github.com/beaufour/flickr-download): Simple script to download a Flickr set\n* [Google Images Download](https://github.com/hardikvasa/google-images-download): Python script for downloading images\n* [iiif-dl](https://github.com/ryanfb/iiif-dl): Command-line tile downloader/assembler for IIIF endpoints/manifests\n* [imgbrd-grabber](https://github.com/Bionus/imgbrd-grabber): Very customizable imageboard/booru 
downloader with powerful filenaming features\n* [instaloader](https://github.com/instaloader/instaloader): Download pictures (or videos) along with their captions and other metadata from Instagram\n* [InstaLooter](https://github.com/althonos/InstaLooter): API-less Instagram pictures and videos downloader.\n* [Instagram Scraper](https://github.com/dankmemes/instagram-scraper): Instagram-scraper is a command-line application written in Python that scrapes and downloads an instagram user's photos and videos. Use responsibly.\n* [PyInstaLive](https://github.com/notcammy/PyInstaLive): Instagram live stream downloader\n* [RedditDownloader](https://github.com/shadowmoose/RedditDownloader): Scrapes Reddit to download media of your choice\n* [Scribd-Downloader](https://github.com/ritiek/scribd-downloader): Allows downloading of Scribd documents\n* [snscrape](https://github.com/JustAnotherArchivist/snscrape): A social networking service scraper in Python\n* [RipMe](https://github.com/RipMeApp/ripme): RipMe is an album ripper for various websites. Runs on your computer. 
Requires Java 8.\n* [Tube Archivist](https://www.tubearchivist.com/): Self-Hosted Docker container for automated/scheduled YouTube downloads of channels, playlists, etc.\n* [tumblr-utils](https://github.com/bbolli/tumblr-utils): Utilities for dealing with Tumblr blogs, Tumblr backup\n* [yt-mango](https://github.com/terorie/yt-mango): YouTube metadata archiver\n* [Youtube-MA](https://github.com/CorentinB/YouTube-MA): YouTube metadata archiver\n\n### Download automation\n* [bazarr](https://github.com/morpheus65535/bazarr): Companion application to Sonarr and Radarr for downloading subtitles\n* [FlexGet](https://github.com/Flexget/Flexget): Multipurpose automation tool for content like torrents, nzbs, podcasts, comics, series, movies, etc.\n* [Jackett](https://github.com/Jackett/Jackett): API support for torrent trackers (works with Sonarr, Radarr and others)\n* [Lidarr](https://github.com/lidarr/Lidarr): Music collection manager for Usenet and BitTorrent users\n* [Mylar](https://github.com/evilhero/mylar): An automated Comic Book downloader (cbr/cbz) for use with SABnzbd, NZBGet and torrents\n* [Sick-Beard](https://github.com/midgetspy/Sick-Beard): PVR for newsgroup users (with limited torrent support)\n* [Radarr](https://github.com/Radarr/Radarr): A fork of Sonarr to work with movies à la Couchpotato\n* [Sonarr](https://github.com/Sonarr/Sonarr): PVR for Usenet and BitTorrent users\n\n## Backup\n\n**[`^        back to top        ^`](#)**\n\n* [BorgBackup](https://www.borgbackup.org/): Deduplicating archiver with compression and encryption\n\n## Compression\n\n**[`^        back to top        ^`](#)**\n\n* [7-Zip](https://www.7-zip.org/): A file archiver with a high compression ratio\n* [KGB Archiver](https://github.com/RandallFlagg/kgbarchiver): Compression tool with an unbelievably high compression rate\n* [peazip](http://www.peazip.org/): File archiver utility\n* [PIGZ](https://zlib.net/pigz/): Multi-threaded gzip\n* 
[WinRAR](https://www.rarlab.com/download.htm): Can decompress RAR and zip files\n\n## Network\n\n**[`^        back to top        ^`](#)**\n\n* [NetLimiter](https://www.netlimiter.com/): Internet traffic control and monitoring tool for Windows\n\n## File systems\n\n**[`^        back to top        ^`](#)**\n\n* [httpdirfs](https://github.com/fangfufu/httpdirfs/): A filesystem which allows you to mount HTTP directory listings\n* [mergerfs](https://github.com/trapexit/mergerfs): a featureful union filesystem\n* [NTFS drivers for MacOS](https://www.seagate.com/ca/en/support/downloads/item/ntfs-driver-for-mac-os-master-dl/)\n\n## File conversion\n\n**[`^        back to top        ^`](#)**\n\n* [AAXtoMP3](https://github.com/KrumpetPirate/AAXtoMP3): convert AAX files to common MP3, M4A, M4B, flac and ogg formats through a basic bash script frontend to FFMPEG\n* [html2warc](https://github.com/steffenfritz/html2warc): Convert web resources to a single warc file\n* [warcat](https://github.com/chfoo/warcat): Tool and library for handling Web ARChive (WARC) files\n\n## Utility Scripts\n\n**[`^        back to top        ^`](#)**\n\n* [Backblaze B2 sync backup script](https://gist.github.com/AlexanderProd/cb645cf858fd5c89780e7df267226b80): Script to sync multiple directories with Backblaze B2\n* [flac2mp3_V0.py](https://gist.github.com/simon987/2a1dd3090a2ad0574c00e171670b1e0d): Multi-threaded python script to convert all flac files to mp3 V0 while keeping the directory structure\n* [Misc download scripts](https://github.com/simon987/Misc-Download-Scripts): Scripts for downloading content from various websites\n* [TheFrenchGhosty's Ultimate YouTube-DL Scripts Collection](https://github.com/TheFrenchGhosty/TheFrenchGhostys-Ultimate-YouTube-DL-Scripts-Collection): Collection of YouTube-dl scripts to aid in YouTube channel archival\n* [rclone_dirsize](https://gist.github.com/simon987/7aff5ca3e9ae6c755055ca7b350ef9f8): Get size of http directory listing with rclone\n* 
[rm_empty_subdir](https://gist.github.com/simon987/f5c2cd7602898615ac9bc8c762d9fe1d): Remove empty sub-directories on Windows\n* [void-cat-uploader](https://github.com/takky1154/void-cat-uploader): This script automatically uploads all files inside a directory to https://void.cat\n* [youtube-dl_soundcloud](https://gist.github.com/simon987/2dd7c57d65a741c93f5791bc984b97d1): Snippet for using YouTube-dl to download soundcloud playlists\n\n## Content sharing\n\n**[`^        back to top        ^`](#)**\n\n* [h5ai](https://github.com/lrsjng/h5ai): HTTP web server index for Apache httpd, lighttpd, nginx and Cherokee\n* [ipfs](https://ipfs.io/): Protocol and network designed to create a content-addressable, peer-to-peer method of storing and sharing hypermedia in a distributed file system\n* [opds](https://opds.io/): Easy to use, Open \u0026 Decentralized Content Distribution\n* [Syncthing](https://syncthing.net/): An application that lets you synchronize your files across multiple devices\n\n## Data curation\n\n**[`^        back to top        ^`](#)**\n\n* [baobab](https://github.com/GNOME/baobab): Graphical disk usage analyzer\n* [beets](https://github.com/beetbox/beets): Music library manager and MusicBrainz tagger\n* [browsemonkey](https://github.com/shukriadams/browsemonkey): Takes snapshots of file systems for offline browsing and searching.\n* [Calibre](https://github.com/kovidgoyal/calibre): Ebook manager\n* [DataCurator-Filetree](https://github.com/roboyoshi/datacurator-filetree): A unified filetree for all kinds of data, which should help in storing, categorising and retrieving\n* [DeepSort](https://github.com/CorentinB/DeepSort/): AI powered image tagger backed by DeepDetect\n* [diskover](https://github.com/diskoverdata/diskover-community): File system crawler, disk space usage, file search engine and file system analytics powered by Elasticsearch\n* [Everything](https://www.voidtools.com/): Locate files and folders by name instantly (Windows)\n* 
[FileBot](https://www.filebot.net/): FileBot is the ultimate tool for organizing and renaming your Movies, TV Shows and Anime\n* [fucking-weeb](https://github.com/cosarara/fucking-weeb): A library manager for animu (and TV shows, and whatever).\n* [grepWin](https://github.com/stefankueng/grepWin): A powerful and fast search tool using regular expressions (Windows)\n* [Hydrus](https://github.com/hydrusnetwork/hydrus): A desktop application for large media collections\n* [Kiwix](https://www.kiwix.org): An offline reader for online content like Wikipedia, Project Gutenberg, or TED Talks\n* [jdupes](https://github.com/jbruchon/jdupes): Powerful duplicate file finder\n* [MediaElch](https://github.com/komet/mediaelch): Media manager for Kodi\n* [MediaInfo](https://github.com/MediaArea/MediaInfo): Convenient unified display of the most relevant technical and tag data for video and audio files\n* [Mp3tag](https://www.mp3tag.de): Powerful and easy-to-use tool to edit metadata of audio files (Windows/Mac)\n* [phockup](https://github.com/ivandokov/phockup): Media sorting tool to organize photos and videos from your camera\n* [picard](https://github.com/metabrainz/picard): MusicBrainz tagger\n* [TeraCopy](https://www.codesector.com/downloads): Copy your files faster and more securely\n* [tree](http://mama.indstate.edu/users/ice/tree/): 'tree' command for Linux\n* [WinDirStat](https://windirstat.net/): Disk usage statistics viewer and cleanup tool for Windows\n* [WizTree](https://antibody-software.com/web/software/software/wiztree-finds-the-files-and-folders-using-the-most-disk-space-on-your-hard-drive/): Finds the files and folders using the most disk space on your hard drive\n* [sist2](https://github.com/simon987/sist2/): Lightning-fast file system indexer and search tool\n* [SyncToy](https://www.microsoft.com/en-us/download/details.aspx?id=15155): Microsoft Windows tool for keeping files in sync across locations\n* [VisiPics](http://www.visipics.info/index.php?title=Main_Page): 
Automatically finds duplicated images\n\n## APIs \u0026 Online tools\n\n**[`^        back to top        ^`](#)**\n\n* [iqdb](https://iqdb.org/): Multi-service reverse image search\n* [thetvdb](https://www.thetvdb.com/): TV shows metadata (used by plex)\n\n## Hardware / Monitoring\n\n**[`^        back to top        ^`](#)**\n\n* [CrystalDiskInfo](https://crystalmark.info/en/software/crystaldiskinfo/): A HDD/SSD utility software which supports a part of USB, Intel RAID and NVMe\n* [GSmartControl](https://gsmartcontrol.shaduri.dev/): Easy to use Multi-OS S.M.A.R.T. utility with an easy to understand graphical interface\n* [Hard Drive Sentinel](https://www.hdsentinel.com/): Multi-OS SSD and HDD monitoring and analysis software\n* [smartmontools](https://www.smartmontools.org/): Control and monitor storage systems using the (SMART) built into most modern ATA/SATA, SCSI/SAS and NVMe disks\n\n## Data recovery\n\n**[`^        back to top        ^`](#)**\n\n* [PhotoRec](https://www.cgsecurity.org/wiki/PhotoRec): Powerful FOSS GUI data recovery tool\n* [TestDisk](https://www.cgsecurity.org/wiki/TestDisk_Download): Another FOSS tool by the author of PhotoRec, but this one is for the CLI\n\n## Local Media\n\n**[`^        back to top        ^`](#)**\n\n* [whipper](https://github.com/whipper-team/whipper): Python CD-DA ripper preferring accuracy over speed. Generates .flac, .cue, and .log by default and automatically fetches metadata from musicbrainz. EAC log plugin is available.\n* [Exact Audio Copy](http://www.exactaudiocopy.de/): A freeware, Windows-only application similar to the above that doesn't automatically fetch metadata by default, but EAC rips are preferred by most trackers\n* [MakeMKV](https://www.makemkv.com/): A cross-platform DVD ripper that supports recent Blu-ray discs. It's mostly open source, but the Blu-ray secret sauce is still hidden\n* [Handbrake](https://handbrake.fr/): Open source DVD ripper and media transcoder. 
Has more options and features than the above, but it cannot rip Blu-ray discs\n\n## Long-term data archiving\n\n**[`^        back to top        ^`](#)**\n\n* [CommonCrawl](http://commoncrawl.org/the-data/get-started/): Data collected over seven years (ongoing) which contains web page data, extracted metadata and text extractions.\n* [Blockyarchive](https://github.com/darrenldl/blockyarchive): Archive with forward error correction and sector level recoverability\n* [par2cmdline](https://github.com/Parchive/par2cmdline): A PAR 2.0 compatible file verification and repair tool\n","projects_url":"https://awesome.ecosyste.ms/api/v1/lists/simon987%2Fawesome-datahoarding/projects"}