https://github.com/all-the-data/awesome-data-hoarding

How to save everything online. Tools for scraping, saving, downloading, hoarding, archiving, etc.

List: awesome-data-hoarding

archiving awesome awesome-list data-hoarder hoarding reddit reddit-downloader



Host: GitHub
URL: https://github.com/all-the-data/awesome-data-hoarding
Owner: all-the-data
License: cc-by-sa-4.0
Created: 2022-02-18T09:44:20.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2024-03-19T21:11:44.000Z (about 1 year ago)
Last Synced: 2024-04-10T15:05:37.880Z (about 1 year ago)
Topics: archiving, awesome, awesome-list, data-hoarder, hoarding, reddit, reddit-downloader
Homepage:
Size: 81.1 KB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# awesome-data-hoarding

A concise cheat-sheet of commands and tools for scraping, saving, hoarding, archiving, collecting, organising and browsing data.

Inspired by Reddit's [/r/DataHoarder](https://www.reddit.com/r/DataHoarder/)

## Quick reference

Which archiving tool should you choose for each web service?

- Amazon Video: Unknown. Check torrents instead.

- BBC iPlayer: [youtube-dl](https://youtube-dl.org/) / [yt-dlp](https://github.com/yt-dlp/yt-dlp)

- Discord: DiscordChatExporter (see below for notes)

- MediaWiki website: Native dump using `/wiki/Special:AllPages` and `/wiki/Special:Export`.

- Netflix: Unknown. Check torrents instead.

- Reddit: Various tools

  - Tools to save whole threads

    - [Bulk-Downloader-For-Reddit](https://github.com/aliparlakci/bulk-downloader-for-reddit)

    - [BDFRX](https://github.com/OMEGARAZER/bulk-downloader-for-reddit-x#differences-from-bdfr)

    - [Gallery-DL](https://github.com/mikf/gallery-dl)

    - [RipMe](https://github.com/RipMeApp2/ripme).

  - "Print" method for threads

    - Change `www.reddit.com` to `old.reddit.com` -- all comments will now be expanded

    - Sort by: New

    - Use the [cleanly print](https://chromewebstore.google.com/detail/cleanly-print/afloocnncgjhdlacbejppjepboilajdg) Chrome extension

    - Click to remove areas, also click to 'tag' areas for printing.

  - Historical data dumps: [the-eye](https://the-eye.eu/redarcs/) / [torrents](https://academictorrents.com/userdetails.php?id=9863)

- SoundCloud: [youtube-dl](https://youtube-dl.org/) / [yt-dlp](https://github.com/yt-dlp/yt-dlp)

- Tumblr: [TumblThreeApp](https://github.com/TumblThreeApp/TumblThree) (Windows). Viewers: [1](https://github.com/jacob-pro/tumbl-three-viewer), [2](https://github.com/willsheppard/random-scripts/blob/master/TumblThree_BackupViewer.html).

- Twitter: [ThreadReaderApp](https://threadreaderapp.com/)

- Torrents: Use [unblockit](https://www.google.com/search?q=unblockit) for a list of torrent sites. Official [Twitter](https://twitter.com/thepirateproxy) / [Reddit](https://www.reddit.com/r/Unblockit/).

- Private torrent trackers: Might contain any TV show or movie ever broadcast. It can be difficult to get an invite, and you may need to maintain an upload ratio.

- Individual web pages:

  - Save as | Web Page, HTML Only

  - Save as | Web Page, Single File

  - Save as | Web Page, Complete

  - Print | Save as PDF

  - Chrome extension [SingleFile](https://chromewebstore.google.com/detail/singlefile/mpiodijhokgodhhofbcjdecpffjipkle) <-- Recommended!

- Websites generally: wget, httrack or [ArchiveBot](https://wiki.archiveteam.org/index.php?title=ArchiveBot).

- YouTube video/music: [youtube-dl](https://youtube-dl.org/) (see below for notes) / [yt-dlp](https://github.com/yt-dlp/yt-dlp)

- Radio scrobbling / Music identification: [Shazam](https://chromewebstore.google.com/detail/shazam-find-song-names-fr/mmioliijnhnoblpgimnlajmefafdfilb) or [AHA Music finder](https://chromewebstore.google.com/detail/aha-music-song-finder-for/dpacanjfikmhoddligfbehkpomnbgblf)

## Scraping tools

- Radio scrobbling

  - Play radio station with low quality playlist: [La Mega, Malaga](https://onlineradiobox.com/es/lamegaradio/).

  - Install the Chrome browser extension [Shazam](https://chromewebstore.google.com/detail/shazam-find-song-names-fr/mmioliijnhnoblpgimnlajmefafdfilb) or [AHA Music finder](https://chromewebstore.google.com/detail/aha-music-song-finder-for/dpacanjfikmhoddligfbehkpomnbgblf)

  - On Linux, use `xdotool` to automate clicking the Chrome extension icons that activate music identification: `watch "xdotool mousemove 3442 90 click 1; sleep 20; xdotool mousemove 3476 90 click 1; sleep 20"` (adjust the coordinates as needed)

  - Does not require speakers to be on

Details of precise sets of commands.

- [wget](https://www.gnu.org/software/wget/manual/wget.html) for websites

```
wget \
    -e 'robots=off' \
    --accept '*.*' \
    --mirror \
    --wait 2 \
    --random-wait \
    --convert-links \
    --user-agent 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.7113.93 Safari/537.36' \
    'http://www.example.com/'
```

- [StreamRipper](http://streamripper.sourceforge.net) for music

    - Example: `streamripper ###URL### -u "FreeAmp/2.x" -q -l 86400`

- [Chrome DevTools](https://developer.chrome.com/docs/devtools) for anything via a web browser

    - [network tab](https://developer.chrome.com/docs/devtools/network/reference)

    - [resources tab](https://developer.chrome.com/docs/devtools/resources)

- MediaWiki for wiki sites

    - For an XML dump containing wikitext...

    - Copy names of pages from `/wiki/Special:AllPages`...

    - Paste into `/wiki/Special:Export`

    - (optional) Parse resulting wikitext with [mwparserfromhell](https://github.com/earwig/mwparserfromhell).
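Once `/wiki/Special:Export` has produced an XML file, the wikitext still has to be pulled out of it before any parser can work on it. The following is a minimal sketch using only the Python standard library (not mwparserfromhell). The function names and the sample dump are made up for illustration, and the export namespace version on a given wiki may differ, which is why the tags are matched by local name only.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def pages_from_export(xml_text):
    """Map each page title to the wikitext found in the dump."""
    pages = {}
    root = ET.fromstring(xml_text)
    for page in root.iter():
        if local_name(page.tag) != "page":
            continue
        title, text = None, ""
        for node in page.iter():
            name = local_name(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                # If several revisions are present, the last one wins.
                text = node.text or ""
        if title is not None:
            pages[title] = text
    return pages


# Hypothetical, heavily trimmed sample; a real dump has many more elements.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Main Page</title>
    <revision><text>Hello '''world'''</text></revision>
  </page>
</mediawiki>"""

print(pages_from_export(SAMPLE))
```

Each value in the resulting dictionary is raw wikitext, which can then be handed to a wikitext parser if structured access is needed.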

- [youtube-dl](https://yt-dl.org) / [yt-dlp](https://github.com/yt-dlp/yt-dlp) for YouTube and other video/audio

  - Video

```
yt-dlp \
    --ignore-errors \
    --format 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best' \
    --output "%(playlist_title)s/%(title)s.%(ext)s" \
    --throttled-rate 10K \
    ###URL###
```

  - Audio

```
yt-dlp \
    --ignore-errors \
    --extract-audio \
    --audio-quality 0 \
    --audio-format mp3 \
    --prefer-ffmpeg \
    --output "%(playlist_title)s/%(artist)s - %(title)s.%(ext)s" \
    --throttled-rate 10K \
    ###URL###
```

  - Audio album playlist

```
yt-dlp \
    ...etc... \
    --output "%(artist)s - %(album)s/%(artist)s - %(album)s - %(playlist_index)02d - %(track)s.%(ext)s" \
    ###URL###
```

  - Video playlist

```
yt-dlp \
    ...etc... \
    --output "%(playlist_title)s/%(playlist_index)03d - %(artist)s - %(title)s.%(ext)s" \
    ###URL###
```

  - Multiple playlists

```
for URL in $(cat list)
do
    yt-dlp ...etc... "$URL"
done
```
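For longer lists it can help to drive the same loop from a script, so that blank lines can be skipped and each URL is passed as a single argument without shell quoting concerns. This is a rough sketch only: `read_urls`, `build_command` and `download_all` are hypothetical helper names, treating lines starting with `#` as comments is an assumption about the list file, and the yt-dlp options are the ones already used in the video example.

```python
import subprocess


def read_urls(lines):
    """Return stripped lines that are neither blank nor '#' comments."""
    urls = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def build_command(url):
    """Build a yt-dlp argument list from options shown in the video example."""
    return [
        "yt-dlp",
        "--ignore-errors",
        "--output", "%(playlist_title)s/%(title)s.%(ext)s",
        url,
    ]


def download_all(path):
    """Run yt-dlp once per URL; a failed playlist does not stop the rest."""
    with open(path, encoding="utf-8") as handle:
        for url in read_urls(handle):
            subprocess.run(build_command(url), check=False)
```

Calling `download_all("list")` would process the same `list` file as the shell loop, assuming `yt-dlp` is on the `PATH`.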

- [DiscordChatExporter](https://github.com/Tyrrrz/DiscordChatExporter) + excellent [wiki](https://github.com/Tyrrrz/DiscordChatExporter/wiki)

    - Example: `docker run --rm -v /var/www/zaphod/adhd:/app/out tyrrrz/discordchatexporter:stable export --channel ###ID### --token ###SECRET### --format Json`

    - List guilds: `docker run tyrrrz/discordchatexporter:stable guilds`

    - List channels: `docker run tyrrrz/discordchatexporter:stable channels --guild ###ID###`
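An export made with `--format Json` can be turned into a plain, greppable log with a few lines of Python. The sketch below assumes the export has a top-level `messages` list whose items carry `timestamp`, `content` and `author.name`; check those field names against a real export before relying on them. The sample data is made up.

```python
import json


def format_messages(export):
    """Yield one 'timestamp <author> content' line per message."""
    for message in export.get("messages", []):
        author = message.get("author", {}).get("name", "?")
        timestamp = message.get("timestamp", "")
        content = message.get("content", "")
        yield f"{timestamp} <{author}> {content}"


# Made-up sample in the assumed shape; a real export has many more fields.
SAMPLE = json.loads("""
{"messages": [
  {"timestamp": "2022-08-01T12:00:00+00:00",
   "author": {"name": "zaphod"},
   "content": "hello"}
]}
""")

for line in format_messages(SAMPLE):
    print(line)
```

For a real file, replace `SAMPLE` with `json.load(open(path, encoding="utf-8"))`, where `path` points at the exported file.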

## Processing tools

- [jq](https://stedolan.github.io/jq/)

    - Example: `jq -j -M --stream -f discord1.jq` [discord1.jq](https://gist.github.com/willsheppard/f9b7cc9b130784ffd7bd8f144cf892f8)

- [XPath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl)

    - Example: Ctrl-Shift-X (or Command-Shift-X on Mac)

- HAR recorders (if for some reason Chrome's "Save as HAR" feature isn't sufficient)

  - [AutoHAR](https://github.com/Aloisius/autohar)

  - [HAR Recorder](https://chrome.google.com/webstore/detail/har-recorder/emfabjnfjiknifjlfpjobbecfepplhkd)

- HAR extractors (to retrieve the original content from inside a HAR file) 

  - [How can I extract the contents of a .HAR file](https://www.reddit.com/r/browsers/comments/bczaiz/how_can_i_extract_the_contents_of_a_har_file_in)

  - [quentint/har-extract.js](https://gist.github.com/quentint/7236a6a7cae187507b0c1dfd4b1ed1c5)

  - [JC3/harextract](https://github.com/JC3/harextract) + see [forks](https://github.com/JC3/harextract/forks)

  - [crazypatoto/MorkVideoExtractor](https://github.com/crazypatoto/MorkVideoExtractor)

  - [outersky/har-tools](https://github.com/outersky/har-tools)
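If none of the extractors above fit, the core of a HAR extractor is small, because a HAR file is JSON with one entry per request. The sketch below is not taken from any of the listed tools. It relies on the standard HAR layout (`log.entries[].response.content` with `text`, `mimeType` and an optional `encoding` of `base64`); the function name and sample data are illustrative only.

```python
import base64


def har_bodies(har):
    """Yield (url, mime_type, body_bytes) for every entry with a stored body."""
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        text = content.get("text")
        if text is None:
            continue  # the HAR was saved without response content
        if content.get("encoding") == "base64":
            body = base64.b64decode(text)
        else:
            body = text.encode("utf-8")
        yield entry["request"]["url"], content.get("mimeType", ""), body


# Made-up two-entry HAR: one plain-text body, one base64-encoded body.
SAMPLE = {
    "log": {"entries": [
        {"request": {"url": "https://example.com/a.txt"},
         "response": {"content": {"mimeType": "text/plain", "text": "hi"}}},
        {"request": {"url": "https://example.com/b.bin"},
         "response": {"content": {"mimeType": "application/octet-stream",
                                   "encoding": "base64", "text": "AAEC"}}},
    ]}
}

for url, mime, body in har_bodies(SAMPLE):
    print(url, mime, len(body))
```

With a real file, load it via `json.load` and write each `body` to a file name derived from the URL.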

## Techniques

### Combine streamed .ts files and m3u8 playlist/chunklist into an mpeg/mp4 video

- After extracting the .m3u8 and .ts files from HAR, run something like:

  - `ffmpeg -i playlist.m3u8 -c copy -bsf:a aac_adtstoasc output.mp4`

### Extract playlist data from YouTube and YT Music

Input: https://music.youtube.com/library/playlists

Goal: Extract a list of playlists suitable for feeding to youtube-dl / yt-dlp

These are alternative ways to attempt the same thing, with notes on which ones work:

1. Chrome: Save As | Web Page, HTML Only --> doesn't work, empty page

1. Chrome: Save As | Web page, Single File --> works, full HTML, embeds images, uses "quoted printable encoding", i.e. `=` becomes `=3D`

1. Chrome: Save As | Web page, Complete --> works, full HTML, not encoded, saves album/playlist covers as image files.

1. Chrome: DevTools | Elements |  | right-click | Copy | Copy element | Paste into text editor --> works, full HTML

1. Chrome: Extensions | XPath Helper | Ctrl-Shift-X | Hover over element | Shift | Edit XPath to remove e.g. `[409]` | Append `/@href` --> works, list of URLs

1. Chrome: DevTools | Console | [](https://stackoverflow.com/a/7474386) | [](https://stackoverflow.com/a/20495940) | `$(document).xpathEvaluate('//body/div/foo')`

1. Chrome: DevTools | Elements | right-click | Copy | Copy JS | (paste into console and edit - see snippet below)

1. Chrome: Extensions | AutoHAR | chrome --auto-open-devtools-for-tabs | ...etc

1. Chrome: DevTools | Network | Filter | Fetch/XHR | https://music.youtube.com/youtubei/v1/browse/...etc... | (a) Save all as HAR with content, (b) (down-arrow near top-right) Export HAR... 

1. (Idea) Headless chrome + puppeteer or playwright

Javascript snippet:

    

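```
// Log "URL    details text    title" for each playlist tile on the page.
const items = document.querySelectorAll("#items > ytmusic-two-row-item-renderer");
items.forEach((item) => {
    const drill = item.querySelector("div.details.style-scope.ytmusic-two-row-item-renderer");
    if (!drill) { return; }
    // Skip tiles that lack the third <span> in their details line.
    const span = drill.querySelector('span > yt-formatted-string > span:nth-child(3)');
    if (!span) { return; }
    const link = drill.querySelector('a');
    console.log(
        link.toString()      // an anchor element stringifies to its href
        + "    " + span.innerHTML
        + "    " + link.text
    );
});
```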

Shorter snippet:

```
var output = '';
document.querySelectorAll("h3 > div > div > a").forEach((item) => { output += item.text + "\n"; });
console.log(output);
console.save(output);
```

[Save data out of console](https://stackoverflow.com/questions/41032565/how-to-copy-the-objects-from-chrome-console-window) via the clipboard or by writing a file (provides the `console.save()` command).
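For the "Single File" save, the quoted-printable encoding (`=` stored as `=3D`) has to be undone before links can be matched. The sketch below decodes the bytes with Python's `quopri` module and collects `href` values with the standard HTML parser. Filtering on `playlist?list=` is an assumption about how the playlist links look, the names are hypothetical, and decoding the whole file in one go is a simplification that ignores the MIME structure of the saved file.

```python
import quopri
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect unique href values of <a> tags that look like playlist links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if "playlist?list=" in href and href not in self.links:
            self.links.append(href)


def playlist_links(saved_html, quoted_printable=False):
    """Return playlist hrefs found in the bytes of a saved page."""
    if quoted_printable:
        saved_html = quopri.decodestring(saved_html)
    collector = LinkCollector()
    collector.feed(saved_html.decode("utf-8", errors="replace"))
    return collector.links


# Made-up fragment in the encoded form described for "Single File" saves.
SAMPLE = b'<a href=3D"playlist?list=3DPL123">Mix</a> <a href=3D"/about">x</a>'
print(playlist_links(SAMPLE, quoted_printable=True))
```

The resulting list, one URL per line, has the shape of input that the "Multiple playlists" loop expects; relative links would need the site prefix added first.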

### Case studies

- **naive-slack-scraper**. Hypothetical code that cannot exist, as it potentially wouldn't follow terms of service. So don't look for it.

- [pokemon-data](https://github.com/pokemon-names/pokemon-data/blob/main/data/README.md). jq examples.

- [moar jq examples](https://wills-tech-notes.blogspot.com/2022/08/jq-cheat-sheet.html)

## Discussion

- If an archive of data is made, and that data cannot be viewed reasonably easily in a way similar to its original presentation by a person on the street, then it can be considered not to be viewable at all. It may as well not exist for public purposes. A possible retort is to assert "A viewer program could be built". But if that viewer program doesn't yet exist, then the data still can't be viewed. It's a Schroedinger's archive.

## Communities

- https://www.reddit.com/r/DataHoarder

- https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community

## Similar projects

- https://github.com/iipc/awesome-web-archiving

- https://github.com/lorien/awesome-web-scraping

- https://github.com/igorbarinov/awesome-data-engineering
