Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with webarchiving

A curated list of projects in awesome lists tagged with webarchiving.

https://github.com/N0taN3rd/Squidwarc

Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium, with or without a head

browser-automation chrome chrome-headless crawler crawling headless-chrome high-fidelity-preservation puppeteer webarchives webarchiving

Last synced: 01 Aug 2024

https://github.com/n0tan3rd/squidwarc

Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium, with or without a head

browser-automation chrome chrome-headless crawler crawling headless-chrome high-fidelity-preservation puppeteer webarchives webarchiving

Last synced: 31 Jul 2024

https://github.com/N0taN3rd/node-warc

Parse and create Web ARChive (WARC) files with Node.js

chrome-remote-interface pupeteer warc warc-files web-archives web-archiving webarchive webarchiving

Last synced: 17 Aug 2024

https://github.com/ArchiveTeam/wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

archiveteam archiving crawl crawler crawlers crawling downloader ftp lua scraper scraping spider warc webarchiving wget wget-lua zstd

Last synced: 06 Aug 2024

https://github.com/peterk/warcworker

A dockerized, queued, high-fidelity web archiver based on Squidwarc

archiving high-fidelity-preservation preservation webarchives webarchiving

Last synced: 01 Aug 2024

https://github.com/commoncrawl/cc-notebooks

Various Jupyter notebooks about Common Crawl data

aws-athena common-crawl commoncrawl jupyter-notebook webarchiving webgraph-framework

Last synced: 17 Aug 2024

https://github.com/datacoon/metawarc

metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) archives

metadata osint osint-python warc warc-files webarchiving

Last synced: 01 Aug 2024

https://github.com/ibnesayeed/archival-tests

A set of web archival replay test cases

archival-replay memento replay-tests testing webarchive webarchiving

Last synced: 01 Oct 2024