Projects in Awesome Lists tagged with warc

https://github.com/pirate/ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

archivebox backups bookmark-archiver browser-bookmarks chromium digipres firefox headless-browser internet-archiving pinboard pocket python rss self-hosted singlefile warc wayback-machine web-archiving wget youtube-dl

Last synced: 30 Oct 2024

https://github.com/archivebox/archivebox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

archivebox backups bookmark-archiver browser-bookmarks chromium digipres firefox headless-browser internet-archiving pinboard pocket python rss self-hosted singlefile warc wayback-machine web-archiving wget youtube-dl

Last synced: 16 Dec 2024

https://github.com/ArchiveBox/ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

archivebox backups bookmark-archiver browser-bookmarks chromium digipres firefox headless-browser internet-archiving pinboard pocket python rss self-hosted singlefile warc wayback-machine web-archiving wget youtube-dl

Last synced: 25 Oct 2024

https://github.com/internetarchive/heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

heritrix java warc webcrawling

Last synced: 17 Dec 2024

https://github.com/Rhizome-Conifer/conifer

Collect and revisit web pages.

archives docker python pywb warc wayback web-archiving webrecorder

Last synced: 29 Oct 2024

https://github.com/rhizome-conifer/conifer

Collect and revisit web pages.

archives docker python pywb warc wayback web-archiving webrecorder

Last synced: 15 Dec 2024

https://github.com/archiveteam/grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

archiving crawl crawler spider warc

Last synced: 19 Dec 2024

https://github.com/ArchiveTeam/grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

archiving crawl crawler spider warc

Last synced: 06 Nov 2024

https://github.com/webrecorder/archiveweb.page

A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers!

archiving browser-extension chromium extension wacz warc web-archiving webrecorder

Last synced: 19 Dec 2024

https://github.com/webrecorder/replayweb.page

Serverless replay of web archives directly in the browser

replay-web-page service-worker wacz warc wayback-machine web-archive web-archiving web-replay

Last synced: 21 Dec 2024

https://github.com/oduwsdl/ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

docker ipfs memento memento-rfc python service-worker warc wayback web-archiving

Last synced: 21 Dec 2024

https://github.com/webrecorder/browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container

crawler crawling wacz warc web-archiving web-crawler webrecorder

Last synced: 18 Dec 2024

https://github.com/webrecorder/webrecorder-player

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)

electron pywb warc web-archiving webrecorder

Last synced: 04 Nov 2024

https://github.com/florents-tselai/warcdb

WarcDB: Web crawl data as SQLite databases.

cli crawling database sqlite warc web-archiving web-data

Last synced: 21 Dec 2024

https://github.com/Florents-Tselai/WarcDB

WarcDB: Web crawl data as SQLite databases.

cli crawling database sqlite warc web-archiving web-data

Last synced: 06 Nov 2024

https://github.com/machawk1/wail

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

gui heritrix openwayback pyinstaller python warc wayback web-archiving

Last synced: 15 Dec 2024

https://github.com/webrecorder/warcio

Streaming WARC/ARC library for fast web archive IO

python pywb warc web-archives web-archiving

Last synced: 21 Dec 2024

https://github.com/commoncrawl/news-crawl

News crawling with StormCrawler - stores content as WARC

apache-storm common-crawl commoncrawl crawler news storm-crawler warc web-crawler

Last synced: 16 Nov 2024

https://github.com/machawk1/warcreate

Chrome extension to "Create WARC files from any webpage"

chrome-extension warc web-archiving

Last synced: 18 Dec 2024

https://github.com/cocrawler/cocrawler

CoCrawler is a versatile web crawler built using modern tools and concurrency.

aiohttp aiohttp-client async-python concurrency crawler pluggable-modules python3 screenshot warc

Last synced: 29 Oct 2024

https://github.com/webrecorder/browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

archiving cloud kubernetes wacz warc web-archive web-archiving webrecorder

Last synced: 20 Dec 2024

https://github.com/cocrawler/cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

cdx cdx-api commoncrawl python warc web-archives web-archiving

Last synced: 06 Nov 2024

https://github.com/helgeho/ArchiveSpark

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

archivespark internet-archive spark spark-framework warc web-archiving webarchive

Last synced: 06 Nov 2024

https://github.com/N0taN3rd/wail

🐋 One-Click User Instigated Preservation

browser-based-presrevation electron high-fidelity-preservation warc web-archiving

Last synced: 04 Nov 2024

https://github.com/archiveteam/wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

archiveteam archiving crawl crawler crawlers crawling downloader ftp lua scraper scraping spider warc webarchiving wget wget-lua zstd

Last synced: 21 Dec 2024

https://github.com/maxcountryman/warc-parquet

🗄️ A simple CLI for converting WARC to Parquet.

crawling duckdb parquet warc web-archiving

Last synced: 18 Dec 2024

https://github.com/ArchiveTeam/wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

archiveteam archiving crawl crawler crawlers crawling downloader ftp lua scraper scraping spider warc webarchiving wget wget-lua zstd

Last synced: 25 Nov 2024

https://github.com/n0tan3rd/node-warc

Parse And Create Web ARChive (WARC) files with node.js

chrome-remote-interface pupeteer warc warc-files web-archives web-archiving webarchive webarchiving

Last synced: 17 Nov 2024

https://github.com/N0taN3rd/node-warc

Parse And Create Web ARChive (WARC) files with node.js

chrome-remote-interface pupeteer warc warc-files web-archives web-archiving webarchive webarchiving

Last synced: 09 Dec 2024

https://github.com/CGamesPlay/chronicler

Offline-first web browser

browser electron warc

Last synced: 07 Nov 2024

https://github.com/centic9/commoncrawldocumentdownload

A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-testing of frameworks like Apache POI and Apache Tika

cdx-files commoncrawl java mime-types warc

Last synced: 17 Dec 2024

https://github.com/internetarchive/cdx-summary

Summarize web archive capture index (CDX) files.

archive cdx collection nodejs python report statistics summary warc web-archive webcomponents

Last synced: 17 Nov 2024

https://github.com/archivesunleashed/warclight

A Rails engine supporting the discovery of web archives.

blacklight discovery rails rails-engine ruby solr warc webarchive-discovery webarchives

Last synced: 02 Dec 2024

https://github.com/pirate/internet-archiving-talk

🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.

archivebox censorship ethics internet-archiving slideshow talks warc web-archiving wget

Last synced: 28 Oct 2024

https://github.com/openzim/warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format

scraper warc zim

Last synced: 16 Dec 2024

https://github.com/jedireza/warc

⚙️ A Rust library for reading and writing WARC files

rust rust-library warc

Last synced: 17 Dec 2024

https://github.com/datacoon/metawarc

metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) archives

metadata osint osint-python warc warc-files webarchiving

Last synced: 06 Nov 2024

https://github.com/webrecorder/cdxj-indexer

CDXJ Indexing of WARC/ARCs

warc web-archiving

Last synced: 19 Dec 2024

https://github.com/hrbrmstr/warc

📇 Tools to Work with the Web Archive Ecosystem in R

r r-cyber rstats warc warc-ecosystem warc-files

Last synced: 11 Oct 2024

https://github.com/internetarchive/scrapy-warcio

Support for writing WARC files with Scrapy

python scrapy warc web-archiving

Last synced: 17 Nov 2024

https://github.com/ArchiveTeam/WebArchiver

Decentralized web archiving

archiver archiving crawler decentralized python warc web webarchiving

Last synced: 06 Nov 2024

https://github.com/archiveteam/webarchiver

Decentralized web archiving

archiver archiving crawler decentralized python warc web webarchiving

Last synced: 19 Nov 2024

https://github.com/corentinb/warc

Read and write WARC files in Go

archiving go warc

Last synced: 17 Nov 2024

https://github.com/openzim/zimit-frontend

Zimit Public Web UI

spider warc zim

Last synced: 12 Nov 2024

https://github.com/orottier/rust-warc

A high-performance, easy-to-use Web Archive (WARC) file reader

parser rust warc

Last synced: 09 Nov 2024

https://github.com/oduwsdl/off-topic-memento-toolkit

This system evaluates a collection of mementos (archived web pages) to determine which are off topic. The collection can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.

cosine measure memento simhash timemap topic warc