Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with warc

A curated list of projects in awesome lists tagged with warc.

https://github.com/pirate/ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

archivebox backups bookmark-archiver browser-bookmarks chromium digipres firefox headless-browser internet-archiving pinboard pocket python rss self-hosted singlefile warc wayback-machine web-archiving wget youtube-dl

Last synced: 30 Oct 2024

https://github.com/archivebox/archivebox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

archivebox backups bookmark-archiver browser-bookmarks chromium digipres firefox headless-browser internet-archiving pinboard pocket python rss self-hosted singlefile warc wayback-machine web-archiving wget youtube-dl

Last synced: 16 Dec 2024

https://github.com/ArchiveBox/ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

archivebox backups bookmark-archiver browser-bookmarks chromium digipres firefox headless-browser internet-archiving pinboard pocket python rss self-hosted singlefile warc wayback-machine web-archiving wget youtube-dl

Last synced: 25 Oct 2024

https://github.com/internetarchive/heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

heritrix java warc webcrawling

Last synced: 17 Dec 2024

https://github.com/archiveteam/grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

archiving crawl crawler spider warc

Last synced: 19 Dec 2024

https://github.com/ArchiveTeam/grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

archiving crawl crawler spider warc

Last synced: 06 Nov 2024

https://github.com/webrecorder/archiveweb.page

A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers!

archiving browser-extension chromium extension wacz warc web-archiving webrecorder

Last synced: 19 Dec 2024

https://github.com/webrecorder/replayweb.page

Serverless replay of web archives directly in the browser

replay-web-page service-worker wacz warc wayback-machine web-archive web-archiving web-replay

Last synced: 21 Dec 2024

https://github.com/oduwsdl/ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

docker ipfs memento memento-rfc python service-worker warc wayback web-archiving

Last synced: 21 Dec 2024

https://github.com/webrecorder/browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container

crawler crawling wacz warc web-archiving web-crawler webrecorder

Last synced: 18 Dec 2024

https://github.com/webrecorder/webrecorder-player

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)

electron pywb warc web-archiving webrecorder

Last synced: 04 Nov 2024

https://github.com/florents-tselai/warcdb

WarcDB: Web crawl data as SQLite databases.

cli crawling database sqlite warc web-archiving web-data

Last synced: 21 Dec 2024

https://github.com/Florents-Tselai/WarcDB

WarcDB: Web crawl data as SQLite databases.

cli crawling database sqlite warc web-archiving web-data

Last synced: 06 Nov 2024

https://github.com/machawk1/wail

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

gui heritrix openwayback pyinstaller python warc wayback web-archiving

Last synced: 15 Dec 2024

https://github.com/webrecorder/warcio

Streaming WARC/ARC library for fast web archive IO

python pywb warc web-archives web-archiving

Last synced: 21 Dec 2024

https://github.com/commoncrawl/news-crawl

News crawling with StormCrawler - stores content as WARC

apache-storm common-crawl commoncrawl crawler news storm-crawler warc web-crawler

Last synced: 16 Nov 2024

https://github.com/machawk1/warcreate

Chrome extension to "Create WARC files from any webpage"

chrome-extension warc web-archiving

Last synced: 18 Dec 2024

https://github.com/cocrawler/cocrawler

CoCrawler is a versatile web crawler built using modern tools and concurrency.

aiohttp aiohttp-client async-python concurrency crawler pluggable-modules python3 screenshot warc

Last synced: 29 Oct 2024

https://github.com/webrecorder/browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

archiving cloud kubernetes wacz warc web-archive web-archiving webrecorder

Last synced: 20 Dec 2024

https://github.com/cocrawler/cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

cdx cdx-api commoncrawl python warc web-archives web-archiving

Last synced: 06 Nov 2024

https://github.com/helgeho/ArchiveSpark

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

archivespark internet-archive spark spark-framework warc web-archiving webarchive

Last synced: 06 Nov 2024

https://github.com/N0taN3rd/wail

🐋 One-Click User Instigated Preservation

browser-based-presrevation electron high-fidelity-preservation warc web-archiving

Last synced: 04 Nov 2024

https://github.com/archiveteam/wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

archiveteam archiving crawl crawler crawlers crawling downloader ftp lua scraper scraping spider warc webarchiving wget wget-lua zstd

Last synced: 21 Dec 2024

https://github.com/maxcountryman/warc-parquet

🗄️ A simple CLI for converting WARC to Parquet.

crawling duckdb parquet warc web-archiving

Last synced: 18 Dec 2024

https://github.com/ArchiveTeam/wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

archiveteam archiving crawl crawler crawlers crawling downloader ftp lua scraper scraping spider warc webarchiving wget wget-lua zstd

Last synced: 25 Nov 2024

https://github.com/n0tan3rd/node-warc

Parse and create Web ARChive (WARC) files with Node.js

chrome-remote-interface pupeteer warc warc-files web-archives web-archiving webarchive webarchiving

Last synced: 17 Nov 2024

https://github.com/N0taN3rd/node-warc

Parse and create Web ARChive (WARC) files with Node.js

chrome-remote-interface pupeteer warc warc-files web-archives web-archiving webarchive webarchiving

Last synced: 09 Dec 2024

https://github.com/CGamesPlay/chronicler

Offline-first web browser

browser electron warc

Last synced: 07 Nov 2024

https://github.com/centic9/commoncrawldocumentdownload

A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks like Apache POI and Apache Tika.

cdx-files commoncrawl java mime-types warc

Last synced: 17 Dec 2024

https://github.com/archivesunleashed/warclight

A Rails engine supporting the discovery of web archives.

blacklight discovery rails rails-engine ruby solr warc webarchive-discovery webarchives

Last synced: 02 Dec 2024

https://github.com/pirate/internet-archiving-talk

🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.

archivebox censorship ethics internet-archiving slideshow talks warc web-archiving wget

Last synced: 28 Oct 2024

https://github.com/openzim/warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format

scraper warc zim

Last synced: 16 Dec 2024

https://github.com/jedireza/warc

⚙️ A Rust library for reading and writing WARC files

rust rust-library warc

Last synced: 17 Dec 2024

https://github.com/datacoon/metawarc

metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) files

metadata osint osint-python warc warc-files webarchiving

Last synced: 06 Nov 2024

https://github.com/webrecorder/cdxj-indexer

CDXJ Indexing of WARC/ARCs

warc web-archiving

Last synced: 19 Dec 2024

https://github.com/hrbrmstr/warc

📇 Tools to Work with the Web Archive Ecosystem in R

r r-cyber rstats warc warc-ecosystem warc-files

Last synced: 11 Oct 2024

https://github.com/internetarchive/scrapy-warcio

Support for writing WARC files with Scrapy

python scrapy warc web-archiving

Last synced: 17 Nov 2024

https://github.com/corentinb/warc

Read and write WARC files in Go

archiving go warc

Last synced: 17 Nov 2024

https://github.com/openzim/zimit-frontend

Zimit Public Web UI

spider warc zim

Last synced: 12 Nov 2024

https://github.com/orottier/rust-warc

A high performance and easy to use Web Archive (WARC) file reader

parser rust warc

Last synced: 09 Nov 2024

https://github.com/oduwsdl/off-topic-memento-toolkit

This system evaluates a collection of mementos (archived web pages) to determine which are off topic. The collection can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.

cosine measure memento simhash timemap topic warc

Last synced: 08 Nov 2024

https://github.com/hrbrmstr/jwatr

📇 Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit in R

java r r-cyber rstats warc

Last synced: 11 Oct 2024

https://github.com/tokenmill/common-crawl-utils

Various Common Crawl utilities in Clojure.

cdx-api clojure clojure-library common-crawl warc

Last synced: 10 Nov 2024

https://github.com/AlexGustafsson/larch

A self-hosted service and toolset for managing, archiving, viewing and sharing bookmarks

archiver bookmark-manager golang links self-hosted warc

Last synced: 09 Dec 2024

https://github.com/code402/warc-benchmark

Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.

common-crawl commoncrawl warc

Last synced: 16 Nov 2024

https://github.com/miku/ttarc

Minimalistic TikTok trending archiver.

archiving tiktok warc

Last synced: 24 Nov 2024

https://github.com/dlrobertson/warc-c

A WIP WARC parser in C

warc

Last synced: 24 Nov 2024

https://github.com/grey-land/warc-browser

A CLI toolkit for working with web archives

chromedp devtools go golang rod warc web-archive

Last synced: 16 Dec 2024

https://github.com/hrbrmstr/jwatjars

Java '.jar' Files for 'jwatr'

java r rstats warc

Last synced: 15 Nov 2024

https://github.com/q-m/scrapy-webarchive

A plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.

scrapy wacz warc webarchive webarchive-data-scraping

Last synced: 09 Nov 2024

https://github.com/geopjr/archives

[MIRROR] Create and view web archives

archive flash ruffle singlefile wacz warc

Last synced: 14 Dec 2024

https://github.com/govau/wofg-web-filters

Filters for processing Web ARChive (WARC) files as part of the WofG Web Reporting Service

groovy-scripts warc

Last synced: 20 Nov 2024

https://github.com/cldellow/gzip

A fork of java.util.zip.GZIPInputStream that emits the offsets of nested streams.

compression gzip warc

Last synced: 12 Dec 2024

https://github.com/marhop/vim-warc

Vim syntax highlighting for WARC files

vim-plugin warc

Last synced: 08 Nov 2024

https://github.com/marinoandrea/wikidata-entity-linking

CLI to extract named entities from web pages and link them to potential entity candidates in the WikiData knowledge base.

entity-linking nlp trident warc web wikidata

Last synced: 24 Nov 2024

https://github.com/govau/warcraider

Convert WARC files into Avro for big data processing

avro bigquery crawler rust warc

Last synced: 20 Nov 2024

https://github.com/shinosaki/samourai-wallet-blog-archive

Archive of Samourai Wallet Blog/OXT Research Blog (WARC format)

archive samourai-wallet warc

Last synced: 12 Dec 2024

https://0xacab.org/pip/warc.cr

crystal warc

Last synced: 12 Dec 2024