https://github.com/sebastian-nagel/warc-crawler
Process web archives (WARC format) with StormCrawler and index content into Elasticsearch or Solr
- Host: GitHub
- URL: https://github.com/sebastian-nagel/warc-crawler
- Owner: sebastian-nagel
- Created: 2020-06-23T18:00:11.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2022-11-16T01:46:53.000Z (almost 3 years ago)
- Last Synced: 2023-03-01T10:40:50.353Z (over 2 years ago)
- Topics: apache-storm, elasticsearch, solr, stormcrawler, warc, warc-files, web-archives
- Language: FLUX
- Homepage:
- Size: 44.9 KB
- Stars: 6
- Watchers: 4
- Forks: 1
- Open Issues: 1