https://github.com/internetarchive/strainer
Heritrix frontier files manipulation tool.
crawling frontier heritrix
- Host: GitHub
- URL: https://github.com/internetarchive/strainer
- Owner: internetarchive
- License: agpl-3.0
- Created: 2021-06-22T15:44:06.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2021-06-23T13:51:08.000Z (almost 5 years ago)
- Last Synced: 2025-12-27T01:59:11.839Z (5 months ago)
- Topics: crawling, frontier, heritrix
- Language: Go
- Homepage:
- Size: 38.1 KB
- Stars: 4
- Watchers: 16
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Strainer
Parse Heritrix frontiers, build seed lists, exclude hosts, crawl better.
## What's strainer?
Strainer is a tool built to manipulate Heritrix frontier files.
It takes one or more frontier files, gzip-compressed or not, and builds a new seed list from them.
It can exclude specific hosts, cap the number of times each host appears in the final seed list, and report statistics about your frontiers.
It's still a WIP, but it's usable right now.
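To make the host exclusion and per-host cap described above concrete, here is a minimal Go sketch. It is not Strainer's actual code: the `filterSeeds` function, the exact-match comparison on host names, and the in-memory counter (standing in for the on-disk key/value database) are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// filterSeeds is a hypothetical helper, not part of Strainer's API.
// It keeps URLs whose host is not excluded and has not already been
// accepted maxPerHost times. A negative maxPerHost means "no limit",
// mirroring the documented -1 default of --max-host-occurrence.
func filterSeeds(urls []string, excluded map[string]bool, maxPerHost int) []string {
	seen := make(map[string]int) // assumption: in-memory stand-in for the key/value database
	var seeds []string
	for _, raw := range urls {
		u, err := url.Parse(raw)
		if err != nil || u.Hostname() == "" {
			continue // skip entries that are not absolute URLs
		}
		host := u.Hostname()
		if excluded[host] {
			continue // assumption: hosts are matched exactly, not by suffix
		}
		if maxPerHost >= 0 && seen[host] >= maxPerHost {
			continue // this host already reached its cap
		}
		seen[host]++
		seeds = append(seeds, raw)
	}
	return seeds
}

func main() {
	urls := []string{
		"http://example.com/a",
		"http://example.com/b",
		"http://example.com/c",
		"http://spam.example/x",
		"http://example.org/",
	}
	excluded := map[string]bool{"spam.example": true}
	for _, seed := range filterSeeds(urls, excluded, 2) {
		fmt.Println(seed)
	}
}
```

With a cap of 2, the third `example.com` URL is dropped, and `spam.example` is removed entirely because it is on the exclusion list.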
## Usage
```
usage: strainer [-h|--help] -f|--file "<value>" [-f|--file "<value>" ...]
                [-m|--max-host-occurrence <integer>] [-e|--excluded-hosts
                "<value>" [-e|--excluded-hosts "<value>" ...]] [--temp-dir
                "<value>"]

                manipulate Heritrix frontier files

Arguments:

  -h  --help                 Print help information
  -f  --file                 Frontier file(s) to process, can be .gz files.
  -m  --max-host-occurrence  Max number of a occurrence of a given host to
                             accept in the final seed list. If an host is
                             parsed more than X times, new occurrences of that
                             host past that limit will be excluded. -1 value
                             means no limit. Default: -1
  -e  --excluded-hosts       Specific hosts to exclude from the final seed
                             list.
      --temp-dir             Temporary directory to use for the key/value
                             database. Default: /tmp
```