https://github.com/httpreserve/tikalinkextract
Tika based link (URL) extractor for httpreserve
archives code4lib digitalpreservation httpreserve iipc tika tika-wrapper url-extractor webarchiving
- Host: GitHub
- URL: https://github.com/httpreserve/tikalinkextract
- Owner: httpreserve
- Created: 2017-04-03T02:35:58.000Z (about 7 years ago)
- Default Branch: main
- Last Pushed: 2021-06-02T08:58:57.000Z (about 3 years ago)
- Last Synced: 2024-02-03T04:34:01.433Z (5 months ago)
- Topics: archives, code4lib, digitalpreservation, httpreserve, iipc, tika, tika-wrapper, url-extractor, webarchiving
- Language: HTML
- Homepage:
- Size: 171 MB
- Stars: 8
- Watchers: 4
- Forks: 1
- Open Issues: 7
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
Lists
- awesome-web-archiving - tikalinkextract - Extract hyperlinks as a seed for web archiving from folders of document types that can be parsed by Apache Tika (Golang, Apache Tika Server). *(In Development)* (Tools & Software / Utilities)
README
# tika-httpreserve
Tika client for httpreserve
## Demo
[![asciicast](https://asciinema.org/a/143271.png)](https://asciinema.org/a/143271)
## Use with Wget
**Extract the links from your files using the `-seeds` option**

    ./tikalinkextract -seeds -file archives-nz-demo/ > transferlinks.txt

**Use the seeds to generate a WARC file**

    wget --page-requisites --span-hosts --convert-links --execute robots=off --adjust-extension --no-directories --directory-prefix=output --warc-cdx --warc-file=accession.warc --wait=0.1 --user-agent=httpreserve-wget/0.0.1 -i transferlinks.txt
### Known Issues
* HTTP links that are split across lines, and therefore contain a newline (`\n`) character, may not be extracted correctly.
### Resources that might help
* [REGEX Guru: Detecting URLS in text](http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/)
## License
Tika is licensed as follows: http://www.apache.org/licenses/
This tool is licensed under the GNU General Public License, Version 3. [Full Text](LICENSE)