https://github.com/httpreserve/tikalinkextract
Tika based link (URL) extractor for httpreserve
- Host: GitHub
- URL: https://github.com/httpreserve/tikalinkextract
- Owner: httpreserve
- Created: 2017-04-03T02:35:58.000Z (almost 9 years ago)
- Default Branch: main
- Last Pushed: 2025-04-26T19:56:42.000Z (11 months ago)
- Last Synced: 2025-04-26T20:29:32.546Z (11 months ago)
- Topics: archives, code4lib, digitalpreservation, httpreserve, iipc, tika, tika-wrapper, url-extractor, webarchiving
- Language: HTML
- Homepage:
- Size: 171 MB
- Stars: 10
- Watchers: 3
- Forks: 0
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
Awesome Lists containing this project
- awesome-web-archiving - tikalinkextract - Extract hyperlinks as a seed for web archiving from folders of document types that can be parsed by Apache Tika (Golang, Apache Tika Server). *(In Development)* (Tools & Software / Utilities)
- webarchiving-awesome-graph - tikalinkextract - Extract hyperlinks as a seed for web archiving from folders of document types that can be parsed by Apache Tika (Golang, Apache Tika Server). 💽 (Tools & Software / Utilities)
README
# tikalinkextract
Tika client for httpreserve.
## About
tikalinkextract requires a running Apache Tika HTTP server. Given a folder of
files, it sends each one to the server for text extraction, searches the
extracted text for hyperlinks, and writes the links it finds to stdout.
Examples are given below.
More information is available on the OPF website:
[Hyperlinks in your files? How to get them out using tikalinkextract][opf-1]
[opf-1]: https://openpreservation.org/blogs/hyperlinks-in-your-files-how-to-get-them-out-using-tikalinkextract/
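The flow described in this section can be sketched in Go as follows. This is a
minimal illustration, not tikalinkextract's actual source. It assumes Tika
Server is listening on its default port 9998 and that its `/tika` endpoint
returns plain text for a document sent with `PUT`. The names `extractText`,
`findLinks`, and `linkPattern` are hypothetical, and the regular expression is
deliberately simple.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"regexp"
	"strings"
)

// tikaURL is an assumption: a local Tika Server on its default port.
const tikaURL = "http://localhost:9998/tika"

// linkPattern is a simple, hypothetical pattern for http and https links.
// It is not the expression tikalinkextract itself uses.
var linkPattern = regexp.MustCompile(`https?://[^\s<>"']+`)

// extractText sends one file to the Tika server and returns its plain text.
func extractText(client *http.Client, path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	req, err := http.NewRequest(http.MethodPut, tikaURL, f)
	if err != nil {
		return "", err
	}
	req.Header.Set("Accept", "text/plain")

	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("tika returned %s for %s", resp.Status, path)
	}
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// findLinks returns the unique links in text, in order of first appearance,
// with trailing sentence punctuation trimmed.
func findLinks(text string) []string {
	seen := make(map[string]bool)
	var links []string
	for _, m := range linkPattern.FindAllString(text, -1) {
		m = strings.TrimRight(m, ".,;:!?")
		if !seen[m] {
			seen[m] = true
			links = append(links, m)
		}
	}
	return links
}

func main() {
	client := &http.Client{}
	// Each command-line argument is treated as a path to one document.
	for _, path := range os.Args[1:] {
		text, err := extractText(client, path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		for _, link := range findLinks(text) {
			fmt.Println(link)
		}
	}
}
```

Keeping `findLinks` separate from the HTTP call means the link detection can be
tried on any string without a Tika server running.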
## Demo
[Watch the demo on asciinema](https://asciinema.org/a/143271)
## Use with Wget
### Extract the links from your files using the `-seeds` option
```sh
./tikalinkextract -seeds -file archives-nz-demo/ > transferlinks.txt
```
### Use the seeds to generate a WARC file
```sh
wget -T 10 --tries=1 --page-requisites --span-hosts --convert-links --execute robots=off --adjust-extension --no-directories --directory-prefix=output --warc-cdx --warc-file=accession --wait=0.1 --user-agent=httpreserve-wget/0.0.1 -i transferlinks.txt
```
See [explainshell.com][explain-1] for an explanation of each option.
[explain-1]: https://explainshell.com/explain?cmd=wget+-T+10+--tries%3D1+--page-requisites+--span-hosts+--convert-links++--execute+robots%3Doff+--adjust-extension+--no-directories+--directory-prefix%3Doutput+--warc-cdx+--warc-file%3Daccession+--wait%3D0.1+--user-agent%3Dhttpreserve-wget%2F0.0.1+-i+transferlinks.txt
## Resources that might be useful
* [REGEX Guru: Detecting URLS in text][regex-1]
[regex-1]: http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/
## License
Tika is licensed under the [Apache License 2.0][tika-license].
This tool is licensed under the [GNU General Public License Version 3](LICENSE).
[tika-license]: http://www.apache.org/licenses/