https://github.com/httpreserve/tikalinkextract

Tika based link (URL) extractor for httpreserve


Host: GitHub
URL: https://github.com/httpreserve/tikalinkextract
Owner: httpreserve
Created: 2017-04-03T02:35:58.000Z (over 9 years ago)
Default Branch: main
Last Pushed: 2025-04-26T19:56:42.000Z (about 1 year ago)
Last Synced: 2025-04-26T20:29:32.546Z (about 1 year ago)
Topics: archives, code4lib, digitalpreservation, httpreserve, iipc, tika, tika-wrapper, url-extractor, webarchiving
Language: HTML
Homepage:
Size: 171 MB
Stars: 10
Watchers: 3
Forks: 0
Open Issues: 6
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml

Awesome Lists containing this project

webarchiving-awesome-graph - tikalinkextract - Extract hyperlinks as a seed for web archiving from folders of document types that can be parsed by Apache Tika (Golang, Apache Tika Server). 💽 ⭐ 11 👀 2 (Tools & Software / Utilities)
awesome-web-archiving - tikalinkextract - Extract hyperlinks as a seed for web archiving from folders of document types that can be parsed by Apache Tika (Golang, Apache Tika Server). *(In Development)* (Tools & Software / Utilities)

README

# tikalinkextract

Tika client for httpreserve.

## About

tikalinkextract requires a running Tika HTTP server. It batch-submits the files
you point it at to the server's text extraction mechanism, scans the extracted
text for hyperlinks, and writes them to stdout. There are examples you can try
below.

More information is available on the OPF website:
[Hyperlinks in your files? How to get them out using tikalinkextract][opf-1]

[opf-1]: https://openpreservation.org/blogs/hyperlinks-in-your-files-how-to-get-them-out-using-tikalinkextract/

## Demo

[![asciicast](https://asciinema.org/a/143271.png)](https://asciinema.org/a/143271)

## Use with Wget

### Extract the links from your files using the `-seeds` option

```sh
./tikalinkextract -seeds -file archives-nz-demo/ > transferlinks.txt
```

### Use the seeds to generate a WARC file

```sh
wget -T 10 --tries=1 --page-requisites --span-hosts --convert-links --execute robots=off --adjust-extension --no-directories --directory-prefix=output --warc-cdx --warc-file=accession --wait=0.1 --user-agent=httpreserve-wget/0.0.1 -i transferlinks.txt
```

See [explainshell.com][explain-1] for a breakdown of each option.

[explain-1]: https://explainshell.com/explain?cmd=wget+-T+10+--tries%3D1+--page-requisites+--span-hosts+--convert-links++--execute+robots%3Doff+--adjust-extension+--no-directories+--directory-prefix%3Doutput+--warc-cdx+--warc-file%3Daccession+--wait%3D0.1+--user-agent%3Dhttpreserve-wget%2F0.0.1+-i+transferlinks.txt

## Resources that might be useful

* [REGEX Guru: Detecting URLs in a block of text][regex-1]

[regex-1]: http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/

## License

Tika is licensed under the [Apache License 2.0][tika-license].

This tool is licensed under the [GNU General Public License Version 3](LICENSE).

[tika-license]: http://www.apache.org/licenses/
