{"id":33124119,"url":"https://github.com/httpreserve/tikalinkextract","last_synced_at":"2026-01-17T02:10:21.923Z","repository":{"id":144429474,"uuid":"87032181","full_name":"httpreserve/tikalinkextract","owner":"httpreserve","description":"Tika based link (URL) extractor for httpreserve","archived":false,"fork":false,"pushed_at":"2025-04-26T19:56:42.000Z","size":179813,"stargazers_count":10,"open_issues_count":6,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-26T20:29:32.546Z","etag":null,"topics":["archives","code4lib","digitalpreservation","httpreserve","iipc","tika","tika-wrapper","url-extractor","webarchiving"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/httpreserve.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"ko_fi":"https://ko-fi.com/beet_keeper"}},"created_at":"2017-04-03T02:35:58.000Z","updated_at":"2025-04-26T19:56:52.000Z","dependencies_parsed_at":"2025-04-26T20:35:38.161Z","dependency_job_id":null,"html_url":"https://github.com/httpreserve/tikalinkextract","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/httpreserve/tikalinkextract","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/httpreserve%2Ftikalinkextract","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/httpreserve%2Ftikalinkextract/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/httpreserve%2
Ftikalinkextract/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/httpreserve%2Ftikalinkextract/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/httpreserve","download_url":"https://codeload.github.com/httpreserve/tikalinkextract/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/httpreserve%2Ftikalinkextract/sbom","scorecard":{"id":471289,"data":{"date":"2025-08-11","repo":{"name":"github.com/httpreserve/tikalinkextract","commit":"dd0962c7482548aed7a09a680c63e65a93e3f21b"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are 
merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Binary-Artifacts","score":7,"reason":"binaries present in source code","details":["Warn: binary detected: test-files/exe/go-exe:1","Warn: binary detected: test-files/exe/go-exe.exe:1","Warn: binary detected: tools/tika-server-standard-3.1.0.jar:1"],"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security 
policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"License","score":0,"reason":"license file not detected","details":["Warn: project does not have a license file"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Signed-Releases","score":0,"reason":"Project has not signed or included provenance with any releases.","details":["Warn: release artifact tle-0.0.3 not signed: https://api.github.com/repos/httpreserve/tikalinkextract/releases/16158288","Warn: release artifact 0.0.2 not signed: https://api.github.com/repos/httpreserve/tikalinkextract/releases/8206169","Warn: release artifact tle-0.0.3 does not have provenance: https://api.github.com/repos/httpreserve/tikalinkextract/releases/16158288","Warn: release artifact 0.0.2 does not have provenance: https://api.github.com/repos/httpreserve/tikalinkextract/releases/8206169"],"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'main'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}}]},"last_synced_at":"2025-08-19T13:56:50.539Z","repository_id":144429474,"created_at":"2025-08-19T13:56:50.539Z","updated_at":"2025-08-19T13:56:50.539Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28492057,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T00:50:05.742Z","status":"online","status_checked_at":"2026-01-17T02:00:07.808Z","response_time":85,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archives","code4lib","digitalpreservation","httpreserve","iipc","tika","tika-wrapper","url-extractor","webarchiving"],"created_at":"2025-11-15T06:00:31.912Z","updated_at":"2026-01-17T02:10:21.917Z","avatar_url":"https://github.com/h
ttpreserve.png","language":"HTML","funding_links":["https://ko-fi.com/https://ko-fi.com/beet_keeper"],"categories":["Tools \u0026 Software"],"sub_categories":["Utilities"],"readme":"\u003c!--markdownlint-disable--\u003e\n\n\u003cdiv\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg height=\"300px\" width=\"300px\" id=\"logo\" src=\"https://github.com/httpreserve/httpreserve/raw/main/src/images/httpreserve-logo.png\" alt=\"httpreserve\"/\u003e\n\u003c/p\u003e\n\u003c/div\u003e\n\n\u003c!--markdownlint-enable--\u003e\n\n# tikalinkextract\n\nTika client for httpreserve.\n\n## About\n\nTikalinkextract requires users to start the Tika HTTP server first. It then\nautomates the batch processing of your files through Tika's text extraction\nmechanism. The extracted text is searched for hyperlinks, which are output to\nstdout. There are examples you can try below.\n\nMore information is available on the OPF website:\n[Hyperlinks in your files? How to get them out using tikalinkextract][opf-1]\n\n[opf-1]: https://openpreservation.org/blogs/hyperlinks-in-your-files-how-to-get-them-out-using-tikalinkextract/\n\n## Demo\n\n[![asciicast](https://asciinema.org/a/143271.png)](https://asciinema.org/a/143271)\n\n## Use with Wget\n\n### Extract the links from your files using the seeds option\n\n```sh\n./tikalinkextract -seeds -file archives-nz-demo/ \u003e transferlinks.txt\n```\n\n### Use the seeds to generate a WARC file\n\n\u003c!--markdownlint-disable--\u003e\n\n```sh\nwget -T 10 --tries=1 --page-requisites --span-hosts --convert-links  --execute robots=off --adjust-extension --no-directories --directory-prefix=output --warc-cdx --warc-file=accession --wait=0.1 --user-agent=httpreserve-wget/0.0.1 -i transferlinks.txt\n```\n\nSee [explainshell.com][explain-1]\n\n[explain-1]: 
https://explainshell.com/explain?cmd=wget+-T+10+--tries%3D1+--page-requisites+--span-hosts+--convert-links++--execute+robots%3Doff+--adjust-extension+--no-directories+--directory-prefix%3Doutput+--warc-cdx+--warc-file%3Daccession+--wait%3D0.1+--user-agent%3Dhttpreserve-wget%2F0.0.1+-i+transferlinks.txt\n\n\u003c!--markdownlint-enable--\u003e\n\n## Resources that might be useful\n\n* [REGEX Guru: Detecting URLs in text][regex-1]\n\n[regex-1]: http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/\n\n## License\n\nTika is licensed under the [Apache License 2.0][tika-license].\n\nThis tool is licensed under the [GNU General Public License Version 3](LICENSE).\n\n[tika-license]: http://www.apache.org/licenses/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhttpreserve%2Ftikalinkextract","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhttpreserve%2Ftikalinkextract","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhttpreserve%2Ftikalinkextract/lists"}