{"id":39497613,"url":"https://github.com/clarin-eric/linkchecker","last_synced_at":"2026-02-22T18:03:29.558Z","repository":{"id":39904172,"uuid":"328695454","full_name":"clarin-eric/linkchecker","owner":"clarin-eric","description":null,"archived":false,"fork":false,"pushed_at":"2025-01-26T17:01:03.000Z","size":500,"stargazers_count":0,"open_issues_count":4,"forks_count":0,"subscribers_count":12,"default_branch":"main","last_synced_at":"2026-01-18T15:18:44.757Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/clarin-eric.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2021-01-11T14:46:19.000Z","updated_at":"2025-01-26T16:58:53.000Z","dependencies_parsed_at":"2024-04-14T20:45:00.479Z","dependency_job_id":null,"html_url":"https://github.com/clarin-eric/linkchecker","commit_stats":null,"previous_names":[],"tags_count":77,"template":false,"template_full_name":null,"purl":"pkg:github/clarin-eric/linkchecker","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clarin-eric%2Flinkchecker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clarin-eric%2Flinkchecker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clarin-eric%2Flinkchecker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clarin-eric%2Flinkchecker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/clarin-eric","download_url":"https://codeload.github.com/clarin-eric/linkchecker/ta
r.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clarin-eric%2Flinkchecker/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29721057,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-22T15:10:41.462Z","status":"ssl_error","status_checked_at":"2026-02-22T15:10:04.636Z","response_time":110,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-18T05:43:44.595Z","updated_at":"2026-02-22T18:03:29.524Z","avatar_url":"https://github.com/clarin-eric.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Link Checker\n## Introduction\nThe Link Checker is a [StormCrawler](https://github.com/DigitalPebble/storm-crawler) \nadaptation for URL checking. Instead of crawling, it checks the status of URLs and\npersists them in a database (currently MariaDB/MySQL). \n\n**Important note**  \nThe Link Checker is not a stand-alone application but a Storm topology that runs inside a cluster. \nFor testing only, we provide a class which runs as a stand-alone application, preferably in your IDE. This \nshould not be run in production. \n\nFor more information on Storm topologies, have a look at the documentation of the [Apache Storm](https://storm.apache.org/releases/2.6.0/Concepts.html) project. 
\n\n## Building and running the Link Checker topology\n### Building the Link Checker topology\n1. Clone this repository to your workspace\n1. Go into the Link Checker directory and build a jar by calling the Maven wrapper with the command  \n`./mvnw clean install`\n\nYou may use your own Maven installation instead of the Maven wrapper to build the topology, but the wrapper is the safe choice\nsince it is tested.\nTherefore, if anything goes wrong at build time, first make sure that you were using the Maven wrapper.\n\n### Setting up a storm cluster\nFor remote cluster setup, have a look at the documentation of the [Apache Storm](https://storm.apache.org/releases/2.6.0/Setting-up-a-Storm-cluster.html) project.\n\n### Deploying the Link Checker topology to the cluster\nTo deploy your Link Checker topology to the cluster, use the command  \n`\u003cstorm directory\u003e/bin/storm jar \u003cLink Checker directory\u003e/target/linkchecker-\u003cversion\u003e.jar org.apache.storm.flux.Flux -e -r -R linkchecker.flux`\n\nFor more information on the parameters, have a look at the [Flux](https://storm.apache.org/releases/2.6.0/flux.html) chapter \nof the Apache Storm documentation. \n\n## Testing in local mode in your IDE\nAs mentioned before, the Link Checker project provides a class to test the Link Checker in your favorite IDE in\nlocal mode without having to set up a cluster.    \n1. Clone this repository into an IDE workspace\n1. Set the environment variables used in src/test/resources/linkchecker-test-conf.yaml in the IDE's run configuration\n1. Execute class eu.clarin.linkchecker.LinkcheckerTestApp (under src/test/java)\n  \n# Simple Explanation of the current implementation\n\nOur SQL database has these tables:\n1. **url:** This is the table that linkchecker reads the URLs to check from. It is populated by another application (in our case curation-module or linkchecker-api).
\n1. **status:** This is the table that linkchecker saves the results into.\n1. **history:** If a URL is checked more than once, the previous checking result is saved in the history table and the record in the status table is updated.   \n1. **obsolete:** A flat table which keeps the records for a while after they are purged from the other tables \n1. **providerGroup**\n1. **context:** The table saves the context (the file or the upload) in which the link is found \n1. **url_context:** Joins the url table n-n to the context table, so that each URL may appear in different contexts. Moreover, the table contains the last time the link was ingested and a boolean flag which indicates whether the join is still active. Only URLs which have at least one active join are considered for checking!\n1. **client:** The table is basically used to identify the link source\n\nThe creation script is available in the [linkchecker-persistence API](https://github.com/clarin-eric/linkchecker-persistence/blob/main/src/main/resources/schema.sql) project. \n\n*linkchecker.flux* defines the components (spouts, bolts and streams) of our topology and loads the configuration file *linkchecker-conf.yaml*.\n1. `eu.clarin.linkchecker.spout.LPASpout` uses the [linkchecker-persistence API](https://github.com/clarin-eric/linkchecker-persistence) to fill up a buffer with URLs to check.\n1. `org.apache.stormcrawler.bolt.URLPartitionerBolt` partitions the URLs by a configured criterion\n1. `eu.clarin.linkchecker.bolt.MetricsFetcherBolt` fetches the URLs. It sends redirects back to URLPartitionerBolt and sends the rest onwards down the stream to StatusUpdaterBolt. It is a modification of `org.apache.stormcrawler.bolt.FetcherBolt`\n1. `eu.clarin.linkchecker.bolt.StatusUpdaterBolt` persists the results in the status table of the database via the [linkchecker-persistence API](https://github.com/clarin-eric/linkchecker-persistence).
\n1. `eu.clarin.linkchecker.bolt.SimpleStackBolt` persists the latest checking results into a Java Object file for use in curation-web\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclarin-eric%2Flinkchecker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fclarin-eric%2Flinkchecker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclarin-eric%2Flinkchecker/lists"}