{"id":39496881,"url":"https://github.com/clarin-eric/harvest-manager","last_synced_at":"2026-01-18T05:43:11.403Z","repository":{"id":14864864,"uuid":"17588278","full_name":"clarin-eric/harvest-manager","owner":"clarin-eric","description":"A simple Java application for managing an OAI-PMH harvesting workflow","archived":false,"fork":false,"pushed_at":"2025-10-26T15:59:03.000Z","size":1699,"stargazers_count":14,"open_issues_count":20,"forks_count":12,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-10-26T17:48:04.807Z","etag":null,"topics":["cmdi","oai-pmh","vlo"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/clarin-eric.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.md","contributing":null,"funding":null,"license":"LICENSE-gpl-3.0.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2014-03-10T09:56:56.000Z","updated_at":"2025-10-26T15:59:07.000Z","dependencies_parsed_at":"2025-09-10T01:53:12.409Z","dependency_job_id":"b953debd-abf1-448a-bd0b-9ef842ad93d6","html_url":"https://github.com/clarin-eric/harvest-manager","commit_stats":null,"previous_names":["clarin-eric/harvest-manager"],"tags_count":14,"template":false,"template_full_name":null,"purl":"pkg:github/clarin-eric/harvest-manager","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clarin-eric%2Fharvest-manager","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clarin-eric%2Fharvest-manager/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/clarin-eric%2Fharvest-manager/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clarin-eric%2Fharvest-manager/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/clarin-eric","download_url":"https://codeload.github.com/clarin-eric/harvest-manager/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clarin-eric%2Fharvest-manager/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28531366,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-18T00:39:45.795Z","status":"online","status_checked_at":"2026-01-18T02:00:07.578Z","response_time":98,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cmdi","oai-pmh","vlo"],"created_at":"2026-01-18T05:43:11.328Z","updated_at":"2026-01-18T05:43:11.393Z","avatar_url":"https://github.com/clarin-eric.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"harvest-manager\n===================\n\nThe Harvest Manager is a Java application for managing (OAI-PMH)\nharvesting. 
It is intended to allow definition of a harvesting\nworkflow (involving OAI harvesting and subsequent operations like\ntransformations or mappings of metadata between schemata) in a few\nminutes using a configuration file only.\n\nThis application contains a modified version of the \n[OCLC harvester2 library](https://github.com/OCLC-Research/oaiharvester2)\n([license](http://www.apache.org/licenses/LICENSE-2.0.html)),\nwhich implements the OAI-PMH requests.\n\nSince version 2.0 it is possible to extend the harvester with protocols other than OAI-PMH.\n\n\n# Basic Glossary\n\nIn OAI-PMH, an individual metadata datum is called a\n**record**. Clients, such as this application, that fetch records are\ncalled **harvesters**. The server application from which records are\nobtained is called a **provider**. The base URL of the provider\n(i.e. the request URL without any parameters) is also called an\nOAI-PMH **endpoint**.\n\n\n# Building\n\nBuilding this app requires JDK 11 and Apache Maven. It can be built\nsimply using the command:\n\n```mvn clean install```\n\nIf you use a Java IDE, it is highly likely it also offers a simple way\nto do the above.\n\nYou can also use the `build.sh` script to run a build within an environment\nprovisioned with suitable versions of the JDK and Maven. This requires Docker.\n\nThe above build process creates a package named\n`target/oai-harvest-manager-x.y.z.tar.gz` (where x.y.z is a version number).\n\n# Running the Application\n\nThere are no installation instructions to speak of: simply unpack the\nabove package wherever you like. Make sure, however, that the system\ncan find `java`. The deployment package contains a script to start the app,\n`run-harvester.sh` (for Unix systems including Mac OS X; we can add a\nWindows batch file if anyone wants it). The simplest usage is:\n\n```run-harvester.sh config.xml```\n\nwhere `config.xml` is the configuration file you wish to use.\nAdditionally, parameters can be defined on the command line. 
For\nexample:\n\n```run-harvester.sh timeout=30 config.xml```\n\nwill set the connection timeout to 30 seconds. This value will\noverride the timeout value defined in `config.xml`, if any. The first\nparameter that does not contain `=` is taken as the configuration file\nname.\n\nIf you used `build.sh` to run a build, you can use `run.sh config.xml` to run that build.\n\n\n# Configuration\n\nThe behaviour of the app is determined by a single configuration\nfile. The configuration file is composed of four sections:\n\n- *settings*, where options such as directory paths and timeouts are\n   set;\n- *directories*, where output paths are defined;\n- *actions*, the most complex section, where sequences of actions can\n   be defined for different metadata formats (actions include semantic\n   transformations and saving intermediary or final results into a\n   file); and\n- *providers*, where endpoints for the providers to be harvested are\n   listed.\n\nTo get a clear idea of the structure of the configuration file, see\nthe [sample configuration files](src/main/resources) or the \n[CLARIN configuration files](https://github.com/clarin-eric/oai-harvest-config) in \njuxtaposition with the explanation for each section below.\n\n## Configuring Settings\n\nThe configuration parameters in this section govern the working\ndirectory (all output directories will be interpreted relative to it);\nconnection limits including retry count, connection delay and timeout;\nthread control settings, including the resource pool size (which can\nbe reduced to lessen memory footprint, or increased to speed up\nprocessing if resources are plentiful); and settings related to\nincremental harvesting.\n\nSet the `dry-run` setting to `true` to run the harvester without making\nthe actual harvest requests to the OAI-PMH endpoints.\n\n## Configuring Directories\n\nThe output paths listed in this section must each be given a unique\nidentifier. 
Additionally, the `max-files` attribute can be used to set\na limit on the number of files in a single directory. If this is\nnon-zero, subdirectories will be created in such a way that each\nsubdirectory has at most `max-files` files in it. The usefulness of\nthis setting largely depends on the total number of records you expect\nto store in a single directory and the file system used.\n\n## Configuring Actions\n\nMultiple action sequences can be defined in this section. Each\nsequence corresponds to a format specification followed by a number of\nsequential actions.\n\nThe **format definition** is made up of a match type (attribute\n*match*) and match value (attribute *value*). The match type is one of\n```prefix```, which simply specifies an OAI-PMH metadata prefix,\n```namespace```, and ```schema```. When one of the latter two types is\nused, the harvest manager will contact the provider with a\n```ListMetadataFormats``` query and choose *all* metadata prefixes\nthat correspond to the specified namespace or schema.\n\nThe **actions** are manipulations of one or more metadata records, each of\nwhich operates on the result of the previous action. A number of\naction types are available:\n\n- The *save* action stores the record in a new file in a specified\n  output directory, specified by an identifier matching one of the\n  directories defined in the previous section. The attribute *suffix*\n  can be used to specify the file extension (the most typical value\n  being ```suffix=\".xml\"```). If the attribute *group-by-provider* is\n  specified, a separate subdirectory will be created for each\n  endpoint. If the attribute *history* is set to true, a history file\n  will be created.\n\n- The *split* action splits an OAI-PMH envelope that contains multiple records\n  into individual records. 
It retains the part of the OAI-PMH envelope that\n  is specific to the record, such as the date it was fetched\n  and its OAI-PMH identifier, and the actual metadata record itself.\n\n- The *strip* action removes the OAI-PMH envelope and retains only the\n  actual metadata record. Note that the envelope contains information\n  not found within the record itself, such as the date it was fetched\n  and its OAI-PMH identifier.\n\n- The *transform* action applies a mapping, defined in an XSLT file,\n  to the metadata record. This can be used, among other things, for\n  semantic mapping between metadata schemata. See the included\n  configuration files for an example. The XSLT receives the following parameters:\n  1. ```config```: the configuration file used\n  2. ```provider_name```: the provider name\n  3. ```provider_uri```: the endpoint\n  4. ```record_identifier```: the id of the record to transform\n\nFor each provider, the first format definition that the provider\nsupports will determine the action sequence to be executed. If one of\nthe actions in a sequence fails, the subsequent actions are not\ncarried out and an error message is logged (but processing of any\nother metadata record is unaffected).\n\n## Configuring Providers\n\nFor each provider, the following can be defined:\n\n- The *url* attribute (mandatory) specifies the endpoint. Any URL\n  parameters (for example, `?verb=Identify` is commonly included\n  when endpoint addresses are discussed) are unnecessary and will be\n  stripped off automatically.\n\n- The *name* attribute specifies the name to use for the provider\n  (which may in turn determine file paths, depending on other\n  settings). If no name is specified, the provider will be contacted\n  and the name from its `Identify` response used. 
If no valid response\n  is received within a reasonable time, a generic string like\n  **Unnamed provider at oai.xyz.org** is used instead.\n\n- The attribute *static*, when set to true, indicates that the\n  provider is static. See the section below on static providers for\n  details.\n\n- Some of the global configuration options (retry count, connection\n  delay and timeout) can be overridden for a specific provider by\n  adding them as attributes to the provider element. \n\n- The attribute *exclusive*, when set to true, indicates that the\n  provider should be harvested on its own, i.e. no other harvesting threads \n  should be active. This can be useful when a provider has some very large records.\n\n- The provider element may contain multiple *set* child elements,\n  which specify the names of OAI-PMH sets to be harvested.\n\nThere is also a special case where provider names may be imported from\na *centre registry*. So far, this registry is only used by the CLARIN community.\nThe registry is specified by its URL. All the provider endpoints defined in the\nregistry will be harvested. Sometimes, it might be necessary to exclude an\nendpoint from the ones defined in the registry. This can be done by specifying\nits URL in the configuration file used for harvesting. In other cases,\nan endpoint loaded from the registry needs its own configuration (for example, a specific timeout);\nthis can be done in a similar vein to excluding. Please review the\ninstructions in the configuration files supplied in the package. \n\n# Static Providers\n\nThis app provides support for a special case: harvesting directly from\na *static* provider, as defined in the [OAI static repository\nguidelines](http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm).\n\nEssentially, a static repository is a provider that only has to make\navailable a single XML file which contains all of its records. 
The\nmethod intended by the OAI-PMH family of standards for dealing with\nthis situation is that the static repository uses a *gateway* to\nmediate access, so that harvesters may access its metadata via\nstandard OAI-PMH requests through the gateway. The OAI Harvest Manager\nallows direct harvesting of the XML file, bypassing any\nintermediary. This allows harvesting in a very efficient manner, as\nonly a single file needs to be transferred in place of possibly\nthousands of individual OAI-PMH requests.\n\nPlease note that this type of use is beyond the scope of the OAI-PMH\nstandard and should be viewed as an option for implementation\nefficiency that sacrifices some compliance with standards.\n\nTo use a static provider, specify the URL of the XML file as the\nendpoint and set the attribute *static* for that provider in the\nconfiguration file to true. Records harvested from static providers\nonly have a minimal envelope that includes datestamp (of the record)\nand identifier but excludes request-specific attributes such as\nresponse datestamps.\n\n# Logging\n\nThe harvester will create the directory `log` in which log files will reside.\nAlternatively, you can specify a directory for these by defining the `LOG_DIR`\nenvironment variable. A log file per provider will be created, which is\nconvenient for debugging specific providers.\n\n# Implementation Notes\n\nProcessing for each provider runs in a separate thread. It is not\npossible to target a single provider with multiple threads (except in\nthe special case where sets are used; then it is possible to mention\nthe provider multiple times in the provider list, each with different\nset(s), and the multiple references to the same provider will then be\ntreated like different providers).\n\nFor efficiency, thread pools containing prepared action objects are\nconstructed for each action referenced in the actions section of the\nconfiguration file. 
Different action sequences share the same pool for\nthe exact same action. Consider the following example, assuming that\nthe configuration parameter *resource-pool-size* is set to 5:\n\n```xml\n    \u003cformat match=\"namespace\" value=\"http://www.clarin.eu/cmd/\"\u003e\n      \u003caction type=\"save\" dir=\"orig\"/\u003e\n      \u003caction type=\"strip\"/\u003e\n      \u003caction type=\"save\" dir=\"cmdi\" history=\"true\"/\u003e\n    \u003c/format\u003e\n    \u003cformat match=\"prefix\" value=\"olac\"\u003e\n      \u003caction type=\"save\" dir=\"orig\"/\u003e\n      \u003caction type=\"strip\"/\u003e\n      \u003caction type=\"save\" dir=\"olac\" group-by-provider=\"false\"/\u003e\n    \u003c/format\u003e\n```\n\nIn this case, a total of 15 objects are pooled for the save actions: 5\nfor saving to the directory ```orig``` in a pool shared by the two\naction sequences, and 5 each for the directories ```cmdi``` and\n```olac```, only used by one action sequence each.\n\nThe pooling implementation is particularly important when\ntransformations are used, as preparing a transformation object\ninvolves parsing the XSLT, potentially a time-consuming process.\n\n# Extensions\n\nSince 2.0 it is possible to go beyond the OAI protocol and the built-in actions. To do so, Java reflection is used.\n\n## Protocols\n\nTo add a new protocol, the Protocol interface at\n[nl.mpi.oai.harvester.protocol.Protocol](src/main/java/nl/mpi/oai/harvester/protocol/Protocol.java) has to be implemented. In the configuration one can tell the manager which protocol to load, e.g.\n\n```xml\n\u003cconfig\u003e\n  ...\n  \u003cprotocol\u003enl.mpi.oai.harvester.protocol.NdeProtocol\u003c/protocol\u003e\n  ...\n\u003c/config\u003e\n```\n\n## Actions\n\nTo add a new action, the Action interface at\n[nl.mpi.oai.harvester.action.Action](src/main/java/nl/mpi/oai/harvester/action/Action.java) has to be implemented. 
In the configuration one can tell the manager which action to load, e.g.\n\n```xml\n\u003cconfig\u003e\n  ...\n  \u003cactions\u003e\n    \u003cformat match=\"type\" value=\"*\"\u003e\n      ...\n      \u003caction type=\"nl.mpi.oai.harvester.action.NDESplitAction\"/\u003e\n      ...\n    \u003c/format\u003e\n    ...\n  \u003c/actions\u003e\n\u003c/config\u003e\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclarin-eric%2Fharvest-manager","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fclarin-eric%2Fharvest-manager","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclarin-eric%2Fharvest-manager/lists"}