{"id":22302595,"url":"https://github.com/dataoneorg/dataone-indexer","last_synced_at":"2025-07-29T03:33:17.159Z","repository":{"id":38069362,"uuid":"501039037","full_name":"DataONEorg/dataone-indexer","owner":"DataONEorg","description":"DataONE Indexer subsystem","archived":false,"fork":false,"pushed_at":"2024-11-21T23:17:36.000Z","size":5023,"stargazers_count":0,"open_issues_count":36,"forks_count":2,"subscribers_count":10,"default_branch":"main","last_synced_at":"2024-11-21T23:26:03.651Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DataONEorg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-07T23:52:14.000Z","updated_at":"2024-07-25T19:18:48.000Z","dependencies_parsed_at":"2023-10-26T21:31:03.789Z","dependency_job_id":"4f9e0a64-63f2-443d-974f-7aa0d70fb271","html_url":"https://github.com/DataONEorg/dataone-indexer","commit_stats":null,"previous_names":[],"tags_count":58,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataONEorg%2Fdataone-indexer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataONEorg%2Fdataone-indexer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataONEorg%2Fdataone-indexer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataONEorg%2Fdataone-indexer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/D
ataONEorg","download_url":"https://codeload.github.com/DataONEorg/dataone-indexer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227977906,"owners_count":17850475,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-03T18:39:59.461Z","updated_at":"2025-07-29T03:33:17.137Z","avatar_url":"https://github.com/DataONEorg.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DataONE Indexer\n\nAlso see [RELEASE-NOTES.md](./RELEASE-NOTES.md)\n\nThe DataONE Indexer is a system that processes index tasks created by other components. The DataONE\nIndexer comprises three main subsystems, each defined by its own helm subsystem chart:\n\n- **index-worker**: a subsystem implementing a Worker class to process index jobs in parallel\n- **rabbitmq**: a deployment of the RabbitMQ queue management system\n- **solr**: a deployment of the SOLR full text search system\n\n```mermaid\nflowchart TB\n  subgraph \"`**DataONE Indexer\n  Helm Chart**`\"\n    A(\"`Index Worker\n    *(1..n)*`\")\n    A -- sub chart --\u003e C(\"`RabbitMQ\n    *(1..n)*`\")\n    A -- sub chart --\u003e D(\"`solr\n    *(1..n)*`\")\n  end\n```\n\nClients are expected to register index task messages to be processed in the RabbitMQ queue. Upon\nstartup, the RabbitMQ workers register themselves as handlers of the index task messages. 
As\nmessages enter the queue, RabbitMQ dispatches these to registered workers in parallel, and workers\nin turn process each associated object and insert a new index entry into SOLR.\n\nSee [LICENSE.md](./LICENSE.md) for the details of distributing this software.\n\n## Building Docker image\n\nThe image can be built with either `docker` or `nerdctl`, depending on which container environment\nyou have installed. For example, using Rancher Desktop configured to use `nerdctl`:\n\n```shell\nmvn clean package -DskipTests\nnerdctl build -t dataone-index-worker:2.4.0 -f docker/Dockerfile --build-arg TAG=2.4.0 .\n```\n\nIf you are building locally for Kubernetes on Rancher Desktop, you'll need to set the namespace\nto `k8s.io` using a build command such as:\n\n```shell\nmvn clean package -DskipTests\nnerdctl build -t dataone-index-worker:2.4.0 -f docker/Dockerfile --build-arg TAG=2.4.0 \\\n         --namespace k8s.io  .\n```\n\n## Publish the image to GHCR\n\nFor the built image to be deployable in a remote Kubernetes cluster, it must first be published to\nan image registry that is visible to Kubernetes. For example, we can make the published image\navailable via the GitHub Container Registry (ghcr.io) so that it can be pulled by\nKubernetes. 
For this to work, the image must be tagged with the ghcr.io URL, so it can be published.\nThen the image can be pushed to the registry after logging in with a suitable GitHub PAT.\n\nNote that, for the image to be associated with a particular GitHub repository, a metadata LABEL can\nbe added to the image that associates it when it is built - see this entry in the Dockerfile:\n\n```dockerfile\nLABEL org.opencontainers.image.source=\"https://github.com/dataoneorg/dataone-indexer\"\n```\n\nCommands for pushing the built image (example assuming tag is `2.4.0`):\n\n```shell\nGITHUB_PAT=\"your-own-secret-GitHub-Personal-Access-Token-goes-here\"\nTAG=2.4.0\n\nnerdctl tag dataone-index-worker:$TAG ghcr.io/dataoneorg/dataone-index-worker:$TAG\necho $GITHUB_PAT | nerdctl login ghcr.io -u DataONEorg --password-stdin\nnerdctl push ghcr.io/dataoneorg/dataone-index-worker:$TAG\n```\n\nOnce the image has been pushed, it may be private and will need to be made public and assigned to\nthe `dataone-indexer` repository if the LABEL wasn't set as described above.\n\n## Publishing the Helm Chart\n\nThe helm chart may also be published to a helm repository for use by others as a top-level\napplication deployment or as a dependency sub-chart. For example, we can publish the chart\nvia the GitHub Container Registry (ghcr.io). For this to work, the chart must contain an annotation\nto associate it with the correct repository - see this entry in [`Chart.yaml`](./helm/Chart.yaml):\n\n```yaml\n# OCI Annotations - see https://github.com/helm/helm/pull/11204\nsources:\n  - https://github.com/dataoneorg/dataone-indexer\n```\n\nThe chart must then be packaged:\n\n```shell\nhelm package ./helm\n```\n\n...which creates a zipped tar file named `dataone-indexer-{version}.tgz`, where `{version}` reflects\nthe chart `version` in [`Chart.yaml`](./helm/Chart.yaml). 
The chart can then be published to the\ncorrect ghcr.io URL, after logging in with a suitable GitHub PAT:\n\n(example assumes the chart version is 0.5.0)\n\n```shell\nGITHUB_PAT=\"your-own-secret-GitHub-Personal-Access-Token-goes-here\"\necho $GITHUB_PAT | helm registry login ghcr.io -u DataONEorg --password-stdin\nhelm push  dataone-indexer-0.5.0.tgz  oci://ghcr.io/dataoneorg/charts\n```\n\nNOTE the use of **charts** in the OCI URL, in order to distinguish helm charts from docker images.\n\n## Deploying the application via Helm\n\nHelm provides a simple mechanism to install all application dependencies and configure the\napplication in a single command. To deploy using helm to a release named `d1index` and also in a\nnamespace named `d1index`, and then view the deployed pods and services, use a sequence like:\n\n```shell\nkubectl create namespace d1index\n\n# to use the local files\nhelm install -n d1index d1index ./helm\n\n# or to pull the packaged chart with a specific release version (e.g. 1.2.0)...\nhelm install -n d1index d1index oci://ghcr.io/dataoneorg/charts/dataone-indexer --version 1.2.0\n\nkubectl -n d1index get all\n```\n\nTo uninstall the helm release, use:\n\n```shell\nhelm -n d1index uninstall d1index\n```\n\nNote that this helm chart also installs rabbitmq and solr, which can be partially configured\nvia exported child properties in the parent chart's `values.yaml` file.\n\n\u003e [!IMPORTANT]\n\u003e Make sure the RabbitMQ queue is empty before upgrading or installing a new chart version for the\n\u003e first time, because each new chart version will create a new PV/PVC where the queue is stored.\n\u003e This can be overridden by setting `.Values.rabbitmq.nameOverride` to the same name as the previous\n\u003e version, but this is NOT recommended, since the RabbitMQ installation then becomes an upgrade\n\u003e instead of a fresh install, and may require some manual intervention.\n\n### Authentication Notes\n\n#### DataONE Authentication Token\n\nIn order to access and index private datasets on a Metacat 
instance, the dataone-indexer needs an\nauthentication token, which may be obtained from DataONE administrators (see the [Metacat Helm\nREADME](https://github.com/NCEAS/metacat/blob/develop/helm/README.md#setting-up-a-token-and-optional-ca-certificate-for-indexer-access)).\nUpon startup, the indexer expects to find a Kubernetes Secret named\n`{{ .Release.Name }}-indexer-token`, which contains the auth token associated with the key\n`DataONEauthToken`. The indexer can operate without this Secret, but will only be able to index\npublic-readable datasets.\n\n#### RabbitMQ\n\nThe rabbitmq service runs under the username and password that are set via `values.yaml`:\n\n```yaml\nrabbitmq:\n  auth:\n    username: rmq\n    existingPasswordSecret: \"\"      ## (must contain key: `rabbitmq-password`)\n```\n\n...where `existingPasswordSecret` is the name of a Kubernetes secret that contains the password,\nidentified by a key named `rabbitmq-password`.\n\n\u003e **NOTE:** it appears that this information is cached\non a PersistentVolumeClaim that is created automatically by rabbitmq. Therefore, if the credentials\nare changed in `values.yaml` and/or the secret, authentication will fail because they will conflict\nwith the cached values in the PVC. If you are just testing, the problem can be resolved by deleting\nthe PVC. In production, the PVC would also be used for maintaining durable queues, and so it may not\nbe reasonable to delete the PVC. You can get the name and identifiers of the PVCs with\n`kubectl -n d1index get pvc`.\n\n## Running the IndexWorker in the docker container\n\nThe docker image assumes that the deployment configuration file exists to configure endpoint\naddresses and credentials. To run the indexer, ensure that the `DATAONE_INDEXER_CONFIG` variable is set in the\nenvironment and contains the absolute path to the configuration file for the indexer. 
This path must\nbe accessible in the container, so you will likely want to mount a volume to provide the edited\nproperties file. You can then run it using a command like:\n\n```shell\nnerdctl run -it \\\n    -e DATAONE_INDEXER_CONFIG=/var/lib/dataone-indexer/dataone-indexer.properties \\\n    -v `pwd`/helm/config/dataone-indexer.properties:/var/lib/dataone-indexer/dataone-indexer.properties \\\n    dataone-index-worker:2.4.0\n```\n\n## A Note on SOLR Authentication\n\nThe helm installation does not currently configure solr with authentication enabled, since the\nservice is not exposed outside the Kubernetes cluster. Mentions of logins in the following sections\ncan therefore be ignored. However, this should be changed to use authentication if connecting to a\nsolr instance outside the cluster.\n\n## Checking if SOLR is configured\n\nLogging in using the `SOLR_AUTHENTICATION_OPTS` and `SOLR_AUTH_TYPE` environment variables (if applicable)\nallows the `solr` command to be executed to check the server status:\n\n```shell\n$ export SOLR_AUTH_TYPE=basic\n$ export SOLR_AUTHENTICATION_OPTS=\"-Dbasicauth=${SOLR_ADMIN_USERNAME}:${SOLR_ADMIN_PASSWORD}\"\n$ solr status -z ${SOLR_ZK_HOSTS} -c ${SOLR_COLLECTION}\n\nFound 1 Solr nodes:\n\nSolr process 8 running on port 8983\n{\n  \"solr_home\":\"/opt/bitnami/solr/server/solr\",\n  \"version\":\"9.0.0 a4eb7aa123dc53f8dac74d80b66a490f2d6b4a26 - janhoy - 2022-05-05 01:00:08\",\n  \"startTime\":\"2022-10-11T07:08:50.155Z\",\n  \"uptime\":\"0 days, 0 hours, 21 minutes, 52 seconds\",\n  \"memory\":\"70.9 MB (%13.8) of 512 MB\",\n  \"cloud\":{\n    \"ZooKeeper\":\"d1index-zookeeper:2181/solr\",\n    \"liveNodes\":\"3\",\n    \"collections\":\"1\"}}\n```\n\n## SOLR Dashboard\n\nOnce the SOLR server is up and running, connect to the SOLR Dashboard by forwarding a local port to\nthe solr service, and then browse to the local address:\n\n```shell\nkubectl port-forward -n d1index service/d1index-solr 8983:8983 \u0026 echo \"Solr URL: 127.0.0.1:8983/solr/\"\n```\n\nYou'll need to 
log in with the helm-configured SOLR admin user and password, if applicable.\n\nOnce the port forward is set up, you can also run API calls from the [ConfigSet API](https://solr.apache.org/guide/6_6/configsets-api.html) and\n[Collections API](https://solr.apache.org/guide/6_6/collections-api.html).\n\n```shell\ncurl -u ${SOLR_ADMIN_USERNAME}:${SOLR_ADMIN_PASSWORD} http://localhost:8983/solr/admin/configs?action=CREATE\\\u0026name=dataone-index --header \"Content-Type:text/xml\" -X POST -d @dataone-index.zip\n{\n  \"responseHeader\":{\n    \"status\":0,\n    \"QTime\":5974}}\ncurl -u ${SOLR_ADMIN_USERNAME}:${SOLR_ADMIN_PASSWORD} http://localhost:8983/solr/admin/configs?action=list\ncurl -u ${SOLR_ADMIN_USERNAME}:${SOLR_ADMIN_PASSWORD} http://localhost:8983/solr/admin/collections?action=list\n```\n\n### Admin tools for RabbitMQ\n\nOnce RabbitMQ is configured, the web console can be accessed by port-forwarding (this example assumes the `d1index` namespace used above):\n\n```shell\nkubectl -n d1index port-forward pod/d1index-rabbitmq-0 15672:15672 \u0026\n```\n\nThen log in to the RabbitMQ web console: http://localhost:15672\n\nYou can also download a copy of `rabbitmqadmin` from http://localhost:15672/cli/rabbitmqadmin,\nand the `rabbitmqadmin` command can be used to interact with the server. 
First, you need to set up a\nconfig file for `rabbitmqadmin` that provides some default values:\n\n```shell\n$ cat rmq.conf\n[default]\nhostname = d1index-rabbitmq-headless\nport = 15672\nusername = rmq\npassword = your-client-pw-here\ndeclare_vhost = / # Used as default for declare / delete only\nvhost = /         # Used as default for declare / delete / list\n```\n\n- List exchanges and queues\n    - `rabbitmqadmin -c rmq.conf -N default -U rmq -p $RMQPW list exchanges --vhost=/`\n    - `rabbitmqadmin -c rmq.conf -N default -U rmq -p $RMQPW list queues --vhost=/`\n- Declare exchanges, queues, and bindings\n    - `rabbitmqadmin -c rmq.conf -N default declare exchange name=testexchange type=direct -U rmq -p $RMQPW --vhost=/`\n    - `rabbitmqadmin -c rmq.conf -N default declare queue name=testqueue type=direct -U rmq -p $RMQPW --vhost=/`\n    - `rabbitmqadmin -c rmq.conf -N default -U rmq -p $RMQPW declare binding source=testexchange destination=testqueue routing_key=testqueue --vhost=/`\n- Publish a batch of messages to a queue\n\n```shell\nfor n in $(seq 1 30); do echo $n; rabbitmqadmin -c rmq.conf -N default -U rmq -p $RMQPW publish exchange=testexchange routing_key=testqueue payload=\"Message: ${n}\" --vhost=/; done\n```\n\n## Switching the Storage System\n\nThe DataONE Indexer can be configured to use different storage systems by setting the environment\nvariable `DATAONE_INDEXER_OBJECT_MANAGER_CLASS_NAME`.\nBy default, this variable is not set, and the indexer uses\n`org.dataone.cn.indexer.object.hashstore.HashStoreObjManager`, which enables support for Hashstore.\nTo use the legacy storage system instead, set the variable to\n`org.dataone.cn.indexer.object.legacystore.LegacyStoreObjManager`.\n\n## History\n\nThis is a refactored version of the original DataONE [d1_cn_index_processor](https://github.com/DataONEorg/d1_cn_index_processor) that runs\ncompletely independently of other DataONE Coordinating Node services. 
It is intended to be deployed\nin a Kubernetes cluster environment, but is written such that it can also be deployed in other\nenvironments as needed.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataoneorg%2Fdataone-indexer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdataoneorg%2Fdataone-indexer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataoneorg%2Fdataone-indexer/lists"}