{"id":22302592,"url":"https://github.com/dataoneorg/hashstore-java","last_synced_at":"2025-03-26T00:27:38.064Z","repository":{"id":143001565,"uuid":"614620618","full_name":"DataONEorg/hashstore-java","owner":"DataONEorg","description":"HashStore, a hash-based object store for DataONE data packages","archived":false,"fork":false,"pushed_at":"2024-11-21T18:44:16.000Z","size":1355,"stargazers_count":1,"open_issues_count":2,"forks_count":0,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-01-30T21:17:10.238Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DataONEorg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-16T00:56:34.000Z","updated_at":"2024-11-21T18:38:02.000Z","dependencies_parsed_at":null,"dependency_job_id":"d7aa8146-240a-4860-8a3c-b6cb14eb5502","html_url":"https://github.com/DataONEorg/hashstore-java","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataONEorg%2Fhashstore-java","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataONEorg%2Fhashstore-java/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataONEorg%2Fhashstore-java/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataONEorg%2Fhashstore-java/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owner
s/DataONEorg","download_url":"https://codeload.github.com/DataONEorg/hashstore-java/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245564797,"owners_count":20636179,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-03T18:39:58.774Z","updated_at":"2025-03-26T00:27:38.015Z","avatar_url":"https://github.com/DataONEorg.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"## HashStore-java: hash-based object storage for DataONE data packages\n\nVersion: 1.1.0\n- DOI: [doi:10.18739/A2ZG6G87Q](https://doi.org/10.18739/A2ZG6G87Q)\n\n## Contributors\n\n- **Author**: Dou Mok, Jing Tao, Matthew Brooke, Matthew B. Jones\n- **License**: [Apache 2](http://opensource.org/licenses/Apache-2.0)\n- [Package source code on GitHub](https://github.com/DataONEorg/hashstore-java)\n- [**Submit Bugs and feature requests**](https://github.com/DataONEorg/hashstore-java/issues)\n- Contact us: support@dataone.org\n- [DataONE discussions](https://github.com/DataONEorg/dataone/discussions)\n\n## Citation\n\nCite this software as:\n\n\u003e Dou Mok, Jing Tao, Matthew Brooke, Matthew B. Jones. 2024.\n\u003e HashStore-java: hash-based object storage for DataONE data packages. Arctic Data Center.\n\u003e [doi:10.18739/A2QF8JM59](https://doi.org/10.18739/A2QF8JM59)\n\n## Introduction\n\nHashStore-java is a server-side java library that implements an object storage file system for storing\nand accessing data and metadata for DataONE services. 
The package is used in DataONE system\ncomponents that need direct, filesystem-based access to data objects, their system metadata, and\nextended metadata about the objects. This package is a core component of\nthe [DataONE federation](https://dataone.org), and supports large-scale object storage for a variety\nof repositories, including the [KNB Data Repository](http://knb.ecoinformatics.org),\nthe [NSF Arctic Data Center](https://arcticdata.io/catalog/),\nthe [DataONE search service](https://search.dataone.org), and other repositories.\n\nDataONE in general, and HashStore in particular, are open source, community projects.\nWe [welcome contributions](https://github.com/DataONEorg/hashstore-java/blob/main/CONTRIBUTING.md)\nin many forms, including code, graphics, documentation, bug reports, testing, etc. Use\nthe [DataONE discussions](https://github.com/DataONEorg/dataone/discussions) to discuss these\ncontributions with us.\n\n## Documentation\n\nDocumentation is a work in progress, and can be found on\nthe [Metacat repository](https://github.com/NCEAS/metacat/blob/feature-1436-storage-and-indexing/docs/user/metacat/source/storage-subsystem.rst#physical-file-layout)\nas part of the storage redesign planning. Future updates will include documentation here as the\npackage matures.\n\n## HashStore Overview\n\nHashStore is an object storage system that provides persistent file-based storage using content\nhashes to de-duplicate data. 
The system stores objects, references (refs), and metadata in their\nrespective directories and uses an identifier-based API for interacting with the store.\nHashStore storage classes (like `FileHashStore`) must implement the HashStore interface to ensure\nthe expected usage of HashStore.\n\n### Public API Methods\n\n- storeObject\n- tagObject\n- storeMetadata\n- retrieveObject\n- retrieveMetadata\n- deleteObject\n- deleteIfInvalidObject\n- deleteMetadata\n- getHexDigest\n\nFor details, please see the HashStore interface [HashStore.java](https://github.com/DataONEorg/hashstore-java/blob/main/src/main/java/org/dataone/hashstore/HashStore.java).\n\n### How do I create a HashStore?\n\nTo create or interact with a HashStore, instantiate a HashStore object with the following set of\nproperties:\n\n- storePath\n- storeDepth\n- storeWidth\n- storeAlgorithm\n- storeMetadataNamespace\n\n```java\nString classPackage = \"org.dataone.hashstore.filehashstore.FileHashStore\";\nPath rootDirectory = tempFolder.resolve(\"metacat\");\n\nProperties storeProperties = new Properties();\nstoreProperties.setProperty(\"storePath\", rootDirectory.toString());\nstoreProperties.setProperty(\"storeDepth\", \"3\");\nstoreProperties.setProperty(\"storeWidth\", \"2\");\nstoreProperties.setProperty(\"storeAlgorithm\", \"SHA-256\");\nstoreProperties.setProperty(\n    \"storeMetadataNamespace\", \"https://ns.dataone.org/service/types/v2.0#SystemMetadata\"\n);\n\n// Instantiate a HashStore\nHashStore hashStore = HashStoreFactory.getHashStore(classPackage, storeProperties);\n\n// Store an object\nhashStore.storeObject(stream, pid);\n// ...\n```\n\n### What does HashStore look like?\n\n```sh\n# Example layout in HashStore with a single file stored along with its metadata and reference files.\n# This uses a store depth of 3 (number of nested levels/directories - e.g. '/4d/19/81/' within\n# 'objects', see below), with a width of 2 (number of characters used in directory name - e.g. 
\"4d\",\n# \"19\" etc.) and \"SHA-256\" as its default store algorithm\n## Notes:\n## - Objects are stored using their content identifier as the file address\n## - The reference file for each pid contains a single cid\n## - The reference file for each cid contains multiple pids each on its own line\n## - There are two metadata docs under the metadata directory for the pid (sysmeta, annotations)\n\n.../metacat/hashstore\n├── hashstore.yaml\n└── objects\n|   └── 4d\n|       └── 19\n|           └── 81\n|               └── 71eef969d553d4c9537b1811a7b078f9a3804fc978a761bc014c05972c\n└── metadata\n|   └── 0d\n|       └── 55\n|           └── 55\n|               └── 5ed77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e\n|                   └── 323e0799524cec4c7e14d31289cefd884b563b5c052f154a066de5ec1e477da7\n|                   └── sha256(pid+formatId_annotations)\n└── refs\n    ├── cids\n    |   └── 4d\n    |       └── 19\n    |           └── 81\n    |               └── 71eef969d553d4c9537b1811a7b078f9a3804fc978a761bc014c05972c\n    └── pids\n        └── 0d\n            └── 55\n                └── 55\n                    └── 5ed77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e\n```\n\n### Working with objects (store, retrieve, delete)\n\nIn HashStore, objects are first saved as temporary files while their content identifiers are\ncalculated. Once the default hash algorithm list and their hashes are generated, objects are stored\nin their permanent location using the store's algorithm's corresponding hash value, the store depth\nand the store width. Lastly, objects are 'tagged' with a given identifier (ex. persistent\nidentifier (pid)). This process produces reference files, which allow objects to be found and\nretrieved with a given identifier.\n- Note 1: An identifier can only be used once\n- Note 2: Each object is stored once and only once using its content identifier (a checksum generated\n  from using a hashing algorithm). 
Clients that attempt to store duplicate objects will receive\n  the expected ObjectMetadata - with HashStore handling the de-duplication process under the hood.\n\nBy calling the various interface methods for `storeObject`, the calling app/client can validate,\nstore and tag an object simultaneously if the relevant data is available. In the absence of an\nidentifier (ex. persistent identifier (pid)), `storeObject` can be called to solely store an object.\nThe client is then expected to call `deleteIfInvalidObject` when the relevant metadata is available to\nconfirm that the object is what is expected. To finalize the process (to make the object\ndiscoverable), the client calls `tagObject`. In summary, there are two expected paths to store an\nobject:\n\n```java\n// All-in-one process which stores, validates and tags an object\nObjectMetadata objInfo = storeObject(InputStream, pid, additionalAlgorithm, checksum, checksumAlgorithm, objSize);\n\n// Manual Process\n// Store object\nObjectMetadata objInfo = storeObject(InputStream);\n// Validate object, if the parameters do not match, the data object associated with the objInfo\n// supplied will be deleted\ndeleteIfInvalidObject(objInfo, checksum, checksumAlgorithm, objSize);\n// Tag object, makes the object discoverable (find, retrieve, delete)\ntagObject(pid, cid);\n```\n\n**How do I retrieve an object if I have the pid?**\n\n- To retrieve an object, call the Public API method `retrieveObject`, which opens a stream to the\n  object if it exists.\n\n**How do I delete an object if I have the pid?**\n\n- To delete an object and all its associated reference files, call the Public API\n  method `deleteObject()`.\n- Note, `deleteObject` and `storeObject` are synchronized processes based on a given `pid`.\n  Additionally, `deleteObject` further synchronizes with `tagObject` based on a `cid`. 
Every\n  object is stored once, is unique and shares one cid reference file.\n\n### Working with metadata (store, retrieve, delete)\n\nHashStore's '/metadata' directory holds all metadata for objects stored in HashStore. All metadata\ndocuments related to a 'pid' are stored in a directory determined by calculating the hash of the\npid (based on the store's algorithm). Each specific metadata document is then stored by calculating\nthe hash of its associated `pid+formatId`. By default, calling `storeMetadata` will use HashStore's\ndefault metadata namespace as the 'formatId' when storing metadata. Should the calling app wish to\nstore multiple metadata files about an object, the client app is expected to provide a 'formatId'\nthat represents an object format for the metadata type (ex. `storeMetadata(stream, pid, formatId)`).\n\n**How do I retrieve a metadata file?**\n\n- To find a metadata object, call the Public API method `retrieveMetadata` which returns a stream to\n  the metadata file that's been stored with the default metadata namespace if it exists.\n- If there are multiple metadata objects, a 'formatId' must be specified when\n  calling `retrieveMetadata` (ex. `retrieveMetadata(pid, formatId)`)\n\n**How do I delete a metadata file?**\n\n- Like `retrieveMetadata`, call the Public API method `deleteMetadata(String pid, String formatId)`\n  which will delete the metadata object associated with the given pid.\n- To delete all metadata objects related to a given 'pid', call `deleteMetadata(String pid)`\n\n### What are HashStore reference files?\n\nHashStore assumes that every object to store has a respective identifier. This identifier is then\nused when storing, retrieving and deleting an object. 
To facilitate this process, we create\ntwo types of reference files:\n\n- pid (persistent identifier) reference files\n- cid (content identifier) reference files\n\nThese reference files are implemented in HashStore under the hood with no expectation for\nmodification from the calling app/client. The one and only exception to this process is when the\ncalling client/app does not have an identifier, and solely stores an object's raw bytes in\nHashStore (calling `storeObject(InputStream)`).\n\n**'pid' Reference Files**\n\n- Pid (persistent identifier) reference files are created when storing an object with an identifier.\n- Pid reference files are located in HashStore's '/refs/pids' directory\n- If an identifier is not available at the time of storing an object, the calling app/client must\n  create this association between a pid and the object it represents by calling `tagObject`\n  separately.\n- Each pid reference file contains a string that represents the content identifier of the object it\n  references\n- Just as objects are stored once and only once, there is also only one pid reference file for each\n  object.\n\n**'cid' Reference Files**\n\n- Cid (content identifier) reference files are created at the same time as pid reference files when\n  storing an object with an identifier.\n- Cid reference files are located in HashStore's '/refs/cids' directory\n- A cid reference file is a list of all the pids that reference a cid, delimited by a new line (\"\\n\")\n  character\n\n## Development Build\n\nHashStore is a Java package, and is built using the [Maven](https://maven.apache.org/) build tool.\n\nTo install `HashStore-java` locally, install Java and Maven on your local machine,\nand then install or build the package with `mvn install` or `mvn package`, respectively.\n\nWe also maintain a\nparallel [Python-based version of HashStore](https://github.com/DataONEorg/hashstore).\n\n## HashStore HashStoreClient Usage\n\n```sh\n\n# Step 1: Get HashStore Jar 
file\n$ mvn clean package -Dmaven.test.skip=true\n\n# Get help\n$ java -cp ./target/hashstore-1.1.0-shaded.jar org.dataone.hashstore.HashStoreClient -h\n\n# Step 2: Determine where your hashstore should live (ex. `/var/hashstore`)\n## Create a HashStore (long option)\n$ java -cp ./target/hashstore-1.1.0-shaded.jar org.dataone.hashstore.HashStoreClient --createhashstore --storepath=/path/to/store --storedepth=3 --storewidth=2 --storealgo=SHA-256 --storenamespace=https://ns.dataone.org/service/types/v2.0#SystemMetadata\n\n## Create a HashStore (short option)\n$ java -cp ./target/hashstore-1.1.0-shaded.jar org.dataone.hashstore.HashStoreClient -chs -store /path/to/store -dp 3 -wp 2 -ap SHA-256 -nsp https://ns.dataone.org/service/types/v2.0#SystemMetadata\n\n# Get the checksum of a data object\n$ java -cp ./target/hashstore-1.1.0-shaded.jar org.dataone.hashstore.HashStoreClient -store /path/to/store -getchecksum -pid testpid1 -algo SHA-256\n\n# Store a data object\n$ java -cp ./target/hashstore-1.1.0-shaded.jar org.dataone.hashstore.HashStoreClient -store /path/to/store -storeobject -path /path/to/data.ext -pid testpid1\n\n# Store a metadata object\n$ java -cp ./target/hashstore-1.1.0-shaded.jar org.dataone.hashstore.HashStoreClient -store /path/to/store -storemetadata -path /path/to/metadata.ext -pid testpid1 -format_id https://ns.dataone.org/service/types/v2.0#SystemMetadata\n\n# Retrieve a data object\n$ java -cp ./target/hashstore-1.1.0-shaded.jar org.dataone.hashstore.HashStoreClient -store /path/to/store -retrieveobject -pid testpid1\n\n# Retrieve a metadata object\n$ java -cp ./target/hashstore-1.1.0-shaded.jar org.dataone.hashstore.HashStoreClient -store /path/to/store -retrievemetadata -pid testpid1 -format_id https://ns.dataone.org/service/types/v2.0#SystemMetadata\n\n# Delete a data object\n$ java -cp ./target/hashstore-1.1.0-shaded.jar org.dataone.hashstore.HashStoreClient -store /path/to/store -deleteobject -pid testpid1\n\n# Delete a metadata file\n$ java 
-cp ./target/hashstore-1.1.0-shaded.jar org.dataone.hashstore.HashStoreClient -store /path/to/store -deletemetadata -pid testpid1 -format_id https://ns.dataone.org/service/types/v2.0#SystemMetadata\n```\n\n## License\n\n```txt\nCopyright [2023] [Regents of the University of California]\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\nhttp://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n```\n\n## Acknowledgements\n\nWork on this package was supported by:\n\n- DataONE Network\n- Arctic Data Center: NSF-PLR grant #2042102 to M. B. Jones, A. Budden, M. Schildhauer, and J.\n  Dozier\n\nAdditional support was provided for collaboration by the National Center for Ecological Analysis and\nSynthesis, a Center funded by the University of California, Santa Barbara, and the State of\nCalifornia.\n\n[![DataONE_footer](https://user-images.githubusercontent.com/6643222/162324180-b5cf0f5f-ae7a-4ca6-87c3-9733a2590634.png)](https://dataone.org)\n\n[![nceas_footer](https://www.nceas.ucsb.edu/sites/default/files/2020-03/NCEAS-full%20logo-4C.png)](https://www.nceas.ucsb.edu)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataoneorg%2Fhashstore-java","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdataoneorg%2Fhashstore-java","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataoneorg%2Fhashstore-java/lists"}