{"id":19407435,"url":"https://github.com/observingclouds/tape_archive_index","last_synced_at":"2025-04-24T09:31:34.984Z","repository":{"id":58040348,"uuid":"521453667","full_name":"observingClouds/tape_archive_index","owner":"observingClouds","description":"Collection of reference files of data archived on 📼 at DKRZ","archived":false,"fork":false,"pushed_at":"2024-03-15T21:08:24.000Z","size":61823,"stargazers_count":3,"open_issues_count":1,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-03T02:22:55.065Z","etag":null,"topics":["car","ipfs","tape-archive","tar","zarr"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/observingClouds.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-08-05T00:25:22.000Z","updated_at":"2023-06-01T22:13:15.000Z","dependencies_parsed_at":"2023-02-19T04:20:26.485Z","dependency_job_id":"a4acbcfb-b8b5-4f3e-ac5d-62414c26b2f4","html_url":"https://github.com/observingClouds/tape_archive_index","commit_stats":{"total_commits":150,"total_committers":2,"mean_commits":75.0,"dds":0.00666666666666671,"last_synced_commit":"6eab85cb804b76d9664069973cdf2b26017cb214"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/observingClouds%2Ftape_archive_index","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/observingClouds%2Ftape_archive_index/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/observingClouds%2Ftape_archive_index/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/observingClouds%2Ftape_archive_index/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/observingClouds","download_url":"https://codeload.github.com/observingClouds/tape_archive_index/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250600704,"owners_count":21457012,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["car","ipfs","tape-archive","tar","zarr"],"created_at":"2024-11-10T11:47:15.740Z","updated_at":"2025-04-24T09:31:34.712Z","avatar_url":"https://github.com/observingClouds.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Collection of references to data archived on 📼 at DKRZ\n![Check](https://github.com/observingclouds/tape_archive_index/actions/workflows/test.yml/badge.svg) [![Reference files](https://img.shields.io/badge/reference%20files-10.5281%2Fzenodo.7017188-blue)](https://doi.org/10.5281/zenodo.7017188)\n\nRepository containing parquet-reference files to `zarr`-files packed as `car` and `tar`-collections and stored on tape.\n\n## Application\nThis repository contains a look-up table of CIDs of data that is saved on the DKRZ tape archive. 
A user who is interested in working with the data behind a specific CID should however first try to get the content via the IPFS network or the resources given in the [EUREC4A-Intake catalog](https://github.com/eurec4a/eurec4a-intake) (currently only EUREC4A simulations are referenced here). If the dataset cannot be found, the steps described below can be followed to retrieve the data from the tape archive (access rights necessary).\n\nThis repository also offers the possibility to integrate the tape archive into the IPFS network by providing the interface between the content identifiers and the archives on tape that would need to be loaded onto an IPFS node.\n\n## Intake catalog\n\n### Setup\n```python\nslk_cache = \"/scratch/m/m300408/retrieval/\" # define slk cache directory\ncatalog = \"https://raw.githubusercontent.com/observingClouds/tape_archive_index/main/catalog.yml\"\n\nimport os\nos.environ[\"SLK_CACHE\"] = slk_cache \n```\n\n### Open catalog with all available/indexed datasets\n```python\nfrom intake import open_catalog\ncat=open_catalog(catalog)\nsorted(list(cat))\n```\n\n```python\n['EUREC4A_ICON-LES_control_DOM01_radiation_native',\n 'EUREC4A_ICON-LES_control_DOM01_reff_native',\n 'EUREC4A_ICON-LES_control_DOM01_surface_native',\n 'EUREC4A_ICON-LES_control_DOM02_3D_native.qr+cloud_num+coords',\n 'EUREC4A_ICON-LES_control_DOM02_reff_native',\n 'EUREC4A_ICON-LES_control_DOM02_surface_native',\n...]\n```\n\n### Select dataset of interest\n```python\nds=cat[\"EUREC4A_ICON-LES_control_DOM01_surface_native\"].to_dask()\n```\nThe required files for any computations will be retrieved from tape when needed and cached locally.\n\nNote: the package [`slkspec`](https://github.com/observingClouds/slkspec) needs to be installed in addition to the general intake requirements.\n\n## Downloading the archived files manually\nAnother option is to download the archived files manually from tape. 
This is currently the preferred option if large portions of the dataset are needed, because retrievals are grouped together more efficiently.\n\n1. Get the Content-Identifier (CID) of the data of interest\n    - e.g. from a source like the eurec4a-intake catalog, or\n    - open archived_cids.json and copy the CID of interest\n    ```python\n    cid = \"bafybeibk4i64g6vku2rk4ap5wrrw2b3ryrr3n274vris5dmo25vuf4k3pu\"\n    ```\n2. Load the corresponding reference file\n    ```python\n    import json\n    import pandas as pd\n    with open(\"archived_cids.json\") as f:\n        cids = json.load(f)\n    metadata = cids[cid]\n    references = pd.read_parquet(metadata[\"preffs\"])\n    ```\n3. Get a list of the referenced files that contain the actual data\n    In most cases these will be `car` files\n    ```python\n    files_to_retrieve = pd.unique(references.path)\n    files_to_retrieve = [f for f in files_to_retrieve if isinstance(f,str)]\n    ```\n4. Retrieve files from tape\n    Note that the following steps cannot be run on the login node, so another partition has to be chosen with e.g. 
`salloc --partition=interactive --mem=6GB --nodes=1 --time=02:00:00 --account \u003cACCOUNT\u003e`\n    ```python\n    import re\n    import subprocess\n    \n    target_dir = \"/scratch/m/mXXXXXX/\"\n    path_on_tape = metadata[\"tape_archive_prefix\"]\n\n    def create_search_pattern(files):\n        \"\"\"Create simple regexp from given list of files\n\n        \u003e\u003e\u003e files = ['file001.txt', 'file002.txt', 'file100.txt']\n        \u003e\u003e\u003e create_search_pattern(files)\n        'file001.txt|file002.txt|file100.txt'\n        \"\"\"\n        if isinstance(files, str):\n            return files\n        else:\n            return '|'.join(files)\n    \n    def search(path_on_tape, regex):\n        \"\"\"Search for given regex on tape and return search id\n        \"\"\"\n        search_instruction = '{\"$and\":[{\"path\":{\"$gte\":\"'+path_on_tape+'\",\"$max_depth\":1}},{\"resources.name\":{\"$regex\":\"'+regex+'\"}}]}'\n        result = subprocess.check_output(f\"module load slk; slk search '{search_instruction}'\", shell=True).decode()\n        id_idx = result.find('Search ID:')\n        search_id = int(''.join(re.findall(r\"[0-9]\", result[id_idx:])))\n        return search_id\n\n    def ensure_preferred_sharding(dir):\n        \"\"\"Ensure preferred sharding of target directory is set\n        \"\"\"\n        subprocess.call(f\"lfs setstripe -E 1G -c 1 -S 1M -E 4G -c 4 -S 1M -E -1 -c 8 -S 1M {dir}\", shell=True)\n    \n    regex = create_search_pattern(files_to_retrieve)\n    search_id = search(path_on_tape,regex)\n    ensure_preferred_sharding(target_dir)\n    \n    # shell=True is required because the command is passed as a single string\n    subprocess.check_output(f\"module load slk; slk retrieve -s {search_id} {target_dir}\", shell=True)\n    ```\n\n5. 
Open the reference filesystem\n    ```python\n    import xarray as xr\n    storage_options = {\"preffs\":{\"prefix\":\"/path/to/directory/with/car/files/\"}}\n    ds = xr.open_zarr(f\"preffs::{metadata['preffs']}\", storage_options=storage_options)\n    ```\n\n## Upload entry to Zenodo\nThe reference files are currently stored on Zenodo.\n\nTo upload a new file to Zenodo or update an existing one, please contact the maintainer of this repository by opening an issue or pull request. While the reference files can be uploaded to any server and any Zenodo repository, we try to keep them all in one place. If you have access to the Zenodo repository, you can find instructions on how to upload a new file to Zenodo [here](https://developers.zenodo.org). Basically, you need to\n1. Create an `ACCESS_TOKEN`\n2. Create a new version of the Zenodo dataset.\n3. Note the record number of the new version of the dataset, i.e. the last number in the url, e.g. `7485057` from https://zenodo.org/deposit/7485057\n4. Find out the bucket link: e.g. `curl https://zenodo.org/api/deposit/depositions/7485057?access_token=$ACCESS_TOKEN | jq '.links.bucket'`\n5. Upload the file(s) to the bucket link from step 4, e.g. with `curl --upload-file $LOCAL_FILENAME https://zenodo.org/api/files/25794c67-d85e-45a7-b3cf-032578603fa9/$REMOTE_FILENAME?access_token=$ACCESS_TOKEN`\n6. Publish the dataset and get the links to the newly added file(s). Note that these links differ from the bucket link used above for the upload.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fobservingclouds%2Ftape_archive_index","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fobservingclouds%2Ftape_archive_index","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fobservingclouds%2Ftape_archive_index/lists"}