Tar reference filesystem, e.g. tar Zarr-files and still access them with ease
- Host: GitHub
- URL: https://github.com/observingclouds/tar_referencer
- Owner: observingClouds
- License: apache-2.0
- Created: 2022-10-29T10:53:40.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2023-02-22T19:28:22.000Z (almost 2 years ago)
- Last Synced: 2024-01-27T15:06:02.990Z (11 months ago)
- Topics: archiving, tar, zarr
- Language: Python
- Homepage:
- Size: 27.3 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.rst
- License: LICENSE
README
# Converting tar archives into a reference filesystem
Zarr datasets can challenge the metadata servers of HPC systems because they consist of millions of small files.
One way to circumvent this challenge is to collect all files in a file container, e.g. in tar files,
and to create a look-up table of the byte ranges at which the content of each file is stored within
the container. Tar-ing zarr files also makes it easy to store and reuse data on tape archives.

*tar_referencer* creates these look-up tables, which can be used with the [preffs package](https://github.com/d70-t/preffs).
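As a rough sketch of the idea (not *tar_referencer*'s actual implementation), Python's standard `tarfile` module already exposes the information such a look-up table records, namely where each member's data starts and how long it is:

```python
import tarfile

# Minimal sketch: list the byte range of every regular file in a tar.
# "dataset.tar" is a placeholder name; TarInfo.offset_data is the offset
# at which a member's content starts and TarInfo.size its length in bytes.
with tarfile.open("dataset.tar") as tar:
    for member in tar.getmembers():
        if member.isfile():
            print(member.name, member.offset_data, member.size)
```
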
## Usage
The package can be installed with:
```
pip install git+https://github.com/observingClouds/tar_referencer.git
```

The look-up files (parquet reference files) are created with:
```
tar_referencer -t file.*.tar -p file_index.preffs
```
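
Since the look-up file is plain parquet, it can be inspected directly, e.g. with pandas; the exact column layout is defined by the [preffs package](https://github.com/d70-t/preffs):

```python
import pandas as pd

# Peek at the reference table; each row maps a file packed into the tars
# to its location within them (schema as defined by preffs).
df = pd.read_parquet("file_index.preffs")
print(df.head())
```
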
If zarr files have been packed into tars and indexed with *tar_referencer*, the tars can be opened with:

```python
import xarray as xr
storage_options={"preffs":{"prefix":/path/to/tar/files/"}}
ds = xr.open_zarr("preffs::file_index.preffs", storage_options=storage_options)
```
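
The dataset is opened lazily, so only the byte ranges of the chunks that are actually requested are read from the tar files. A small sketch, with `var1` standing in for a real variable name:

```python
# "var1" is a hypothetical variable name; computing its mean reads only
# the byte ranges of the chunks backing this variable from the tars.
var1_mean = ds["var1"].mean().compute()
```
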
### Creating tar files

In principle, all sorts of tar files can be referenced. However, *tar_referencer* currently only supports tar files that are split at the file level; tar files that are split within a header or data block are not supported.
> **Warning**
> This does not work:
> ```
> tar -cvf - big.tar | split --bytes=32000m --suffix-length=3 --numeric-suffix - part%03d.tar
> ```

To generate compatible tar files from zarr files or other directory structures, *tar_referencer* provides `tar_creator`:
```
tar_creator -i dataset.zarr -t dataset_part{:03d}.tar -s MAX_SIZE_BYTES
```
where `MAX_SIZE_BYTES` is the maximum size of a tar file before further output is written to an additional archive.

To split already existing tar files, [Splitar](https://github.com/monoid/splitar) has been tested successfully:
```
splitar -S 32000m big.tar part.tar-
```
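
Because compatible archives are split at the file level, every part should be a self-contained tar. A quick sanity check (a sketch, assuming the part naming used above) is to open each part on its own:

```python
import glob
import tarfile

# Each file-level split part should open and list cleanly by itself;
# parts produced by a byte-level split generally fail to parse at all.
for part in sorted(glob.glob("dataset_part*.tar")):
    with tarfile.open(part) as tar:
        print(part, len(tar.getnames()), "members")
```
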
## Tips and tricks

For very big zarr datasets, especially those that contain several variables, it can be advisable to pack each variable subfolder of the zarr file into its own set of tars. The benefit of this approach is that only those tars that actually contain the variable of interest need to be downloaded or retrieved. For each of these sets a separate look-up table can be generated; the tables can then be merged into an overarching look-up table covering the entire dataset:

```python
import pandas as pd

# Read the per-variable look-up tables created above.
df_coords = pd.read_parquet("file_index.coords.preffs")
df_var1 = pd.read_parquet("file_index.var1.preffs")
df_var2 = pd.read_parquet("file_index.var2.preffs")

# Concatenate the individual reference tables and restore the key order.
df_entire_dataset = pd.concat([df_coords, df_var1, df_var2]).sort_index()
df_entire_dataset.to_parquet("entire_dataset.preffs")
```
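
The merged look-up table can then be opened just like the single index above, with the prefix again pointing at the directory that contains the tar files:

```python
import xarray as xr

# Open the merged reference file covering the entire dataset.
storage_options = {"preffs": {"prefix": "/path/to/tar/files/"}}
ds = xr.open_zarr("preffs::entire_dataset.preffs", storage_options=storage_options)
```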