# [cached-path](https://cached-path.readthedocs.io/)

A file utility library that provides a unified, simple interface for accessing both local and remote files.
It can sit behind other APIs that need to access files regardless of where they are located.



## Quick links

- [Documentation](https://cached-path.readthedocs.io/)
- [PyPI Package](https://pypi.org/project/cached-path/)
- [Contributing](https://github.com/allenai/cached_path/blob/main/CONTRIBUTING.md)
- [License](https://github.com/allenai/cached_path/blob/main/LICENSE)

## Installation

**cached-path** requires Python 3.7 or later.

### Installing with `pip`

**cached-path** is available [on PyPI](https://pypi.org/project/cached-path/). To install it, run:

```bash
pip install cached-path
```

### Installing from source

To install **cached-path** from source, first clone [the repository](https://github.com/allenai/cached_path):

```bash
git clone https://github.com/allenai/cached_path.git
cd cached_path
```

Then run:

```bash
pip install -e .
```

## Usage

```python
from cached_path import cached_path
```

Given something that might be a URL or local path, `cached_path()` determines which.
If it's a remote resource, it downloads the file and caches it to the [cache directory](#cache-directory), and
then returns the path to the cached file. If it's already a local path,
it makes sure the file exists and returns the path.

For URLs, `http://`, `https://`, `s3://` (AWS S3), `gs://` (Google Cloud Storage), and `hf://` (Hugging Face Hub) are all supported out of the box.
`beaker://` URLs of the form `beaker://{user_name}/{dataset_name}/{file_path}` are optionally supported as well; this requires [beaker-py](https://beaker-py-docs.allen.ai) to be installed.

For example, to download the PyTorch weights for the model `epwalsh/bert-xsmall-dummy`
on Hugging Face, you could do:

```python
cached_path("hf://epwalsh/bert-xsmall-dummy/pytorch_model.bin")
```
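
The local-versus-remote decision described above can be illustrated with a minimal standard-library sketch. It is not the library's implementation: `classify_resource` and `REMOTE_SCHEMES` are hypothetical names, and the sketch only inspects the URL scheme without touching the network or the file system.

```python
from urllib.parse import urlparse

# Schemes listed in this README; the set itself is an illustrative assumption.
REMOTE_SCHEMES = {"http", "https", "s3", "gs", "hf", "beaker"}


def classify_resource(url_or_filename: str) -> str:
    """Return "remote" for a supported URL scheme, otherwise "local"."""
    scheme = urlparse(url_or_filename).scheme.lower()
    return "remote" if scheme in REMOTE_SCHEMES else "local"


print(classify_resource("hf://epwalsh/bert-xsmall-dummy/pytorch_model.bin"))  # remote
print(classify_resource("/data/model.tar.gz"))  # local
```

Anything without one of the listed schemes is treated as a local path in this sketch, which is why a plain file name falls through to `"local"`.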

For paths or URLs that point to a tar or zip archive, you can also append the path
of a specific file inside the archive to `url_or_filename`, preceded by a "!".
Provided you set `extract_archive` to `True`, the archive is extracted automatically
and the local path to that specific file is returned. For example:

```python
cached_path("model.tar.gz!weights.th", extract_archive=True)
```
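
To make the "!" convention concrete, here is a small sketch of how such a string could be split into the archive location and the path of the file inside it. `split_archive_spec` is a hypothetical helper, not part of **cached-path**, and splitting on the first "!" is an assumption of this sketch.

```python
from typing import Optional, Tuple


def split_archive_spec(url_or_filename: str) -> Tuple[str, Optional[str]]:
    """Split "archive!member" into (archive, member); member is None without "!"."""
    archive, sep, member = url_or_filename.partition("!")
    return (archive, member) if sep else (url_or_filename, None)


print(split_archive_spec("model.tar.gz!weights.th"))  # ('model.tar.gz', 'weights.th')
print(split_archive_spec("model.tar.gz"))  # ('model.tar.gz', None)
```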

### Using custom headers for HTTP requests

You can provide custom headers for HTTP requests, which is useful for accessing private resources that require authentication:

```python
import os

# Using an API token for private resources (e.g. Hugging Face)
hf_token = os.environ["HF_TOKEN"]  # assumption: the token is stored in this environment variable
headers = {"Authorization": f"Bearer {hf_token}"}
cached_path("https://huggingface.co/api/models/private-model/resolve/main/model.bin", headers=headers)
```

This works with services such as Hugging Face, GitHub, or any other API that uses Bearer token authentication.
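
As a rough illustration of what a Bearer header looks like on an outgoing request, the standard-library sketch below builds (but does not send) a request object. It is independent of **cached-path**; `build_request`, the URL, and the token value are placeholders assumed for the example.

```python
from urllib.request import Request


def build_request(url: str, token: str) -> Request:
    """Attach a Bearer token to a request without sending it."""
    headers = {"Authorization": f"Bearer {token}"}
    return Request(url, headers=headers)


request = build_request("https://example.com/private/model.bin", "my-secret-token")
print(request.get_header("Authorization"))  # Bearer my-secret-token
```

In real use, read the token from an environment variable or a secrets store rather than hard-coding it.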

### Cache directory

By default, the cache directory is `~/.cache/cached_path/`. There are several ways to override this setting:
- set the environment variable `CACHED_PATH_CACHE_ROOT`,
- call `set_cache_dir()`, or
- set the `cache_dir` argument each time you call `cached_path()`.
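
The three override mechanisms can be pictured with the following sketch. The precedence shown (explicit argument, then a programmatically set directory, then the environment variable, then the default) is an assumption made for illustration, and `resolve_cache_dir` and `_configured_dir` are hypothetical names rather than part of the library.

```python
import os
from pathlib import Path
from typing import Optional, Union

DEFAULT_CACHE_DIR = Path.home() / ".cache" / "cached_path"
_configured_dir: Optional[Path] = None  # stands in for a value set via set_cache_dir()


def resolve_cache_dir(cache_dir: Optional[Union[str, Path]] = None) -> Path:
    """Pick the cache directory, assuming argument > setter > env var > default."""
    if cache_dir is not None:
        return Path(cache_dir)
    if _configured_dir is not None:
        return _configured_dir
    env_value = os.environ.get("CACHED_PATH_CACHE_ROOT")
    return Path(env_value) if env_value else DEFAULT_CACHE_DIR


print(resolve_cache_dir("/tmp/my_cache"))
```

A per-call argument is the most local setting, so the sketch lets it win over the process-wide options.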

## Team

**cached-path** is developed and maintained by the AllenNLP team, backed by [the Allen Institute for Artificial Intelligence (AI2)](https://allenai.org/).
AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
To learn more about who specifically contributed to this codebase, see [our contributors](https://github.com/allenai/cached_path/graphs/contributors) page.

## License

**cached-path** is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0).
A full copy of the license can be found [on GitHub](https://github.com/allenai/cached_path/blob/main/LICENSE).