https://github.com/perrette/datamanifest
Managing data dependencies for a scientific project
https://github.com/perrette/datamanifest
Last synced: 6 days ago
JSON representation
Managing data dependencies for a scientific project
- Host: GitHub
- URL: https://github.com/perrette/datamanifest
- Owner: perrette
- License: mit
- Created: 2025-07-12T14:08:48.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2026-06-09T12:10:51.000Z (8 days ago)
- Last Synced: 2026-06-09T13:24:30.798Z (8 days ago)
- Language: Python
- Size: 1.63 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
Awesome Lists containing this project
README
# datamanifest[py]
[](https://pypi.org/project/datamanifestpy)

[](https://github.com/perrette/datamanifest/actions/workflows/ci.yaml)
[](https://perrette.github.io/datamanifest/)
Keep track of the datasets used in a scientific project. You declare your data
dependencies — URLs, git repositories, checksums, formats — in a
`datamanifest.toml` file; `datamanifest` downloads, verifies, extracts and loads
them, and caches your own computed results with the same machinery.
- **A transparent, trackable manifest.** Every dataset a project depends on —
URLs, DOIs, checksums, formats — is listed in a single `datamanifest.toml` you
can read at a glance and version with git. The format is
[language-agnostic](https://perrette.github.io/datamanifest.toml) (today Python
and Julia) and can be edited by hand, from code, or through the CLI.
- **Fetch from a wide range of sources.** Direct URLs, Zenodo/figshare and PANGAEA
DOIs, git repos, object stores (`s3://`, `gs://`, …), and bulk imports from pooch, intake
or DVC — all checksum-verified, extracted, and adopted in place when already on
disk.
- **Cache your own computed data too.** The same tooling backs a robust `@cached`
mechanism that stores your own results with PID-lock, keyed by their inputs, to speed up
calculations locally. It is a separate, local concern — not a remote source —
but shares some of the same benefits such as data management via the CLI.
- **A powerful CLI for data download, local management and synchronization across
machines.** Add and download datasets, inspect and repair what's on disk, move
or centralize where data is stored, and push/pull datasets and cached results
between machines over rsync+ssh — all without touching your analysis code. A
git-ignored `.datamanifest/state.toml` records where each object actually landed
on this machine, keeping local location tracking separate from the portable,
shareable manifest.
## Installation
```bash
pip install datamanifestpy
```
With optional loader backends (`csv`, `parquet`, `nc`, `yaml`, `fsspec`, or
`all`):
```bash
pip install "datamanifestpy[all]"
```
See the [installation page](https://perrette.github.io/datamanifest/installation/)
for the per-backend details.
## Quickstart
```bash
datamanifest init # create datamanifest.toml here
datamanifest add https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_annmean_mlo.csv --name co2
datamanifest list # what's tracked, and where it lives
datamanifest path co2 # resolve the on-disk path (for a script)
```
Then load it from your code:
```python
import datamanifest
df = datamanifest.load_dataset("co2") # download on first use, then load
path = datamanifest.get_dataset_path("co2") # just the on-disk path
```
**Commit `datamanifest.toml`** — the recipe of what to fetch and how. The data
lives in a machine-wide shared store (deduplicated across your projects) and
the private `.datamanifest/` directory stays git-ignored; a collaborator clones
and runs `datamanifest download`. See the
[quickstart](https://perrette.github.io/datamanifest/quickstart/) for the full
walkthrough.
## Documentation
Full documentation lives at ****:
- [Installation](https://perrette.github.io/datamanifest/installation/)
- [Quickstart](https://perrette.github.io/datamanifest/quickstart/)
- [Using it from your code](https://perrette.github.io/datamanifest/api/) — `load_dataset`, `@cached`, the file-less `Database`
- [Use cases](https://perrette.github.io/datamanifest/use-cases/) — add, repair, store, sync
- [CLI reference](https://perrette.github.io/datamanifest/cli/)
- [Storage model](https://perrette.github.io/datamanifest/storage/)
- [Adding datasets](https://perrette.github.io/datamanifest/adding-datasets/) · [Importing from other tools](https://perrette.github.io/datamanifest/importing/)
- [Language bindings](https://perrette.github.io/datamanifest/language-bindings/) · [Related projects](https://perrette.github.io/datamanifest/related/)
## From the same author
A few other open-source tools I maintain.
**Scientific writing & data**
- [**texmark**](https://perrette.github.io/texmark/) — write scientific articles in Markdown and convert them to journal-ready LaTeX/PDF.
- [**papers**](https://perrette.github.io/papers/) — command-line BibTeX bibliography and PDF library manager.
**Speech to Text (dictate) and Text to Speech (read-aloud) tools**
- [**scribe**](https://perrette.github.io/scribe/) — speech-to-text dictation.
- [**bard**](https://perrette.github.io/bard/) — text-to-speech reader.
## Acknowledgments
`datamanifest` is a Python port of
[`awi-esc/DataManifest.jl`](https://github.com/awi-esc/DataManifest.jl), written
by the same author (Mahé Perrette). The Python port was implemented with
assistance from [Anthropic's Claude](https://www.anthropic.com/claude).