https://github.com/perrette/datamanifest.toml
Shared TOML manifest schema for DataManifest.jl (Julia) and datamanifest (Python)
- Host: GitHub
- URL: https://github.com/perrette/datamanifest.toml
- Owner: perrette
- License: MIT
- Created: 2026-06-01T22:01:22.000Z (15 days ago)
- Default Branch: main
- Last Pushed: 2026-06-10T17:35:25.000Z (7 days ago)
- Last Synced: 2026-06-10T19:15:04.290Z (7 days ago)
- Language: Python
- Size: 1.28 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
A small, normative specification for the **`datamanifest.toml`** manifest format — a
TOML file that declares the data dependencies of a scientific project (each dataset's
source URI, checksum, version, format, and how to fetch and load it).
One `datasets.toml` is read by tools in different languages, today
[Python](https://github.com/perrette/datamanifest) and
[Julia](https://github.com/awi-esc/DataManifest.jl). The specification covers fetching
(download, checksum, extract, load), portable storage, per-language bindings, and an optional
produce-or-load cache layer. The data model is `_META.schema = 1`; behavioural revisions
are tracked by spec tags (currently `spec-v5`).
- **One manifest, many languages.** A single `datasets.toml` declares each dataset's
source, checksum, format, and how to fetch and load it — and the same file is read
unchanged by tools in [Python](https://github.com/perrette/datamanifest) and
[Julia](https://github.com/awi-esc/DataManifest.jl).
- **Fetch, verify, extract, load.** A tool downloads the dataset, verifies its checksum,
unpacks the archive, and hands your code the local path — re-fetching only when it's
missing. Add a `format` and it loads the data into a native object too.
- **Portable, shared-by-default storage.** Fetched datasets live in one machine-global
keyed store (deduplicated across projects), the produced cache is per-project, and
per-machine layouts go in git-ignored config files or `[_STORAGE._HOST]` glob rules —
the repo itself stays data-free (repo-local folders are one edit away).
- **Produce-or-load caching.** An optional companion layer keys produced artifacts by a
hash of their parameters, so derived data is rebuilt only when its inputs change.
- **Normative and conformance-tested.** The prose spec is the source of truth, backed by
machine-readable JSON Schemas and a shared fixture suite both implementations run.
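The produce-or-load idea in the list above can be sketched in a few lines of standard-library Python. This is an illustration of the general technique only. The names `params_key` and `produce_or_load`, the JSON cache files, and the truncated SHA-256 key are all assumptions, not the layout or API that the spec or its implementations define.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def params_key(params: dict) -> str:
    """Stable hash of the parameters: sorted keys give a canonical JSON form."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def produce_or_load(produce, params: dict, cache_dir: Path):
    """Return the cached result for `params`, producing it on a cache miss."""
    path = cache_dir / f"{params_key(params)}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = produce(**params)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result


calls = []


def smooth(window: int, scale: float):
    calls.append((window, scale))  # record each real computation
    return {"window": window, "value": window * scale}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    first = produce_or_load(smooth, {"window": 3, "scale": 2.0}, cache)
    # Same parameters in a different order: served from the cache.
    again = produce_or_load(smooth, {"scale": 2.0, "window": 3}, cache)
    # Changed input: the artifact is rebuilt under a new key.
    other = produce_or_load(smooth, {"window": 5, "scale": 2.0}, cache)
```

Hashing a canonical serialisation of the parameters, rather than the call order or a timestamp, is what lets the second call reuse the first result while the third triggers a rebuild.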
## 📖 Documentation
Full documentation lives at **<https://perrette.github.io/datamanifest.toml/>**:
- [Quickstart](https://perrette.github.io/datamanifest.toml/quickstart/)
- Guide: [the manifest in one minute](https://perrette.github.io/datamanifest.toml/guide/manifest/),
[declaring datasets](https://perrette.github.io/datamanifest.toml/guide/datasets/),
[language bindings](https://perrette.github.io/datamanifest.toml/guide/bindings/),
[resolution](https://perrette.github.io/datamanifest.toml/guide/resolution/),
[storage](https://perrette.github.io/datamanifest.toml/guide/storage/),
[caching](https://perrette.github.io/datamanifest.toml/guide/caching/),
[maintenance](https://perrette.github.io/datamanifest.toml/guide/maintenance/),
[sync](https://perrette.github.io/datamanifest.toml/guide/sync/),
[conformance](https://perrette.github.io/datamanifest.toml/guide/conformance/),
[migration](https://perrette.github.io/datamanifest.toml/guide/migration/)
- [Schema specification](https://perrette.github.io/datamanifest.toml/schema/) (the normative `SCHEMA.md`)
- [JSON Schemas](https://perrette.github.io/datamanifest.toml/schemas/) ·
[Examples](https://perrette.github.io/datamanifest.toml/examples/) ·
[Conformance fixtures](https://perrette.github.io/datamanifest.toml/fixtures/)
- [Roadmap](https://perrette.github.io/datamanifest.toml/roadmap/) ·
[Changelog](https://perrette.github.io/datamanifest.toml/changelog/)
## Quick look
Declare a dataset — its source and checksum — in `datasets.toml`:
```toml
["jesstierney/lgmDA"]
uri = "https://github.com/jesstierney/lgmDA/archive/refs/tags/v2.1.zip"
checksum = "sha256:da5f85235baf7f858f1b52ed73405f5d4ed28a8f6da92e16070f86b724d8bb25"
extract = true
```
A tool downloads it, verifies the checksum, unpacks the archive, and hands your code the
local path — re-fetching only when it's missing. Add a `format` and it loads the data into a
native object too; the same file is read unchanged by tools in different languages. The full,
runnable manifest is at
[`examples/datasets.toml`](https://github.com/perrette/datamanifest.toml/blob/main/examples/datasets.toml),
and the [quickstart](https://perrette.github.io/datamanifest.toml/quickstart/) walks through a
fuller example.
## Implementations
The Python package [`perrette/datamanifest`](https://github.com/perrette/datamanifest) is
the **reference implementation** and ships the `datamanifest` command-line tool. A Julia
port, [`DataManifest.jl`](https://github.com/awi-esc/DataManifest.jl), tracks the same spec
and shares the conformance fixtures
([`tests/fixtures/`](https://github.com/perrette/datamanifest.toml/tree/main/tests/fixtures)),
so both read the same `datamanifest.toml`.
| Language | Repository | Description |
|---|---|---|
| Python *(reference)* | [perrette/datamanifest](https://github.com/perrette/datamanifest) | **The reference implementation.** Download, verify, extract, and load datasets declared in a manifest; uses entry-point loader references instead of inline code execution. Provides the **`datamanifest` command-line tool**. |
| Julia | [awi-esc/DataManifest.jl](https://github.com/awi-esc/DataManifest.jl) | Download, verify, extract, and load datasets declared in a manifest, with a Julia-native API. |
## From the same author
A few other open-source tools I maintain.
**Scientific writing & data**
- [**texmark**](https://perrette.github.io/texmark/) — write scientific articles in Markdown and convert them to journal-ready LaTeX/PDF.
- [**papers**](https://perrette.github.io/papers/) — command-line BibTeX bibliography and PDF library manager.
- [**datamanifest**](https://perrette.github.io/datamanifest/) — declarative, reproducible dataset management. *(See also the [DataManifest.jl](https://awi-esc.github.io/DataManifest.jl/) Julia port.)*
**Speech to Text (dictate) and Text to Speech (read-aloud) tools**
- [**scribe**](https://perrette.github.io/scribe/) — speech-to-text dictation.
- [**bard**](https://perrette.github.io/bard/) — text-to-speech reader.
## License
MIT — see [LICENSE](LICENSE).