{"id":50759401,"url":"https://github.com/perrette/datamanifest","last_synced_at":"2026-06-11T08:31:01.526Z","repository":{"id":362039854,"uuid":"1018554353","full_name":"perrette/datamanifest","owner":"perrette","description":"Managing data dependencies for a scientific project","archived":false,"fork":false,"pushed_at":"2026-06-09T12:10:51.000Z","size":1711,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-09T13:24:30.798Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/perrette.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"perrette","patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"lfx_crowdfunding":null,"polar":null,"buy_me_a_coffee":null,"thanks_dev":null,"custom":null}},"created_at":"2025-07-12T14:08:48.000Z","updated_at":"2026-06-09T12:11:25.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/perrette/datamanifest","commit_stats":null,"previous_names":["perrette/datamanifest"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/perrette/datamanifest","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/perrette%2Fdatamanifest","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/perrette%2Fdatamanifest/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/perrette%2Fdatamanifest/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/perrette%2Fdatamanifest/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/perrette","download_url":"https://codeload.github.com/perrette/datamanifest/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/perrette%2Fdatamanifest/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34190582,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-11T02:00:06.485Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-11T08:31:00.805Z","updated_at":"2026-06-11T08:31:01.518Z","avatar_url":"https://github.com/perrette.png","language":"Python","funding_links":["https://github.com/sponsors/perrette"],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cpicture\u003e\n    \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://raw.githubusercontent.com/perrette/datamanifest.toml/main/design/logo/lockup-dark.svg\"\u003e\n    \u003cimg src=\"https://raw.githubusercontent.com/perrette/datamanifest.toml/main/design/logo/lockup.svg\" alt=\"datamanifest.toml\" height=\"76\"\u003e\n  \u003c/picture\u003e\n\u003c/p\u003e\n\n# datamanifest[py]\n\n[![pypi](https://img.shields.io/pypi/v/datamanifestpy)](https://pypi.org/project/datamanifestpy)\n![python](https://img.shields.io/python/required-version-toml?tomlFilePath=https%3A%2F%2Fraw.githubusercontent.com%2Fperrette%2Fdatamanifest%2Frefs%2Fheads%2Fmain%2Fpyproject.toml)\n[![CI](https://github.com/perrette/datamanifest/actions/workflows/ci.yaml/badge.svg)](https://github.com/perrette/datamanifest/actions/workflows/ci.yaml)\n[![docs](https://img.shields.io/badge/docs-perrette.github.io%2Fdatamanifest-blue)](https://perrette.github.io/datamanifest/)\n\nKeep track of the datasets used in a scientific project. You declare your data\ndependencies — URLs, git repositories, checksums, formats — in a\n`datamanifest.toml` file; `datamanifest` downloads, verifies, extracts and loads\nthem, and caches your own computed results with the same machinery.\n\n\u003c!-- intro-start --\u003e\n- **A transparent, trackable manifest.** Every dataset a project depends on —\n  URLs, DOIs, checksums, formats — is listed in a single `datamanifest.toml` you\n  can read at a glance and version with git. The format is\n  [language-agnostic](https://perrette.github.io/datamanifest.toml) (today Python\n  and Julia) and can be edited by hand, from code, or through the CLI.\n- **Fetch from a wide range of sources.** Direct URLs, Zenodo/figshare and PANGAEA\n  DOIs, git repos, object stores (`s3://`, `gs://`, …), and bulk imports from pooch, intake\n  or DVC — all checksum-verified, extracted, and adopted in place when already on\n  disk.\n- **Cache your own computed data too.** The same tooling backs a robust `@cached`\n  mechanism that stores your own results with PID-lock, keyed by their inputs, to speed up\n  calculations locally. It is a separate, local concern — not a remote source —\n  but shares some of the same benefits such as data management via the CLI.\n- **A powerful CLI for data download, local management and synchronization across\n  machines.** Add and download datasets, inspect and repair what's on disk, move\n  or centralize where data is stored, and push/pull datasets and cached results\n  between machines over rsync+ssh — all without touching your analysis code. A\n  git-ignored `.datamanifest/state.toml` records where each object actually landed\n  on this machine, keeping local location tracking separate from the portable,\n  shareable manifest.\n\u003c!-- intro-end --\u003e\n\n## Installation\n\n```bash\npip install datamanifestpy\n```\n\nWith optional loader backends (`csv`, `parquet`, `nc`, `yaml`, `fsspec`, or\n`all`):\n\n```bash\npip install \"datamanifestpy[all]\"\n```\n\nSee the [installation page](https://perrette.github.io/datamanifest/installation/)\nfor the per-backend details.\n\n## Quickstart\n\n```bash\ndatamanifest init                  # create datamanifest.toml here\ndatamanifest add https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_annmean_mlo.csv --name co2\ndatamanifest list                  # what's tracked, and where it lives\ndatamanifest path co2              # resolve the on-disk path (for a script)\n```\n\nThen load it from your code:\n\n```python\nimport datamanifest\n\ndf = datamanifest.load_dataset(\"co2\")          # download on first use, then load\npath = datamanifest.get_dataset_path(\"co2\")    # just the on-disk path\n```\n\n**Commit `datamanifest.toml`** — the recipe of what to fetch and how. The data\nlives in a machine-wide shared store (deduplicated across your projects) and\nthe private `.datamanifest/` directory stays git-ignored; a collaborator clones\nand runs `datamanifest download`. See the\n[quickstart](https://perrette.github.io/datamanifest/quickstart/) for the full\nwalkthrough.\n\n## Documentation\n\nFull documentation lives at **\u003chttps://perrette.github.io/datamanifest/\u003e**:\n\n- [Installation](https://perrette.github.io/datamanifest/installation/)\n- [Quickstart](https://perrette.github.io/datamanifest/quickstart/)\n- [Using it from your code](https://perrette.github.io/datamanifest/api/) — `load_dataset`, `@cached`, the file-less `Database`\n- [Use cases](https://perrette.github.io/datamanifest/use-cases/) — add, repair, store, sync\n- [CLI reference](https://perrette.github.io/datamanifest/cli/)\n- [Storage model](https://perrette.github.io/datamanifest/storage/)\n- [Adding datasets](https://perrette.github.io/datamanifest/adding-datasets/) · [Importing from other tools](https://perrette.github.io/datamanifest/importing/)\n- [Language bindings](https://perrette.github.io/datamanifest/language-bindings/) · [Related projects](https://perrette.github.io/datamanifest/related/)\n\n## From the same author\n\nA few other open-source tools I maintain.\n\n**Scientific writing \u0026 data**\n\n- [**texmark**](https://perrette.github.io/texmark/) — write scientific articles in Markdown and convert them to journal-ready LaTeX/PDF.\n- [**papers**](https://perrette.github.io/papers/) — command-line BibTeX bibliography and PDF library manager.\n\n**Speech to Text (dictate) and Text to Speech (read-aloud) tools**\n\n- [**scribe**](https://perrette.github.io/scribe/) — speech-to-text dictation.\n- [**bard**](https://perrette.github.io/bard/) — text-to-speech reader.\n\n## Acknowledgments\n\n`datamanifest` is a Python port of\n[`awi-esc/DataManifest.jl`](https://github.com/awi-esc/DataManifest.jl), written\nby the same author (Mahé Perrette). The Python port was implemented with\nassistance from [Anthropic's Claude](https://www.anthropic.com/claude).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fperrette%2Fdatamanifest","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fperrette%2Fdatamanifest","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fperrette%2Fdatamanifest/lists"}