{"id":50749095,"url":"https://github.com/musharna/data-aggregator-mcp","last_synced_at":"2026-06-11T00:00:50.963Z","repository":{"id":361577928,"uuid":"1254100112","full_name":"musharna/data-aggregator-mcp","owner":"musharna","description":"Unified research-data acquisition MCP — search \u0026 fetch datasets across Zenodo, DataCite, NCBI omics (GEO/SRA/BioProject), and literature (PubMed/OpenAIRE) behind one normalized model.","archived":false,"fork":false,"pushed_at":"2026-06-07T21:07:03.000Z","size":967,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-07T23:11:08.361Z","etag":null,"topics":["bioinformatics","datacite","datasets","mcp","model-context-protocol","ncbi","pubmed","python","research-data","zenodo"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/data-aggregator-mcp/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/musharna.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-30T06:28:04.000Z","updated_at":"2026-06-07T21:07:06.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/musharna/data-aggregator-mcp","commit_stats":null,"previous_names":["musharna/data-aggregator-mcp"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/musharna/data-aggregator-mcp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/musharna%2Fdata-aggregator-mcp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/musharna%2Fdata-aggregator-mcp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/musharna%2Fdata-aggregator-mcp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/musharna%2Fdata-aggregator-mcp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/musharna","download_url":"https://codeload.github.com/musharna/data-aggregator-mcp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/musharna%2Fdata-aggregator-mcp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34175887,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-10T02:00:07.152Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","datacite","datasets","mcp","model-context-protocol","ncbi","pubmed","python","research-data","zenodo"],"created_at":"2026-06-11T00:00:29.695Z","updated_at":"2026-06-11T00:00:50.955Z","avatar_url":"https://github.com/musharna.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🔎 data-aggregator-mcp\n\n**One MCP server to find and fetch research data across archives, omics\nregistries, and literature — behind a single normalized model.**\n\n[![PyPI](https://img.shields.io/pypi/v/data-aggregator-mcp.svg)](https://pypi.org/project/data-aggregator-mcp/)\n[![Python](https://img.shields.io/pypi/pyversions/data-aggregator-mcp.svg)](https://pypi.org/project/data-aggregator-mcp/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n[![CI](https://github.com/musharna/data-aggregator-mcp/actions/workflows/ci.yml/badge.svg)](https://github.com/musharna/data-aggregator-mcp/actions/workflows/ci.yml)\n[![Glama](https://glama.ai/mcp/servers/musharna/data-aggregator-mcp/badges/score.svg)](https://glama.ai/mcp/servers/musharna/data-aggregator-mcp)\n\n`search` one query across **Zenodo, DataCite** (Dryad / Figshare / Dataverse /\nOSF / Mendeley), **NCBI omics** (GEO / SRA / BioProject), **DataONE** (eco /\nenvironmental), **literature** (PubMed / OpenAIRE), **OmicsDI** (proteomics /\nmetabolomics), and **HuggingFace** datasets — deduplicated, normalized, and\ncross-linked. `resolve` any hit to its file manifest, citation, trust signals,\nand the data it points at. `fetch` it to disk with checksum verification.\n\nmcp-name: io.github.musharna/data-aggregator-mcp\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"examples/assets/demo.svg\"\n       alt=\"data-aggregator-mcp stdio demo — initialize, tools/list (search, resolve, fetch, list_sources), and a live list_sources call showing the four wired sources\"\n       width=\"820\"\u003e\n\u003c/p\u003e\n\n## ✨ Why this\n\nMost data MCPs wrap a single source. This one **unifies** them behind five tools\nand one `DataResource` model, so an agent searches once and gets back comparable\nrecords:\n\n- **Multi-domain, one model** — generalist archives + raw omics + literature,\n  deduplicated by DOI (the fetchable record wins over bare metadata).\n- **Taxonomy synonym expansion** — `organism=\"Orobanche aegyptiaca\"` also matches\n  `Phelipanche aegyptiaca` (NCBI Taxonomy), so a species rename doesn't cost you\n  results.\n- **Paper → data bridge** — resolve a paper and get links to the GEO / SRA /\n  BioProject / DataCite records it produced.\n- **Verified fetch** — streams to disk with md5 verification where the source\n  exposes a checksum, optional archive unpacking, and a fail-loud integrity\n  sniff that rejects an HTML paywall page served as a \"PDF\".\n- **Citations, access \u0026 full text** — render a citation in any CSL style, get\n  normalized access/license, and pull open-access full text — all in one\n  `resolve`.\n- **Trust signals** — usage `metrics` (citations / views / downloads / likes),\n  version status (`is_latest` / `superseded_by`), and `last_updated` freshness,\n  surfaced wherever the source exposes them.\n- **Interop exports** — `resolve(format=\"croissant\")` or `\"ro-crate\"` hands a\n  dataset to an ML or research-packaging pipeline as standard JSON-LD.\n- **Operate on data in place** — `operate` reads the schema, previews rows, or\n  runs a read-only SQL `SELECT` against a remote Parquet/CSV/TSV **without\n  downloading it** (Parquet footer + DuckDB httpfs range reads). Optional\n  `[operate]` extra; base install is unchanged.\n\n→ Full rationale and a comparison vs. single-source servers, breadth gateways, and\nML-dataset tools: **[docs/POSITIONING.md](docs/POSITIONING.md)**.\n\n## ⚡ Quickstart\n\nRun with no install:\n\n```bash\nuvx data-aggregator-mcp\n```\n\nRegister with Claude Code:\n\n```bash\nclaude mcp add data-aggregator -- uvx data-aggregator-mcp\n```\n\nA typical agent flow:\n\n```text\nsearch(\"drought stress RNA-seq\", organism=\"Sorghum bicolor\")\n  → [ geo:GSE..., sra:SRX..., zenodo:..., pubmed:... ]   # deduped, taxa-normalized\n\nresolve(\"sra:SRX079566\")\n  → DataResource{ files: [ENA FASTQ urls…], access: \"open\", taxa: [...] }\n\nfetch(\"sra:SRX079566\", dest=\"./data\")\n  → [\"./data/SRX079566_1.fastq.gz\", …]                   # md5-verified\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eOther ways to run (pip, python -m, raw client config)\u003c/summary\u003e\n\n```bash\npip install data-aggregator-mcp\ndata-aggregator-mcp        # or: python -m data_aggregator_mcp\n```\n\nTo use the `operate` tool (query remote tabular files in place), install the\noptional extra:\n\n```bash\npip install \"data-aggregator-mcp[operate]\"\n```\n\nAdd to a client's MCP config (e.g. Claude Desktop `claude_desktop_config.json`):\n\n```json\n{\n  \"mcpServers\": {\n    \"data-aggregator\": {\n      \"command\": \"uvx\",\n      \"args\": [\"data-aggregator-mcp\"],\n      \"env\": { \"NCBI_API_KEY\": \"your-optional-key\" }\n    }\n  }\n}\n```\n\n\u003c/details\u003e\n\n## 🗂️ Sources\n\n| Source                       | Discover |       Fetch       |     Checksum     |\n| ---------------------------- | :------: | :---------------: | :--------------: |\n| Zenodo                       |    ✅    |        ✅         |       md5        |\n| DataCite → Figshare          |    ✅    |        ✅         |       md5        |\n| DataCite → Dataverse         |    ✅    |        ✅         |       md5        |\n| DataCite → OSF               |    ✅    |        ✅         |       md5        |\n| DataCite → Dryad             |    ✅    |  manifest only¹   | sha-256 (listed) |\n| DataCite → Mendeley \u0026 others |    ✅    |         —         |        —         |\n| NCBI SRA                     |    ✅    |  ✅ (ENA FASTQ)   |       md5        |\n| NCBI GEO                     |    ✅    |   ✅ (`suppl/`)   |      none²       |\n| NCBI BioProject              |    ✅    |    → SRA links    |        —         |\n| PubMed / OpenAIRE            |    ✅    | ✅ (OA full text) |      none²       |\n| HuggingFace datasets         |    ✅    | ✅ (resolve URL)  |       none       |\n| DataONE (eco/env)            |    ✅    | ✅ (Member Node)  |  md5 / sha-256   |\n| OmicsDI → PRIDE              |    ✅    |  ✅ (HTTPS FTP)   |    size only     |\n| OmicsDI → MetaboLights       |    ✅    |  ✅ (HTTPS FTP)   |       none       |\n| OmicsDI → other MS repos     |    ✅    |         —         |        —         |\n\n¹ Dryad downloads are token / bot-challenge gated, so `fetch` fails loud;\n`resolve` still lists the files.\n² No upstream checksum — `fetch` verifies content-type instead (rejects an HTML\npage served in place of a binary).\n\n## 🛠️ Tools\n\n### `search(query?, size?, sources?, organism?, kind?, published_after?, published_before?, rank?, cursor?)`\n\nFan out across all wired sources in parallel and return compact `DataResource`\nrecords, deduped by DOI. Per-source failures land in `errors{}` — never silently\ndropped.\n\n- `organism` — expand the query with NCBI-Taxonomy synonyms; the expansion is\n  echoed in `taxon_expansion`, and results carry normalized `taxa[]`\n  (`{taxid, name}`) plus a `described_in` link to plant-genomics-mcp for plant\n  taxa.\n- `sources` — restrict the fan-out, e.g. `[\"omics\"]`.\n- `size` — max results (1–50).\n- `kind` — keep only `dataset` / `sequencing_run` / `study` / `publication` /\n  `software`.\n- `published_after` / `published_before` — filter by publication year.\n- `rank` — `relevance` (default) or `semantic` (re-rank the fetched page by\n  embedding similarity to the query; needs `EMBEDDING_API_BASE`, degrades to\n  relevance order otherwise).\n- `cursor` — opaque token from a prior result's `next_cursor`; pages forward\n  across every source. In `cursor` mode the other params are read from the\n  token, so `query` is optional.\n\n### `resolve(id, cite?, format?)`\n\nFull record + files manifest. Routes by id shape — `zenodo:7654321`, a bare DOI,\n`datacite:10.5061/dryad.x`, an omics id (`sra:SRX079566`, `geo:GSE332789`,\n`bioproject:PRJNA1468572`), a literature id (`pubmed:34320281`, `openaire:\u003cid\u003e`),\na HuggingFace id (`hf:owner/name`), a DataONE id (`dataone:doi:10.5063/F1HT2M7Q`),\nor an OmicsDI id (`omicsdi:pride:PXD000001`). Attaches, where available:\n\n- **`files[]`** — ENA FASTQ manifest (SRA), GEO `suppl/`, or the host repo's\n  native manifest (Figshare / Dataverse / OSF / Dryad).\n- **`links[]`** — paper → data: `pubmed:` → `sra:` / `geo:` / `bioproject:` (NCBI\n  elink); `openaire:` → `datacite:` (ScholeXplorer Scholix).\n- **`access` / `license`** — normalized status\n  (`open` / `embargoed` / `restricted` / `closed` / `unknown`) and license where\n  the source exposes it.\n- **`identifiers`** — normalized `{pmid, pmcid, doi}`, plus an open-access\n  full-text `FileEntry` (EuropePMC XML, or an Unpaywall PDF fallback) for papers.\n- **`citation`** — pass `cite=\u003cformat\u003e`: `bibtex`, `ris`, `csl-json`, or any CSL\n  style name (`apa`, `mla`, `vancouver`, …). DOI records use content\n  negotiation; others render CSL-JSON from metadata. Off by default; failures\n  degrade quietly.\n- **trust signals** — `metrics` (citations / views / downloads / likes),\n  `is_latest` / `superseded_by` (derived from version links), and `last_updated`\n  freshness, where the source provides them.\n- **`format`** — pass `format=\"croissant\"` (file-level Croissant JSON-LD) or\n  `\"ro-crate\"` (minimal RO-Crate 1.1) to attach a standard manifest under the\n  matching field, for ML or research-packaging pipelines.\n\n### `fetch(id, dest?, files?, max_bytes?, force?, extract?)`\n\nDownload files to disk and return their paths. Streams under a `max_bytes` guard\n(`force` to override) with md5 verification wherever a checksum exists.\n\n- `files` — restrict to a subset of the resolved manifest.\n- `extract` — unpack downloaded zip / tar archives in place, guarded against\n  path traversal and runaway extracted size. Off by default.\n- Unverified fetches (GEO `suppl/`, literature full text) get a content-type\n  sniff that fails loud if a declared binary is actually an HTML page.\n- Fetchable: **Zenodo**, **SRA**, **GEO**, **DataONE** (Member-Node objects,\n  md5/sha-256 verified), DataCite-hosted **Figshare** / **Dataverse** / **OSF**,\n  **HuggingFace** datasets, **PRIDE** / **MetaboLights** (via OmicsDI, unverified),\n  and **literature** open-access full text. **Dryad**, other DataCite repos, and\n  other OmicsDI repos (MassIVE / GNPS / ...) are discovery-only and raise\n  `FetchNotSupportedError`.\n\n### `list_sources()`\n\nWired sources with their capabilities — layer, kinds, supported filters,\nfetchability, `operable` flag, id examples, auth, and rate limits.\n\n### `operate(op, id, file?, query?, n?, columns?)`\n\nInspect or query a remote tabular file (Parquet / CSV / TSV) **without\ndownloading it**. Addresses a file by catalog `id` + `file` name (defaults to the\nfirst tabular file on the resolved record). Ops:\n\n- `schema` — column names + types (reads the Parquet footer / sniffs the CSV\n  header; no full load).\n- `preview` — a small sample of rows.\n- `head` — the first `n` rows (default 20), optionally restricted to `columns`.\n- `sql` — a read-only `SELECT` (the file is the view `data`), e.g.\n  `SELECT col, count(*) FROM data GROUP BY 1`.\n\nBacked by the Parquet footer reader + DuckDB `httpfs` range reads. `sql` runs in\na locked-down DuckDB (read-only, local filesystem disabled, single-SELECT\nvalidation, row / wall-clock caps). Requires the optional `[operate]` extra\n(`pip install data-aggregator-mcp[operate]`); without it, `operate` returns a\nclear install-the-extra message and the other four tools are unaffected.\n\nAny HuggingFace dataset with a datasets-server converted view is operable\n(`schema` / `preview` / `head` / `sql`): `resolve` surfaces the auto-converted\nParquet files (`source=\"hf-datasets-server\"`) even for datasets stored as\nJSON/JSONL/arrow, so pass `file=\u003cconfig\u003e/\u003csplit\u003e/...parquet` to pick a split when\nthere are several.\n\n### Prompts\n\nThree workflow prompts surface in clients (e.g. `/mcp__data_aggregator__*` in\nClaude Code):\n\n- **`find_data`** — find datasets for a topic, optionally scoped to an organism.\n- **`data_behind_paper`** — find the datasets / accessions behind a paper.\n- **`search_resolve_fetch`** — walk the end-to-end search → resolve → fetch flow.\n\n## ⚙️ Configuration\n\nBoth optional, set via environment variables:\n\n- `NCBI_API_KEY` — raises the NCBI E-utilities rate limit (3 → 10 req/s) used by\n  the omics, literature, and taxonomy lookups.\n- `UNPAYWALL_EMAIL` — enables the Unpaywall fallback leg of literature full-text\n  retrieval (the EuropePMC leg works without it).\n\n## 🧪 Develop\n\n```bash\nuv venv \u0026\u0026 uv pip install -e \".[dev]\"\nuv run pytest -q\nuv run ruff check src tests\nDATA_AGGREGATOR_MCP_LIVE=1 uv run pytest -k live -q   # real-API probes\n```\n\nThe README demo (`examples/assets/demo.svg`) is recorded network-free from\n`examples/_demo_stdio.py` — see the header of that file to re-record.\n\n## License\n\nMIT — see [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmusharna%2Fdata-aggregator-mcp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmusharna%2Fdata-aggregator-mcp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmusharna%2Fdata-aggregator-mcp/lists"}