https://github.com/musharna/data-aggregator-mcp
Unified research-data acquisition MCP β search & fetch datasets across Zenodo, DataCite, NCBI omics (GEO/SRA/BioProject), and literature (PubMed/OpenAIRE) behind one normalized model.
https://github.com/musharna/data-aggregator-mcp
bioinformatics datacite datasets mcp model-context-protocol ncbi pubmed python research-data zenodo
Last synced: 7 days ago
JSON representation
Unified research-data acquisition MCP β search & fetch datasets across Zenodo, DataCite, NCBI omics (GEO/SRA/BioProject), and literature (PubMed/OpenAIRE) behind one normalized model.
- Host: GitHub
- URL: https://github.com/musharna/data-aggregator-mcp
- Owner: musharna
- License: mit
- Created: 2026-05-30T06:28:04.000Z (19 days ago)
- Default Branch: main
- Last Pushed: 2026-06-07T21:07:03.000Z (10 days ago)
- Last Synced: 2026-06-07T23:11:08.361Z (10 days ago)
- Topics: bioinformatics, datacite, datasets, mcp, model-context-protocol, ncbi, pubmed, python, research-data, zenodo
- Language: Python
- Homepage: https://pypi.org/project/data-aggregator-mcp/
- Size: 944 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# π data-aggregator-mcp
**One MCP server to find and fetch research data across archives, omics
registries, and literature β behind a single normalized model.**
[](https://pypi.org/project/data-aggregator-mcp/)
[](https://pypi.org/project/data-aggregator-mcp/)
[](LICENSE)
[](https://github.com/musharna/data-aggregator-mcp/actions/workflows/ci.yml)
[](https://glama.ai/mcp/servers/musharna/data-aggregator-mcp)
`search` one query across **Zenodo, DataCite** (Dryad / Figshare / Dataverse /
OSF / Mendeley), **NCBI omics** (GEO / SRA / BioProject), **DataONE** (eco /
environmental), **literature** (PubMed / OpenAIRE), **OmicsDI** (proteomics /
metabolomics), and **HuggingFace** datasets β deduplicated, normalized, and
cross-linked. `resolve` any hit to its file manifest, citation, trust signals,
and the data it points at. `fetch` it to disk with checksum verification.
mcp-name: io.github.musharna/data-aggregator-mcp
## β¨ Why this
Most data MCPs wrap a single source. This one **unifies** them behind five tools
and one `DataResource` model, so an agent searches once and gets back comparable
records:
- **Multi-domain, one model** β generalist archives + raw omics + literature,
deduplicated by DOI (the fetchable record wins over bare metadata).
- **Taxonomy synonym expansion** β `organism="Orobanche aegyptiaca"` also matches
`Phelipanche aegyptiaca` (NCBI Taxonomy), so a species rename doesn't cost you
results.
- **Paper β data bridge** β resolve a paper and get links to the GEO / SRA /
BioProject / DataCite records it produced.
- **Verified fetch** β streams to disk with md5 verification where the source
exposes a checksum, optional archive unpacking, and a fail-loud integrity
sniff that rejects an HTML paywall page served as a "PDF".
- **Citations, access & full text** β render a citation in any CSL style, get
normalized access/license, and pull open-access full text β all in one
`resolve`.
- **Trust signals** β usage `metrics` (citations / views / downloads / likes),
version status (`is_latest` / `superseded_by`), and `last_updated` freshness,
surfaced wherever the source exposes them.
- **Interop exports** β `resolve(format="croissant")` or `"ro-crate"` hands a
dataset to an ML or research-packaging pipeline as standard JSON-LD.
- **Operate on data in place** β `operate` reads the schema, previews rows, or
runs a read-only SQL `SELECT` against a remote Parquet/CSV/TSV **without
downloading it** (Parquet footer + DuckDB httpfs range reads). Optional
`[operate]` extra; base install is unchanged.
β Full rationale and a comparison vs. single-source servers, breadth gateways, and
ML-dataset tools: **[docs/POSITIONING.md](docs/POSITIONING.md)**.
## β‘ Quickstart
Run with no install:
```bash
uvx data-aggregator-mcp
```
Register with Claude Code:
```bash
claude mcp add data-aggregator -- uvx data-aggregator-mcp
```
A typical agent flow:
```text
search("drought stress RNA-seq", organism="Sorghum bicolor")
β [ geo:GSE..., sra:SRX..., zenodo:..., pubmed:... ] # deduped, taxa-normalized
resolve("sra:SRX079566")
β DataResource{ files: [ENA FASTQ urlsβ¦], access: "open", taxa: [...] }
fetch("sra:SRX079566", dest="./data")
β ["./data/SRX079566_1.fastq.gz", β¦] # md5-verified
```
Other ways to run (pip, python -m, raw client config)
```bash
pip install data-aggregator-mcp
data-aggregator-mcp # or: python -m data_aggregator_mcp
```
To use the `operate` tool (query remote tabular files in place), install the
optional extra:
```bash
pip install "data-aggregator-mcp[operate]"
```
Add to a client's MCP config (e.g. Claude Desktop `claude_desktop_config.json`):
```json
{
"mcpServers": {
"data-aggregator": {
"command": "uvx",
"args": ["data-aggregator-mcp"],
"env": { "NCBI_API_KEY": "your-optional-key" }
}
}
}
```
## ποΈ Sources
| Source | Discover | Fetch | Checksum |
| ---------------------------- | :------: | :---------------: | :--------------: |
| Zenodo | β
| β
| md5 |
| DataCite β Figshare | β
| β
| md5 |
| DataCite β Dataverse | β
| β
| md5 |
| DataCite β OSF | β
| β
| md5 |
| DataCite β Dryad | β
| manifest onlyΒΉ | sha-256 (listed) |
| DataCite β Mendeley & others | β
| β | β |
| NCBI SRA | β
| β
(ENA FASTQ) | md5 |
| NCBI GEO | β
| β
(`suppl/`) | noneΒ² |
| NCBI BioProject | β
| β SRA links | β |
| PubMed / OpenAIRE | β
| β
(OA full text) | noneΒ² |
| HuggingFace datasets | β
| β
(resolve URL) | none |
| DataONE (eco/env) | β
| β
(Member Node) | md5 / sha-256 |
| OmicsDI β PRIDE | β
| β
(HTTPS FTP) | size only |
| OmicsDI β MetaboLights | β
| β
(HTTPS FTP) | none |
| OmicsDI β other MS repos | β
| β | β |
ΒΉ Dryad downloads are token / bot-challenge gated, so `fetch` fails loud;
`resolve` still lists the files.
Β² No upstream checksum β `fetch` verifies content-type instead (rejects an HTML
page served in place of a binary).
## π οΈ Tools
### `search(query?, size?, sources?, organism?, kind?, published_after?, published_before?, rank?, cursor?)`
Fan out across all wired sources in parallel and return compact `DataResource`
records, deduped by DOI. Per-source failures land in `errors{}` β never silently
dropped.
- `organism` β expand the query with NCBI-Taxonomy synonyms; the expansion is
echoed in `taxon_expansion`, and results carry normalized `taxa[]`
(`{taxid, name}`) plus a `described_in` link to plant-genomics-mcp for plant
taxa.
- `sources` β restrict the fan-out, e.g. `["omics"]`.
- `size` β max results (1β50).
- `kind` β keep only `dataset` / `sequencing_run` / `study` / `publication` /
`software`.
- `published_after` / `published_before` β filter by publication year.
- `rank` β `relevance` (default) or `semantic` (re-rank the fetched page by
embedding similarity to the query; needs `EMBEDDING_API_BASE`, degrades to
relevance order otherwise).
- `cursor` β opaque token from a prior result's `next_cursor`; pages forward
across every source. In `cursor` mode the other params are read from the
token, so `query` is optional.
### `resolve(id, cite?, format?)`
Full record + files manifest. Routes by id shape β `zenodo:7654321`, a bare DOI,
`datacite:10.5061/dryad.x`, an omics id (`sra:SRX079566`, `geo:GSE332789`,
`bioproject:PRJNA1468572`), a literature id (`pubmed:34320281`, `openaire:`),
a HuggingFace id (`hf:owner/name`), a DataONE id (`dataone:doi:10.5063/F1HT2M7Q`),
or an OmicsDI id (`omicsdi:pride:PXD000001`). Attaches, where available:
- **`files[]`** β ENA FASTQ manifest (SRA), GEO `suppl/`, or the host repo's
native manifest (Figshare / Dataverse / OSF / Dryad).
- **`links[]`** β paper β data: `pubmed:` β `sra:` / `geo:` / `bioproject:` (NCBI
elink); `openaire:` β `datacite:` (ScholeXplorer Scholix).
- **`access` / `license`** β normalized status
(`open` / `embargoed` / `restricted` / `closed` / `unknown`) and license where
the source exposes it.
- **`identifiers`** β normalized `{pmid, pmcid, doi}`, plus an open-access
full-text `FileEntry` (EuropePMC XML, or an Unpaywall PDF fallback) for papers.
- **`citation`** β pass `cite=`: `bibtex`, `ris`, `csl-json`, or any CSL
style name (`apa`, `mla`, `vancouver`, β¦). DOI records use content
negotiation; others render CSL-JSON from metadata. Off by default; failures
degrade quietly.
- **trust signals** β `metrics` (citations / views / downloads / likes),
`is_latest` / `superseded_by` (derived from version links), and `last_updated`
freshness, where the source provides them.
- **`format`** β pass `format="croissant"` (file-level Croissant JSON-LD) or
`"ro-crate"` (minimal RO-Crate 1.1) to attach a standard manifest under the
matching field, for ML or research-packaging pipelines.
### `fetch(id, dest?, files?, max_bytes?, force?, extract?)`
Download files to disk and return their paths. Streams under a `max_bytes` guard
(`force` to override) with md5 verification wherever a checksum exists.
- `files` β restrict to a subset of the resolved manifest.
- `extract` β unpack downloaded zip / tar archives in place, guarded against
path traversal and runaway extracted size. Off by default.
- Unverified fetches (GEO `suppl/`, literature full text) get a content-type
sniff that fails loud if a declared binary is actually an HTML page.
- Fetchable: **Zenodo**, **SRA**, **GEO**, **DataONE** (Member-Node objects,
md5/sha-256 verified), DataCite-hosted **Figshare** / **Dataverse** / **OSF**,
**HuggingFace** datasets, **PRIDE** / **MetaboLights** (via OmicsDI, unverified),
and **literature** open-access full text. **Dryad**, other DataCite repos, and
other OmicsDI repos (MassIVE / GNPS / ...) are discovery-only and raise
`FetchNotSupportedError`.
### `list_sources()`
Wired sources with their capabilities β layer, kinds, supported filters,
fetchability, `operable` flag, id examples, auth, and rate limits.
### `operate(op, id, file?, query?, n?, columns?)`
Inspect or query a remote tabular file (Parquet / CSV / TSV) **without
downloading it**. Addresses a file by catalog `id` + `file` name (defaults to the
first tabular file on the resolved record). Ops:
- `schema` β column names + types (reads the Parquet footer / sniffs the CSV
header; no full load).
- `preview` β a small sample of rows.
- `head` β the first `n` rows (default 20), optionally restricted to `columns`.
- `sql` β a read-only `SELECT` (the file is the view `data`), e.g.
`SELECT col, count(*) FROM data GROUP BY 1`.
Backed by the Parquet footer reader + DuckDB `httpfs` range reads. `sql` runs in
a locked-down DuckDB (read-only, local filesystem disabled, single-SELECT
validation, row / wall-clock caps). Requires the optional `[operate]` extra
(`pip install data-aggregator-mcp[operate]`); without it, `operate` returns a
clear install-the-extra message and the other four tools are unaffected.
Any HuggingFace dataset with a datasets-server converted view is operable
(`schema` / `preview` / `head` / `sql`): `resolve` surfaces the auto-converted
Parquet files (`source="hf-datasets-server"`) even for datasets stored as
JSON/JSONL/arrow, so pass `file=//...parquet` to pick a split when
there are several.
### Prompts
Three workflow prompts surface in clients (e.g. `/mcp__data_aggregator__*` in
Claude Code):
- **`find_data`** β find datasets for a topic, optionally scoped to an organism.
- **`data_behind_paper`** β find the datasets / accessions behind a paper.
- **`search_resolve_fetch`** β walk the end-to-end search β resolve β fetch flow.
## βοΈ Configuration
Both optional, set via environment variables:
- `NCBI_API_KEY` β raises the NCBI E-utilities rate limit (3 β 10 req/s) used by
the omics, literature, and taxonomy lookups.
- `UNPAYWALL_EMAIL` β enables the Unpaywall fallback leg of literature full-text
retrieval (the EuropePMC leg works without it).
## π§ͺ Develop
```bash
uv venv && uv pip install -e ".[dev]"
uv run pytest -q
uv run ruff check src tests
DATA_AGGREGATOR_MCP_LIVE=1 uv run pytest -k live -q # real-API probes
```
The README demo (`examples/assets/demo.svg`) is recorded network-free from
`examples/_demo_stdio.py` β see the header of that file to re-record.
## License
MIT β see [LICENSE](LICENSE).