https://github.com/musharna/data-aggregator-mcp

Unified research-data acquisition MCP: search & fetch datasets across Zenodo, DataCite, NCBI omics (GEO/SRA/BioProject), and literature (PubMed/OpenAIRE) behind one normalized model.

bioinformatics datacite datasets mcp model-context-protocol ncbi pubmed python research-data zenodo

# 🔎 data-aggregator-mcp

**One MCP server to find and fetch research data across archives, omics
registries, and literature, behind a single normalized model.**

[![PyPI](https://img.shields.io/pypi/v/data-aggregator-mcp.svg)](https://pypi.org/project/data-aggregator-mcp/)
[![Python](https://img.shields.io/pypi/pyversions/data-aggregator-mcp.svg)](https://pypi.org/project/data-aggregator-mcp/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![CI](https://github.com/musharna/data-aggregator-mcp/actions/workflows/ci.yml/badge.svg)](https://github.com/musharna/data-aggregator-mcp/actions/workflows/ci.yml)
[![Glama](https://glama.ai/mcp/servers/musharna/data-aggregator-mcp/badges/score.svg)](https://glama.ai/mcp/servers/musharna/data-aggregator-mcp)

`search` one query across **Zenodo, DataCite** (Dryad / Figshare / Dataverse /
OSF / Mendeley), **NCBI omics** (GEO / SRA / BioProject), **DataONE** (eco /
environmental), **literature** (PubMed / OpenAIRE), **OmicsDI** (proteomics /
metabolomics), and **HuggingFace** datasets, with results deduplicated,
normalized, and cross-linked. `resolve` any hit to its file manifest, citation,
trust signals, and the data it points at. `fetch` it to disk with checksum
verification.

mcp-name: io.github.musharna/data-aggregator-mcp


data-aggregator-mcp stdio demo: initialize, tools/list (search, resolve, fetch, list_sources), and a live list_sources call showing the four wired sources

## ✨ Why this

Most data MCPs wrap a single source. This one **unifies** them behind five tools
and one `DataResource` model, so an agent searches once and gets back comparable
records:

- **Multi-domain, one model**: generalist archives + raw omics + literature,
deduplicated by DOI (the fetchable record wins over bare metadata).
- **Taxonomy synonym expansion**: `organism="Orobanche aegyptiaca"` also matches
`Phelipanche aegyptiaca` (NCBI Taxonomy), so a species rename doesn't cost you
results.
- **Paper → data bridge**: resolve a paper and get links to the GEO / SRA /
BioProject / DataCite records it produced.
- **Verified fetch**: streams to disk with md5 verification where the source
exposes a checksum, optional archive unpacking, and a fail-loud integrity
sniff that rejects an HTML paywall page served as a "PDF".
- **Citations, access & full text**: render a citation in any CSL style, get
normalized access/license, and pull open-access full text, all in one
`resolve`.
- **Trust signals**: usage `metrics` (citations / views / downloads / likes),
version status (`is_latest` / `superseded_by`), and `last_updated` freshness,
surfaced wherever the source exposes them.
- **Interop exports**: `resolve(format="croissant")` or `"ro-crate"` hands a
dataset to an ML or research-packaging pipeline as standard JSON-LD.
- **Operate on data in place**: `operate` reads the schema, previews rows, or
runs a read-only SQL `SELECT` against a remote Parquet/CSV/TSV **without
downloading it** (Parquet footer + DuckDB httpfs range reads). Optional
`[operate]` extra; base install is unchanged.
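As an illustration of the "fetchable record wins" rule, here is a minimal sketch of DOI-based deduplication. The `Record` class and `dedupe_by_doi` function are hypothetical stand-ins, not the server's actual `DataResource` model or implementation; the sketch only assumes that a record with a non-empty file list counts as fetchable.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    """Hypothetical stand-in for the server's DataResource model."""
    id: str
    doi: Optional[str] = None
    files: list = field(default_factory=list)


def dedupe_by_doi(records):
    """Keep one record per DOI, preferring a record that lists files."""
    best = {}          # lower-cased DOI -> chosen record
    passthrough = []   # records without a DOI cannot be deduplicated
    for rec in records:
        if not rec.doi:
            passthrough.append(rec)
            continue
        key = rec.doi.lower()  # DOIs are case-insensitive
        current = best.get(key)
        # Replace only when the newcomer is fetchable and the incumbent is not.
        if current is None or (rec.files and not current.files):
            best[key] = rec
    return list(best.values()) + passthrough


hits = [
    Record("datacite:10.5281/zenodo.1", doi="10.5281/ZENODO.1"),
    Record("zenodo:1", doi="10.5281/zenodo.1", files=["table.csv"]),
    Record("geo:GSE1"),
]
unique = dedupe_by_doi(hits)
```

In this sketch the Zenodo record replaces the bare DataCite metadata record because it carries files, while the GEO record passes through untouched since it has no DOI to compare.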

→ Full rationale and a comparison vs. single-source servers, breadth gateways, and
ML-dataset tools: **[docs/POSITIONING.md](docs/POSITIONING.md)**.

## ⚡ Quickstart

Run with no install:

```bash
uvx data-aggregator-mcp
```

Register with Claude Code:

```bash
claude mcp add data-aggregator -- uvx data-aggregator-mcp
```

A typical agent flow:

```text
search("drought stress RNA-seq", organism="Sorghum bicolor")
  → [ geo:GSE..., sra:SRX..., zenodo:..., pubmed:... ]   # deduped, taxa-normalized

resolve("sra:SRX079566")
  → DataResource{ files: [ENA FASTQ urls…], access: "open", taxa: [...] }

fetch("sra:SRX079566", dest="./data")
  → ["./data/SRX079566_1.fastq.gz", …]   # md5-verified
```
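On the wire, each step of the flow above is an MCP `tools/call` request, which is a JSON-RPC 2.0 message. The sketch below only builds those messages; it does not start the server or perform the `initialize` handshake that a real session needs first. The helper name `tool_call` is hypothetical, and the argument names are taken from the tool signatures in this README.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids must be unique within a session


def tool_call(name, **arguments):
    """Serialize one MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)


requests = [
    tool_call("search", query="drought stress RNA-seq", organism="Sorghum bicolor"),
    tool_call("resolve", id="sra:SRX079566"),
    tool_call("fetch", id="sra:SRX079566", dest="./data"),
]
```

In practice an MCP client library sends these for you; the sketch is only meant to show what the three calls look like as messages.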

Other ways to run (pip, python -m, raw client config)

```bash
pip install data-aggregator-mcp
data-aggregator-mcp # or: python -m data_aggregator_mcp
```

To use the `operate` tool (query remote tabular files in place), install the
optional extra:

```bash
pip install "data-aggregator-mcp[operate]"
```

Add to a client's MCP config (e.g. Claude Desktop `claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "data-aggregator": {
      "command": "uvx",
      "args": ["data-aggregator-mcp"],
      "env": { "NCBI_API_KEY": "your-optional-key" }
    }
  }
}
```

## 🗂️ Sources

| Source                       | Discover |       Fetch       |     Checksum     |
| ---------------------------- | :------: | :---------------: | :--------------: |
| Zenodo                       |    ✅    |        ✅         |       md5        |
| DataCite → Figshare          |    ✅    |        ✅         |       md5        |
| DataCite → Dataverse         |    ✅    |        ✅         |       md5        |
| DataCite → OSF               |    ✅    |        ✅         |       md5        |
| DataCite → Dryad             |    ✅    |  manifest only¹   | sha-256 (listed) |
| DataCite → Mendeley & others |    ✅    |         -         |        -         |
| NCBI SRA                     |    ✅    |  ✅ (ENA FASTQ)   |       md5        |
| NCBI GEO                     |    ✅    |   ✅ (`suppl/`)   |      none²       |
| NCBI BioProject              |    ✅    |    → SRA links    |        -         |
| PubMed / OpenAIRE            |    ✅    | ✅ (OA full text) |      none²       |
| HuggingFace datasets         |    ✅    | ✅ (resolve URL)  |       none       |
| DataONE (eco/env)            |    ✅    | ✅ (Member Node)  |  md5 / sha-256   |
| OmicsDI → PRIDE              |    ✅    |  ✅ (HTTPS FTP)   |    size only     |
| OmicsDI → MetaboLights       |    ✅    |  ✅ (HTTPS FTP)   |       none       |
| OmicsDI → other MS repos     |    ✅    |         -         |        -         |

¹ Dryad downloads are token / bot-challenge gated, so `fetch` fails loud;
`resolve` still lists the files.

² No upstream checksum; `fetch` verifies content-type instead (rejects an HTML
page served in place of a binary).
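The content-type check in footnote ² can be approximated by inspecting the first bytes of a download. The following is a minimal sketch under that assumption; `looks_like_html` and `check_declared_binary` are hypothetical names, and the project's real sniff may use different markers or rules.

```python
HTML_MARKERS = (b"<!doctype html", b"<html", b"<head", b"<body")


def looks_like_html(head):
    """Return True if the first bytes of a payload look like an HTML page."""
    # Skip a UTF-8 byte-order mark and leading whitespace before matching.
    probe = head[:512].lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    return probe.startswith(HTML_MARKERS)


def check_declared_binary(head, declared_type):
    """Raise if a payload declared as non-HTML actually starts like HTML."""
    if not declared_type.startswith("text/html") and looks_like_html(head):
        raise ValueError(
            f"declared {declared_type!r} but the payload looks like an HTML page"
        )


# A real PDF header passes; a paywall page labelled as a PDF does not.
check_declared_binary(b"%PDF-1.7\n...", "application/pdf")
try:
    check_declared_binary(b"\n<!DOCTYPE html><html>Sign in</html>", "application/pdf")
    rejected = False
except ValueError:
    rejected = True
```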

## 🛠️ Tools

### `search(query?, size?, sources?, organism?, kind?, published_after?, published_before?, rank?, cursor?)`

Fan out across all wired sources in parallel and return compact `DataResource`
records, deduped by DOI. Per-source failures land in `errors{}` and are never
silently dropped.
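One way to get this behaviour is to run every source concurrently and turn exceptions into entries of an `errors` mapping instead of letting them abort the whole search. The sketch below uses `asyncio.gather(..., return_exceptions=True)`; the `fan_out` function and the two toy sources are hypothetical and are not the server's actual code.

```python
import asyncio


async def fan_out(query, sources):
    """Query every source concurrently; collect results and per-source errors."""
    names = list(sources)
    outcomes = await asyncio.gather(
        *(sources[name](query) for name in names),
        return_exceptions=True,  # a failing source must not cancel the others
    )
    results, errors = [], {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            errors[name] = f"{type(outcome).__name__}: {outcome}"
        else:
            results.extend(outcome)
    return {"results": results, "errors": errors}


async def zenodo(query):
    # Toy source that always succeeds.
    return [{"id": "zenodo:1", "title": query}]


async def flaky(query):
    # Toy source that always fails.
    raise TimeoutError("upstream timed out")


response = asyncio.run(fan_out("sorghum", {"zenodo": zenodo, "flaky": flaky}))
```

Here `response["results"]` still holds the Zenodo hit, and the failure is reported under `response["errors"]["flaky"]` rather than being lost.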

- `organism`: expand the query with NCBI-Taxonomy synonyms; the expansion is
echoed in `taxon_expansion`, and results carry normalized `taxa[]`
(`{taxid, name}`) plus a `described_in` link to plant-genomics-mcp for plant
taxa.
- `sources`: restrict the fan-out, e.g. `["omics"]`.
- `size`: max results (1–50).
- `kind`: keep only `dataset` / `sequencing_run` / `study` / `publication` /
`software`.
- `published_after` / `published_before`: filter by publication year.
- `rank`: `relevance` (default) or `semantic` (re-rank the fetched page by
embedding similarity to the query; needs `EMBEDDING_API_BASE`, degrades to
relevance order otherwise).
- `cursor`: opaque token from a prior result's `next_cursor`; pages forward
across every source. In `cursor` mode the other params are read from the
token, so `query` is optional.
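To make the `cursor` behaviour concrete, here is a sketch of how an opaque token could carry both the original search parameters and a per-source position. The token layout (base64-encoded JSON with `params` and `offsets` keys) is purely an assumption for illustration; the README does not document the real format, and clients should treat the token as opaque.

```python
import base64
import json


def encode_cursor(params, offsets):
    """Pack search params and per-source offsets into an opaque token."""
    payload = json.dumps({"params": params, "offsets": offsets}, sort_keys=True)
    return base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def decode_cursor(token):
    """Recover (params, offsets) from a token made by encode_cursor."""
    payload = json.loads(base64.urlsafe_b64decode(token.encode("ascii")))
    return payload["params"], payload["offsets"]


token = encode_cursor(
    {"query": "drought stress RNA-seq", "size": 20},
    {"zenodo": 20, "omics": 20},
)
params, offsets = decode_cursor(token)
```

Because the parameters travel inside the token, a follow-up call needs only the cursor, which is consistent with `query` being optional in `cursor` mode.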

### `resolve(id, cite?, format?)`

Full record + files manifest. Routes by id shape: `zenodo:7654321`, a bare DOI,
`datacite:10.5061/dryad.x`, an omics id (`sra:SRX079566`, `geo:GSE332789`,
`bioproject:PRJNA1468572`), a literature id (`pubmed:34320281`, `openaire:`),
a HuggingFace id (`hf:owner/name`), a DataONE id (`dataone:doi:10.5063/F1HT2M7Q`),
or an OmicsDI id (`omicsdi:pride:PXD000001`). Attaches, where available:

- **`files[]`**: ENA FASTQ manifest (SRA), GEO `suppl/`, or the host repo's
native manifest (Figshare / Dataverse / OSF / Dryad).
- **`links[]`**: paper → data links. `pubmed:` → `sra:` / `geo:` / `bioproject:`
(NCBI elink); `openaire:` → `datacite:` (ScholeXplorer Scholix).
- **`access` / `license`**: normalized status
(`open` / `embargoed` / `restricted` / `closed` / `unknown`) and license where
the source exposes it.
- **`identifiers`**: normalized `{pmid, pmcid, doi}`, plus an open-access
full-text `FileEntry` (EuropePMC XML, or an Unpaywall PDF fallback) for papers.
- **`citation`**: pass `cite=` as `bibtex`, `ris`, `csl-json`, or any CSL
style name (`apa`, `mla`, `vancouver`, …). DOI records use content
negotiation; others render CSL-JSON from metadata. Off by default; failures
degrade quietly.
- **trust signals**: `metrics` (citations / views / downloads / likes),
`is_latest` / `superseded_by` (derived from version links), and `last_updated`
freshness, where the source provides them.
- **`format`**: pass `format="croissant"` (file-level Croissant JSON-LD) or
`"ro-crate"` (minimal RO-Crate 1.1) to attach a standard manifest under the
matching field, for ML or research-packaging pipelines.
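Routing by id shape can be pictured as a prefix split with a special case for bare DOIs. The sketch below is an assumption about how such a router might look, not the server's implementation: the `route` function, the `"doi"` route name, and the DOI regular expression are all hypothetical, while the prefixes come from the id examples above.

```python
import re

KNOWN_PREFIXES = {
    "zenodo", "datacite", "sra", "geo", "bioproject",
    "pubmed", "openaire", "hf", "dataone", "omicsdi",
}
# Loose pattern for a bare DOI such as 10.5061/dryad.x
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def route(identifier):
    """Split an id into (source, local_id); bare DOIs route as 'doi'."""
    if DOI_RE.match(identifier):
        return "doi", identifier
    # Split on the first colon only, so nested ids keep their own colons.
    prefix, sep, rest = identifier.partition(":")
    if sep and rest and prefix in KNOWN_PREFIXES:
        return prefix, rest
    raise ValueError(f"unrecognized id shape: {identifier!r}")


routed = [
    route("zenodo:7654321"),
    route("10.5061/dryad.x"),
    route("dataone:doi:10.5063/F1HT2M7Q"),
    route("omicsdi:pride:PXD000001"),
]
```

Splitting on the first colon matters for ids like `dataone:doi:10.5063/F1HT2M7Q`, where the remainder is itself a prefixed identifier.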

### `fetch(id, dest?, files?, max_bytes?, force?, extract?)`

Download files to disk and return their paths. Streams under a `max_bytes` guard
(`force` to override) with md5 verification wherever a checksum exists.

- `files`: restrict to a subset of the resolved manifest.
- `extract`: unpack downloaded zip / tar archives in place, guarded against
path traversal and runaway extracted size. Off by default.
- Unverified fetches (GEO `suppl/`, literature full text) get a content-type
sniff that fails loud if a declared binary is actually an HTML page.
- Fetchable: **Zenodo**, **SRA**, **GEO**, **DataONE** (Member-Node objects,
md5/sha-256 verified), DataCite-hosted **Figshare** / **Dataverse** / **OSF**,
**HuggingFace** datasets, **PRIDE** / **MetaboLights** (via OmicsDI, unverified),
and **literature** open-access full text. **Dryad**, other DataCite repos, and
other OmicsDI repos (MassIVE / GNPS / ...) are discovery-only and raise
`FetchNotSupportedError`.
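The streaming behaviour described above (size guard plus checksum verification) can be sketched with the standard library alone. This is an illustrative sketch, not the project's code: `stream_to_disk` is a hypothetical helper, the chunks stand in for an HTTP response body, and the 50 MB default is an arbitrary assumption.

```python
import hashlib
import tempfile
from pathlib import Path


def stream_to_disk(chunks, dest, expected_md5=None, max_bytes=50_000_000):
    """Write chunks to dest, enforcing a size cap and an optional md5 check."""
    digest = hashlib.md5()
    written = 0
    try:
        with dest.open("wb") as out:
            for chunk in chunks:
                written += len(chunk)
                if written > max_bytes:
                    raise ValueError(f"download exceeds max_bytes={max_bytes}")
                digest.update(chunk)  # hash incrementally, never buffer the file
                out.write(chunk)
        if expected_md5 and digest.hexdigest() != expected_md5.lower():
            raise ValueError("md5 mismatch: file is corrupt or was replaced")
    except ValueError:
        dest.unlink(missing_ok=True)  # never leave a bad partial file behind
        raise
    return dest


with tempfile.TemporaryDirectory() as tmp:
    body = [b"hello ", b"world"]
    known_md5 = hashlib.md5(b"hello world").hexdigest()
    path = stream_to_disk(body, Path(tmp) / "demo.bin", expected_md5=known_md5)
    saved = path.read_bytes()
```

Hashing while writing keeps memory use flat regardless of file size, and deleting the file on failure means a caller never mistakes a truncated or mismatched download for a good one.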

### `list_sources()`

Wired sources with their capabilities: layer, kinds, supported filters,
fetchability, `operable` flag, id examples, auth, and rate limits.

### `operate(op, id, file?, query?, n?, columns?)`

Inspect or query a remote tabular file (Parquet / CSV / TSV) **without
downloading it**. Addresses a file by catalog `id` + `file` name (defaults to the
first tabular file on the resolved record). Ops:

- `schema`: column names + types (reads the Parquet footer / sniffs the CSV
header; no full load).
- `preview`: a small sample of rows.
- `head`: the first `n` rows (default 20), optionally restricted to `columns`.
- `sql`: a read-only `SELECT` (the file is the view `data`), e.g.
`SELECT col, count(*) FROM data GROUP BY 1`.

Backed by the Parquet footer reader + DuckDB `httpfs` range reads. `sql` runs in
a locked-down DuckDB (read-only, local filesystem disabled, single-SELECT
validation, row / wall-clock caps). Requires the optional `[operate]` extra
(`pip install data-aggregator-mcp[operate]`); without it, `operate` returns a
clear install-the-extra message and the other four tools are unaffected.
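As a rough picture of what "single-SELECT validation" can mean, the sketch below rejects multi-statement input and anything that does not start with `SELECT`. It is a deliberately naive, hypothetical pre-check using only the standard library: it can reject legitimate queries (for example a column named `load`, or a semicolon inside a string literal), and it is no substitute for the database-level lock-down described above.

```python
import re

# Keywords that should never appear in a read-only query (naive word match).
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|copy|install|load|pragma)\b",
    re.IGNORECASE,
)


def validate_select(sql):
    """Return the cleaned statement, or raise ValueError if it is not one SELECT."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("only a single statement is allowed")
    if not re.match(r"select\b", statement, re.IGNORECASE):
        raise ValueError("only SELECT queries are allowed")
    if FORBIDDEN.search(statement):
        raise ValueError("statement contains a forbidden keyword")
    return statement


ok = validate_select("SELECT col, count(*) FROM data GROUP BY 1;")
```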

Any HuggingFace dataset with a datasets-server converted view is operable
(`schema` / `preview` / `head` / `sql`): `resolve` surfaces the auto-converted
Parquet files (`source="hf-datasets-server"`) even for datasets stored as
JSON/JSONL/arrow, so pass `file=//...parquet` to pick a split when
there are several.

### Prompts

Three workflow prompts surface in clients (e.g. `/mcp__data_aggregator__*` in
Claude Code):

- **`find_data`**: find datasets for a topic, optionally scoped to an organism.
- **`data_behind_paper`**: find the datasets / accessions behind a paper.
- **`search_resolve_fetch`**: walk the end-to-end search → resolve → fetch flow.

## ⚙️ Configuration

Both optional, set via environment variables:

- `NCBI_API_KEY`: raises the NCBI E-utilities rate limit (3 → 10 req/s) used by
the omics, literature, and taxonomy lookups.
- `UNPAYWALL_EMAIL`: enables the Unpaywall fallback leg of literature full-text
retrieval (the EuropePMC leg works without it).
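The effect of the two variables can be summarized in a few lines. The `load_config` function and its return keys below are hypothetical and only restate the behaviour described above (3 vs. 10 requests per second, Unpaywall leg on or off); they are not the server's actual configuration code.

```python
import os


def load_config(env=None):
    """Derive runtime settings from the two optional environment variables."""
    env = os.environ if env is None else env
    api_key = env.get("NCBI_API_KEY") or None      # treat empty string as unset
    email = env.get("UNPAYWALL_EMAIL") or None
    return {
        "ncbi_api_key": api_key,
        # NCBI E-utilities allow 3 req/s without a key and 10 req/s with one.
        "ncbi_requests_per_second": 10 if api_key else 3,
        "unpaywall_enabled": email is not None,
    }


defaults = load_config({})
with_key = load_config({"NCBI_API_KEY": "your-optional-key"})
```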

## 🧪 Develop

```bash
uv venv && uv pip install -e ".[dev]"
uv run pytest -q
uv run ruff check src tests
DATA_AGGREGATOR_MCP_LIVE=1 uv run pytest -k live -q # real-API probes
```

The README demo (`examples/assets/demo.svg`) is recorded network-free from
`examples/_demo_stdio.py`; see the header of that file to re-record.

## License

MIT; see [LICENSE](LICENSE).