https://github.com/scverse/202603_hackathon_proteomics
- Host: GitHub
- URL: https://github.com/scverse/202603_hackathon_proteomics
- Owner: scverse
- License: bsd-3-clause
- Created: 2026-03-26T14:58:49.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-03-30T14:05:32.000Z (2 months ago)
- Last Synced: 2026-03-30T14:26:45.834Z (2 months ago)
- Language: Python
- Size: 30.3 KB
- Stars: 6
- Watchers: 0
- Forks: 3
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
# README
Related [GitHub Issue](https://github.com/scverse/mudata/issues/111)
## Description
> Build a scverse-native data format that accounts for the hierarchical nature of quantification in MS-based proteomics.
### Problem Statement
In LC/MS-proteomics, there is a naturally arising hierarchical feature structure:
- At the lowest level, mass spectrometry (MS) instruments detect and quantify _fragments_ of charged peptides (_precursors_). The precursor-level data is relatively large (N samples x ~100 000 features).
- Proteomics search engines identify the precursor sequences and match them to their corresponding proteins. Ultimately, search engines derive _protein_-specific intensities (N samples x 3000-10000 features).
The key challenge is that there exists an N:M relationship between precursors and proteins, i.e. many precursors can map to one protein and sometimes a precursor could potentially be derived from different (homologous) proteins.
The main extension over existing data containers such as mudata would be to formalize the mapping between the fundamental units of quantification in MS-proteomics (fragments, precursors) and the higher-level, biologically more relevant aggregated features (peptides, proteins, genes).
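To make the N:M relationship concrete, here is a minimal sketch in plain Python. It is not the format proposed in the RFC: the identifiers, the `PRECURSOR_PROTEIN_PAIRS` list, and the `build_lookups` helper are all hypothetical and chosen only for illustration. The mapping is stored as a flat list of (precursor, protein) pairs, from which both lookup directions can be derived.

```python
from collections import defaultdict

# Hypothetical identifiers, for illustration only.
PRECURSOR_PROTEIN_PAIRS = [
    ("PEPTIDEA2", "P1"),
    ("PEPTIDEB2", "P1"),
    ("PEPTIDEC3", "P1"),
    ("PEPTIDEC3", "P2"),  # the same precursor maps to two (homologous) proteins
    ("PEPTIDED2", "P2"),
]


def build_lookups(pairs):
    """Return (protein -> precursors, precursor -> proteins) as dicts of sorted lists."""
    by_protein = defaultdict(set)
    by_precursor = defaultdict(set)
    for precursor, protein in pairs:
        by_protein[protein].add(precursor)
        by_precursor[precursor].add(protein)
    return (
        {key: sorted(values) for key, values in by_protein.items()},
        {key: sorted(values) for key, values in by_precursor.items()},
    )


by_protein, by_precursor = build_lookups(PRECURSOR_PROTEIN_PAIRS)

# Precursors that map to more than one protein are the ambiguous ("shared") ones.
shared = sorted(p for p, proteins in by_precursor.items() if len(proteins) > 1)

print(by_protein["P1"])
print(shared)
```

In a real container the same pairs could plausibly be held as a sparse precursors x proteins indicator matrix; the pair list above is only the simplest form that preserves both directions of the mapping.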
- [x] Implement an RFC (see rfc/RFC.md).
- [x] Implement the prototypes of the data structure that have been proposed in an scverse (i.e. anndata/mudata) compatible manner.
- [ ] **Application**: Implement a related, simple downstream analysis that builds on the data format to get an intuition for the API (e.g. “Plot the distribution of all precursor intensities that correspond to a specific protein”)
- [ ] **Data ingestion**: Implement one proof-of-principle reader from a quantification pipeline/search engine output (e.g. QuantMS, DIANN, alphadia) to the data container.
- [ ] **Application**: Aggregate a low-level feature level (e.g. precursors) to a higher-level feature level (e.g. genes)
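As a rough intuition for the two application tasks, the sketch below uses plain Python on toy data. Everything in it is an assumption made for illustration: the `INTENSITIES` and `PRECURSOR_TO_PROTEINS` dictionaries, the helper names, the use of proteins as the higher level, and the choice of summation with an option to skip shared precursors. It is not the aggregation method of this project or of any search engine.

```python
# Hypothetical toy data: INTENSITIES[sample][precursor]; None marks a missing value.
INTENSITIES = {
    "sample_1": {"prec_a": 10.0, "prec_b": 20.0, "prec_c": 5.0},
    "sample_2": {"prec_a": 12.0, "prec_b": None, "prec_c": 7.0},
}

# Hypothetical mapping; prec_c is shared between prot_1 and prot_2.
PRECURSOR_TO_PROTEINS = {
    "prec_a": ["prot_1"],
    "prec_b": ["prot_1"],
    "prec_c": ["prot_1", "prot_2"],
}


def precursor_intensities(protein, sample):
    """Observed intensities of all precursors mapped to `protein` in `sample`."""
    values = {}
    for precursor, proteins in PRECURSOR_TO_PROTEINS.items():
        value = INTENSITIES[sample].get(precursor)
        if protein in proteins and value is not None:
            values[precursor] = value
    return values


def aggregate_sum(protein, unique_only=True):
    """Sum precursor intensities per sample; optionally skip shared precursors."""
    result = {}
    for sample in INTENSITIES:
        total = 0.0
        for precursor, value in precursor_intensities(protein, sample).items():
            if unique_only and len(PRECURSOR_TO_PROTEINS[precursor]) > 1:
                continue  # ambiguous precursor, ignored in this mode
            total += value
        result[sample] = total
    return result


print(precursor_intensities("prot_1", "sample_2"))
print(aggregate_sum("prot_1"))
print(aggregate_sum("prot_1", unique_only=False))
```

The values returned by `precursor_intensities` are what one would feed into a histogram for the "distribution of precursor intensities per protein" task. How shared precursors and missing values should be treated is a design decision that the data format needs to make explicit.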
## Support
## Get started
We recommend that contributors familiarize themselves with the [mudata](https://mudata.readthedocs.io/stable/notebooks/nuances.html) documentation and API.
See also [QFeatures](https://rformassspectrometry.github.io/QFeatures/articles/QFeatures.html) for a conceptually similar R-package and [alphaquant](https://github.com/MannLabs/alphaquant.git) for a potential future application of the mapping approach.
## Example data
Use the download script to obtain example PSM reports.
```shell
bash download.sh
# download real world data
bash data/download.sh -o data/ albrecht2025
# download a minimal dataset
bash data/download.sh -o data/ minimal
```