https://github.com/cognoma/cancer-data

TCGA data acquisition and processing for Project Cognoma

cancer data-acquisition dataphilly gene-expression mutation tcga xena xena-browser


Host: GitHub
URL: https://github.com/cognoma/cancer-data
Owner: cognoma
License: other
Created: 2016-07-14T16:28:42.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2018-04-19T18:11:41.000Z (about 6 years ago)
Last Synced: 2024-03-16T04:42:17.383Z (3 months ago)
Topics: cancer, data-acquisition, dataphilly, gene-expression, mutation, tcga, xena, xena-browser
Language: Jupyter Notebook
Size: 19.4 MB
Stars: 21
Watchers: 11
Forks: 28
Open Issues: 13
Metadata Files:
- Readme: README.md
- License: LICENSE-BSD.md

Lists

awesome-biomedical-machine-learning - cancer-data: TCGA data acquisition and processing for Project Cognoma

README

# Cancer data acquisition and processing for Project Cognoma

This is a mixed notebook and data repository for retrieving cancer data for [Project Cognoma](https://github.com/cognoma/cognoma).

Currently, all data is from the [TCGA Pan-Cancer collection](https://xenabrowser.net/datapages/?cohort=TCGA%20PanCanAtlas "UCSC Xena Browser cohort: TCGA PanCanAtlas") of the UCSC Xena Browser.

## Workflow

The data acquisition and analysis are executed by running Jupyter notebooks in the following order:

+ [`1.TCGA-download.ipynb`](1.TCGA-download.ipynb) — download and compress TCGA datasets.

+ [`2.TCGA-process.ipynb`](2.TCGA-process.ipynb) — convert downloaded TCGA datasets into sample × gene matrixes.

The [`execute.sh`](execute.sh) script executes the notebooks in order.

After installing and activating the [environment](#environment), run with the command `bash execute.sh` from the repository's root directory.
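The contents of `execute.sh` are not reproduced in this README. As a rough sketch of the same idea, the following Python runs the two notebooks in order with `jupyter nbconvert`, assuming Jupyter is installed in the active environment. The function names and the choice to execute the notebooks in place are illustrative assumptions, not a description of what the script actually does.

```python
import subprocess

# Notebook names as listed in the Workflow section, in execution order.
NOTEBOOKS = ["1.TCGA-download.ipynb", "2.TCGA-process.ipynb"]


def build_command(notebook):
    """Return the nbconvert command that executes one notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        notebook,
    ]


def run_notebooks(notebooks=NOTEBOOKS, runner=subprocess.run):
    """Execute notebooks sequentially, stopping at the first failure."""
    for notebook in notebooks:
        # check=True raises CalledProcessError if nbconvert exits non-zero,
        # so a failed download prevents the processing step from running.
        runner(build_command(notebook), check=True)
```

Calling `run_notebooks()` from the repository's root directory would run both steps. The `runner` parameter exists only so the command sequence can be inspected without launching Jupyter.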

## Directories

The repository contains the following directories:

+ [`download`](download) — contains files retrieved from an external location whose content is unmodified.

Large downloaded files are tracked using Git LFS.

Associated metadata files are also retained for versioning.

+ [`data`](data) — contains generated datasets.

The complete matrix files are not currently tracked due to file size, but randomly subsetted versions are available for development use (see [`data/subset`](data/subset)).
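How the subsets in `data/subset` were drawn is not described here. For readers who want a similar development-sized slice of their own, the following is a minimal sketch of reproducible random subsetting over row and column labels. The function name, its parameters, and the use of a fixed seed are assumptions for illustration and may differ from what the notebooks do.

```python
import random


def random_subset(row_ids, column_names, n_rows, n_columns, seed=0):
    """Pick a reproducible random subset of row and column labels."""
    rng = random.Random(seed)  # fixed seed makes the subset repeatable
    # sample() draws without replacement; sorting gives a stable order.
    rows = sorted(rng.sample(list(row_ids), n_rows))
    columns = sorted(rng.sample(list(column_names), n_columns))
    return rows, columns
```

The returned labels can then be used to filter a full matrix down to the selected samples and genes.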

## Download

[![DOI: 10.6084/m9.figshare.3487685](https://img.shields.io/badge/DOI-10.6084/m9.figshare.3487685-blue.svg)](https://doi.org/10.6084/m9.figshare.3487685 "Complete datasets on figshare")

The complete datasets created by this repository (`data/expression-matrix.tsv.bz2` and `data/mutation-matrix.tsv.bz2`) are uploaded to [figshare](https://doi.org/10.6084/m9.figshare.3487685).

Because uploading is a manual process, the latest version on figshare may lag behind this repository. Check the figshare REFERENCES section to see which commit the datasets derive from.
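Once downloaded, the matrices are bz2-compressed tab-separated files. The sketch below reads one using only the Python standard library. It assumes the first row is a header, the first column holds row identifiers, and all remaining cells are numeric. These layout details are assumptions based on the "sample × gene matrix" description rather than a verified file specification, and `read_matrix` is a hypothetical helper name.

```python
import bz2
import csv


def read_matrix(path):
    """Read a bz2-compressed TSV into (column_names, {row_id: values}).

    Assumes the first row is a header and the first column holds row
    identifiers (sample IDs, if rows are samples).
    """
    with bz2.open(path, mode="rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        columns = header[1:]  # skip the label of the ID column
        rows = {}
        for record in reader:
            # Convert every cell after the ID to a float.
            rows[record[0]] = [float(value) for value in record[1:]]
    return columns, rows
```

The complete matrices are large, so holding them in plain Python lists may use a lot of memory. If pandas is available in the environment, `pandas.read_csv(path, sep="\t", index_col=0)` is a more practical alternative, since it infers bz2 compression from the file extension.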

## Environment

This repository uses [conda](https://conda.io/docs/) to manage its environment, which is named `cognoma-cancer-data`.

The required packages and versions are listed in [`environment.yml`](environment.yml).

If you require an additional package as a developer, add it to `environment.yml`.

The following commands install and activate the environment:

```sh
# Create the cognoma-cancer-data conda environment from environment.yml
conda env create --file=environment.yml

# Activate the conda environment (assumes conda >= 4.4)
conda activate cognoma-cancer-data
```

## License

This repository is dual-licensed under [BSD 3-Clause](LICENSE-BSD.md) and [CC0 1.0](LICENSE-CC0.md), meaning any repository content can be used under either license.

This licensing arrangement ensures source code is available under an [OSI-approved License](https://opensource.org/licenses/alphabetical), while non-code content — such as figures, data, and documentation — is maximally reusable under a public domain dedication.