Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/broadinstitute/depmap_omics
What you need to process the Quarterly DepMap-Omics releases from Terra
- Host: GitHub
- URL: https://github.com/broadinstitute/depmap_omics
- Owner: broadinstitute
- Created: 2019-05-08T19:55:03.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-12-13T20:57:40.000Z (21 days ago)
- Last Synced: 2024-12-13T21:28:04.231Z (21 days ago)
- Topics: cancer-genomics, cloud-computing, data-science, depmap
- Language: HTML
- Homepage: https://depmap.org/portal/
- Size: 418 MB
- Stars: 112
- Watchers: 17
- Forks: 22
- Open Issues: 6
Metadata Files:
- Readme: README.md
README
# depmap_omics
![](documentation/depmap-logo_white.png)
This repository contains code that processes data for the biannual [DepMap](https://www.depmap.org) data release. The state of the pipeline for each release can be found under the "Releases" tab in this repo.
## Table of Contents
- [Getting Started](#quickstart)
- [Installation](#installation)
- [Repository File Structure](#file-structure)
- [Running the Pipeline](#running-pipeline)
- [Uploading and Preprocessing](#upload-preprocess)
- [Running Terra Pipelines](#running-terra-pipelines)
- [Downloading and Postprocessing](#downloading-postprocessing)
- [QC, Grouping and Uploading](#qc-grouping-uploading)

## Getting Started

The processing pipeline relies on the following tools:
- [python](https://www.learnpython.org/)
- [R](https://www.codecademy.com/learn/learn-r)
- [jupyter](https://jupyter.org/index.html)
- [WDL](https://software.broadinstitute.org/wdl/documentation/)
- [gcp](https://cloud.google.com/sdk/docs/quickstart-macos)
- [docker](https://docs.docker.com/get-started/)
- [Terra](https://software.broadinstitute.org/firecloud/documentation/)
- [The Terra Convention: The dos and don'ts for maintaining a cleaner Terra.](https://docs.google.com/document/d/1zTtaN-Px64f8JvgydZNdBbzBpFWyZzEpshSNxQh43Oc/edit#heading=h.dz5wh0l4bu9g)
- [dalmatian](https://github.com/broadinstitute/dalmatian)

### Installation

`git clone http://github.com/BroadInstitute/depmap_omics.git && cd depmap_omics`
`pip install -e .`
### :warning: This repository needs other repos
Some important data and code come from the [genepy library](https://github.com/broadinstitute/genepy).
Follow the instructions on the genepy page to install that package.
### :warning: You need the following R and Python packages
1. You will need to install Jupyter Notebooks and the Google Cloud SDK:
  - install the [Google Cloud SDK](https://cloud.google.com/sdk/docs/downloads-interactive).
  - authenticate your SDK account by running `gcloud auth application-default login` in the terminal and follow the instructions to log in.
2. You will also need several R packages, including GSVA for ssGSEA. From the terminal, run `R -e 'if(!requireNamespace("BiocManager", quietly = TRUE)){install.packages("BiocManager")};BiocManager::install(c("GSEABase", "erccdashboard", "GSVA", "DESeq2"));'`
3. For Python, use the requirements.txt file: `pip install -r requirements.txt`
### :warning: Follow instructions [here](documentation/getting_started.md) to set up Terra and obtain access to services required for running the pipeline.
## Repository File Structure

__ccle_tasks/__ Contains a notebook for each of the additional processing steps the CCLE team performs, as well as one-off tasks run by the omics team
__data/__ Contains important information used for processing, including terra workspace configurations from past quarters
__depmapomics/__ Contains the core python code used in the pipeline and called by the processing jupyter notebooks
__\*\_pipeline/__ Contains some of the workflows' wdl files and script files used by these workflows
__temp/__ Contains temporary files that can be removed after processing (should be empty)
__documentation/__ Contains some additional files and diagrams for documenting the pipelines
__tests/__ Contains automated pytest functions used internally for development
__jupyter notebooks:__ `RNA_CCLE.ipynb` contains the DepMap processing pipelines for Expression and Fusion (from RNAseq data), and `WGS_CCLE.ipynb` contains the DepMap processing pipelines for Copy number and Mutations (from WGS/WES data)
## Running the Pipeline

The processing pipelines are encapsulated in two jupyter notebooks (`RNA_CCLE.ipynb` and `WGS_CCLE.ipynb`). Each is divided into four steps: uploading and preprocessing, running Terra pipelines, downloading and postprocessing, and QC/grouping/uploading. Here is a detailed walkthrough (_note that the steps marked "internal only" are run as part of DepMap's data processing, but are not meant for external users to reproduce due to various dependencies that are unique to our team at the Broad. The "internal only" functions below can be found in the [depmap_omics_upload repo](https://github.com/broadinstitute/depmap_omics_upload)_):
### 1. Uploading and Preprocessing (internal only)
Currently, sequenced data for DepMap is generated by the Genomics Platform (GP) at the Broad, which deposits it into several different Terra workspaces. Therefore, the first step of this pipeline is to look at these workspaces and
- identify new samples by looking at the bam files and comparing them with bams we have already onboarded
- remove duplicates and ones with broken file paths
- onboard new samples and new versions of old cell lines if we find any

### 2. Running Terra Pipelines
We are using Dalmatian to send requests to Terra, so before running this part, external users need to make sure that the dalmatian `WorkspaceManager` object is initialized with the right workspace and that the functions take the correct workflow names as inputs.
You can then run the RNAseq and/or WGS pipelines on your samples.

**For more in-depth documentation on what our pipelines contain, including the packages, input references, and parameters, please refer to this [summary of DepMap processing pipeline](documentation/DepMap_processing_pipeline.md).**
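A minimal sketch of that setup, assuming a hypothetical workspace and workflow configuration name (the exact signature of the submission call can differ between dalmatian versions):

```python
import dalmatian as dm

# Hypothetical Terra workspace; replace with your own "namespace/workspace".
wm = dm.WorkspaceManager("my-namespace/my-depmap-omics-workspace")

# Sanity check: inspect the sample table to confirm this is the right workspace.
samples = wm.get_samples()
print(samples.head())

# Submit the configured workflow on a sample set. "RNA_pipeline" and
# "all_samples" are placeholders; argument order may vary with dalmatian version.
wm.create_submission("RNA_pipeline", "all_samples", etype="sample_set")
```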
### 3. Downloading and Postprocessing (sections under **on local** in the notebooks)
This step performs a set of tasks:
- clean the workspaces by deleting large, unneeded files, including unmapped bams.
- retrieve interesting QC results from the workspace.
- copy realigned bam files to our own data storage bucket (internal only).
- download the outputs from the Terra pipelines (see the sketch below).
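As a rough sketch of what downloading outputs can look like for external users (the output column name and local paths are placeholders; gsutil must be installed and authenticated via the Google Cloud SDK):

```python
import os
import subprocess

import dalmatian as dm

wm = dm.WorkspaceManager("my-namespace/my-depmap-omics-workspace")

# After the workflows finish, their outputs appear as columns on the sample table.
samples = wm.get_samples()

os.makedirs("outputs", exist_ok=True)
# "rsem_genes" is a placeholder column; use the output column your workflow produces.
for sample_id, gs_url in samples["rsem_genes"].dropna().items():
    subprocess.run(
        ["gsutil", "cp", gs_url, f"outputs/{sample_id}_rsem_genes.results"],
        check=True,
    )
```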
The main postprocessing steps for each pipeline are as follows:

#### Copy Number
`copynumbers.py` contains the main postprocessing function `postProcess()` responsible for postprocessing segments and creating gene-level (relative and absolute) CN files and a genomic feature table. Gene mapping information is retrieved from BioMart version `nov2020`. The function also applies the following filters to segment and CN data:
* Remove chrY segments from cell lines whose chrY segment count is greater than 150
* Mark samples that have more than 1500 segments as QC failures and remove them
* Remove genes whose Entrez ID is NaN in BioMart from the gene-level matrices

_Internal only: `dm_omics.cnPostProcessing()` calls the above function on both WES and WGS data, merges them, renames the indices into ProfileIDs, and uploads them to taiga._
Note: to get the exact same results as in DepMap, be sure to apply `genecn = genecn.apply(lambda x: np.log2(1+x))` to the genecn dataframe in the CNV pipeline
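For instance, assuming `genecn` is a samples-by-genes pandas DataFrame of relative copy-number values (toy data below), the transform is:

```python
import numpy as np
import pandas as pd

# Toy samples-by-genes relative copy-number matrix.
genecn = pd.DataFrame(
    {"GENE_A": [0.9, 2.1], "GENE_B": [1.0, 0.4]},
    index=["ACH-000001", "ACH-000002"],
)

# Same log2(1 + x) transform applied to the DepMap gene-level CN matrix.
genecn = genecn.apply(lambda x: np.log2(1 + x))
print(genecn)
```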
#### Mutation
`mutations.py` contains `postProcess()`, the function responsible for postprocessing aggregated MAF files, genotyped mutation matrices (hot spot and damaging), binary guide mutation matrices, and structural variants (SVs).
_Internal only: `dm_omics.mutationPostProcessing()` calls the above function on both WES and WGS data, merges them, renames the indices into ProfileIDs, removes genes whose hugo symbol is not in biomart, generates individual mutation datasets for variant types, and uploads them to taiga. It also generates and uploads a binary matrix for germline mutations._
#### Expression
`expressions.py` contains the main postprocessing function responsible for postprocessing aggregated expression data from RSEM, which removes duplicates and QC failures, renames genes, filters and log transforms values, and generates transcript-level, gene-level, and protein-coding gene-level expression data files. Gene mapping information is retrieved from BioMart version `nov2020`. Optionally, it also generates Single-sample GSEA (ssGSEA) data.
_Internal only: `dm_omics.expressionPostProcessing()` is a wrapper for the above function. It renames the indices into ProfileIDs and uploads the files to taiga._
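As an illustration of the log transform mentioned above (a sketch only; the real `postProcess()` also handles gene renaming, duplicate/QC-failure removal, and ssGSEA), the released expression matrices are log2(TPM + 1):

```python
import numpy as np
import pandas as pd

# Toy samples-by-genes matrix of RSEM TPM values.
tpm = pd.DataFrame(
    {"TP53 (7157)": [12.0, 0.0], "EGFR (1956)": [3.5, 150.2]},
    index=["ACH-000001", "ACH-000002"],
)

# Log transform used for the released expression matrices.
log_tpm = np.log2(tpm + 1)
print(log_tpm)
```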
#### Fusion
Functions that postprocess aggregated fusion data can be found in `fusions.py`. We want to apply filters to the fusion table to reduce the number of artifacts in the dataset. Specifically, we apply the following filters (an illustrative sketch follows this list):
* Remove fusions involving mitochondrial chromosomes, or HLA genes, or immunoglobulin genes
* Remove red herring fusions (from STAR-Fusion annotations column)
* Remove fusions recurrent in CCLE (>= 25 samples)
* Remove fusions that have (SpliceType=" INCL_NON_REF_SPLICE" AND LargeAnchorSupport="No" AND FFPM < 0.1)
* Remove fusions with FFPM < 0.05 (STAR-Fusion suggests using 0.1, but looking at the translocation data, this looks like it might be too aggressive)

_Internal only: `dm_omics.fusionPostProcessing()` is a wrapper for the above function. It renames the indices into ProfileIDs and uploads the data to taiga._
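A hedged sketch of what these filters can look like on a STAR-Fusion-style table; column names such as `FFPM`, `SpliceType`, `LargeAnchorSupport`, and the precomputed `CCLE_count` are assumptions for illustration, not the exact columns used in `fusions.py` (the chromosome/gene-family filters are omitted for brevity):

```python
import pandas as pd

def filter_fusions(fusions: pd.DataFrame) -> pd.DataFrame:
    """Illustrative fusion filtering; not the exact pipeline implementation."""
    low_support = (
        (fusions["SpliceType"] == "INCL_NON_REF_SPLICE")
        & (fusions["LargeAnchorSupport"] == "No")
        & (fusions["FFPM"] < 0.1)
    )
    keep = (
        ~low_support
        & (fusions["FFPM"] >= 0.05)        # drop very lowly expressed fusions
        & (fusions["CCLE_count"] < 25)     # drop fusions recurrent across CCLE
    )
    return fusions[keep]
```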
### 4. QC, Grouping and Uploading to the Portal (internal use only)
We then perform the following QC tasks for each dataset:
#### CN
Once the CN files are saved, we load them back into python and run some validations; in brief:
- mean, max, var...
- comparison to the previous release: same mean, max, var...
- checkAmountOfSegments: flag any samples with a very high number of segments (see the sketch below)
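A minimal sketch of that segment-count check, assuming a segments table with one row per segment and a `DepMap_ID` column (names and threshold are illustrative):

```python
import pandas as pd

def check_amount_of_segments(segments: pd.DataFrame, max_segments: int = 1500) -> list:
    """Flag samples with a suspiciously high number of segments (illustrative only)."""
    counts = segments.groupby("DepMap_ID").size()
    return counts[counts > max_segments].index.tolist()
```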
#### Mutation

__Compare to previous release (broad only)__
We compare the results to the previous release's MAF. Namely:
- Count the total number of mutations per cell line, split by type (SNP, INS, DEL)
- Count the total number of mutations observed by position (group by chromosome, start position, and end position, and count the number of mutations)

##### REMARK:
Overall the filters applied after the CGA pipeline are the following:
We remove everything that (see the sketch after this list):
- has AF < 0.1
- OR coverage < 4
- OR alt cov = 1
- OR is not in coding regions
- OR is in ExAC with a frequency of > 0.005%
  - except if it is either
    - in TCGA > 3 times
    - OR in COSMIC > 10 times
    - AND in a set of known cancer regions
- OR exists in > 5% of the CCLE samples
  - except if they are in TCGA > 5 times
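Read as boolean logic, one way to interpret the nesting above is the sketch below; the variant field names (`AF`, `coverage`, `alt_count`, `is_coding`, `exac_af`, `tcga_count`, `cosmic_count`, `in_cancer_region`, `ccle_fraction`) are placeholders, and the grouping of the rescue clauses is our reading of the list, not the pipeline's literal code:

```python
def is_removed(v: dict) -> bool:
    """Illustrative post-CGA filter; placeholder field names, not the real MAF columns."""
    # Rescue clause for the ExAC filter: seen in TCGA or COSMIC and in known cancer regions.
    rescued_exac = (v["tcga_count"] > 3 or v["cosmic_count"] > 10) and v["in_cancer_region"]
    # Rescue clause for the CCLE-recurrence filter.
    rescued_ccle = v["tcga_count"] > 5

    return (
        v["AF"] < 0.1
        or v["coverage"] < 4
        or v["alt_count"] == 1
        or not v["is_coding"]
        or (v["exac_af"] > 0.00005 and not rescued_exac)   # 0.005% as a fraction
        or (v["ccle_fraction"] > 0.05 and not rescued_ccle)
    )
```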
#### RNA

Once the expression files are saved, we do the following validations:
- mean, max, var...
- comparison to previous release: same mean, max, var...
- we QC on the number of genes with 0 counts for each sample (see the sketch below)
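A minimal sketch of that zero-count check, assuming `counts` is a genes-by-samples matrix of read counts (orientation and names are placeholders):

```python
import pandas as pd

# Toy genes-by-samples matrix of expected read counts.
counts = pd.DataFrame(
    {"ACH-000001": [0, 10, 3], "ACH-000002": [0, 0, 0]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)

# Number of zero-count genes per sample; samples with unusually many get flagged for review.
zero_genes_per_sample = (counts == 0).sum(axis=0)
print(zero_genes_per_sample)
```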
After QC, data is uploaded to taiga for all portal audiences according to release dates in Gumbo.

@[jkobject](https://www.jkobject.com)
@gkugener
@gmiller
@5im1z
@[BroadInstitute](https://www.broadinstitute.org)

If you have any feedback or run into any issues, feel free to post an issue on the GitHub repo.