https://github.com/wtsi-hgi/nf_cellbender

Single cell Nextflow cellbender pipeline.
https://github.com/wtsi-hgi/nf_cellbender

Last synced: about 2 months ago
JSON representation

Single cell Nextflow cellbender pipeline.

Host: GitHub
URL: https://github.com/wtsi-hgi/nf_cellbender
Owner: wtsi-hgi
License: gpl-3.0
Created: 2021-02-12T15:41:05.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2022-03-18T15:59:53.000Z (about 3 years ago)
Last Synced: 2025-01-26T18:48:29.411Z (3 months ago)
Language: Python
Size: 177 KB
Stars: 2
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# Description

The methods used in this module are described in `docs/methods.pdf`. TODO: `docs/methods.pdf`

Below is the structure of the results directory. The values that will be listed in `description_of_params` within the directory structure correspond to the various parameters one can set. An example of a paramters file is found in `example_runtime_setup/params.yml`.
```bash
nf-qc_cluster
├── normalization_001::description_of_params
│ ├── [files: data]
│ ├── reduced_dims-pca::description_of_params
│ │ ├── [files: data]
│ │ ├── [plots: umap]
│ │ ├── cluster_001::description_of_params
│ │ │ ├── [files: data,clusters]
│ │ │ ├── [plots: umap]
│ │ │ ├── cluster_markers_001::description_of_params
│ │ │ │ ├── [files: cluster_marker_genes]
│ │ │ │ └── [plots: marker_genes,marker_genes_dotplot]
│ │ │ ├── cluster_markers_002::description_of_params
│ │ │ ... etc. ...
│ │ ├── cluster_002::description_of_params
│ │ ... etc. ...
│ ├── reduced_dims-harmony_001::description_of_params
│ ├── reduced_dims-harmony_002::description_of_params
│ ... etc. ...
├── normalization_002::description_of_norm_params
... etc. ...
└── adata.h5 # concatenated single cell data with no normalization
```

# TODO list

* Add `docs/methods.pdf` file.
* Add brief description of module.

# Enhancement list

* `scanpy_merge-dev.py`: If it were important to have a per sample filter, merge could be re-designed to accommodate this.
* `scanpy_cluster.py`: Currently for clustering, we can change method (leiden or louvain), resolution, and n_pcs. Are there other parameters that need to be scaled over?
* Check phenotypes against predicted sex from gene expression.
* Add basic QC plots - try to do this in R from anndata frame?
* Scrublet functionality + add to metadata + cluster distributions
* Gene scores + add to metadata
* Add marker gene AUC like here http://www.nxn.se/valent/2018/3/5/actionable-scrna-seq-clusters
* Add summary ARI and LISI metrics computed over a list of many different cluster annotations?
* Add tSNE plots - rapid plots with OpenTSNE?
* Calculate marker genes with diffxpy or logreg?

# Quickstart

Quickstart for deploying this pipeline locally and on a high performance compute cluster.

## 1. Set up the environment

Install the required packages via conda:
```bash
# The repo directory.
REPO_MODULE="${HOME}/repo/path/to/this/pipeline"

# Install environment using Conda.
conda env create --name sc_qc_cluster --file ${REPO_MODULE}/env/environment.yml

# Activate the new Conda environment.
source activate sc_qc_cluster

# To update environment file:
#conda env export --no-builds | grep -v prefix | grep -v name > environment.yml
```

## 2. Prepare the input files

Generate and/or edit input files for the pipeline.

The pipeline takes as input:
1. **--file_paths_10x**: Tab-delimited file containing experiment_id and data_path_10x_format columns (i.e., list of input samples). Reqired.
2. **--file_metadata**: Tab-delimited file containing sample metadata. This will automatically be subset down to the sample list from 1. Reqired.
3. **--file_sample_qc**: YAML file containing sample qc and filtering parameters. Optional. NOTE: in the example config file, this is part of the YAML file for `-params-file`.
4. **--genes_exclude_hvg**: Tab-delimited file with genes to exclude from
highly variable gene list. Must contain ensembl_gene_id column. Optional.
5. **--genes_score**: Tab-delimited file with genes to use to score cells. Must contain ensembl_gene_id and score_idvcolumns. If one score_id == "cell_cycle", then requires a grouping_id column with "G2/M" and "S" (see example file in `example_runtime_setup`). Optional.
6. **-params-file**: YAML file containing analysis parameters. Optional.
7. **--run_multiplet**: Flag to run multiplet analysis. Optional.
8. **--file_cellmetadata**: Tab-delimited file containing experiment_id and data_path_cellmetadata columns. For instance this file can be used to pass per cell doublet annotations. Optional.

Examples of all of these files can be found in `example_runtime_setup/`.

## 3. Set up and run Nextflow

Run Nexflow locally (NOTE: if running on a virtual machine you may need to set `export QT_QPA_PLATFORM="offscreen"` for scanpy as described [here](https://github.com/ipython/ipython/issues/10627)):
```bash
# Boot up tmux session.
tmux new -s nf

# Here we are not going to filter any variable genes, so don't pass a file.
# NOTE: All input file paths should be full paths.
nextflow run "${REPO_MODULE}/main.nf" \
-profile "local" \
--file_paths_10x "${REPO_MODULE}/example_runtime_setup/file_paths_10x.tsv" \
--file_metadata "${REPO_MODULE}/example_runtime_setup/file_metadata.tsv" \
--genes_score "${REPO_MODULE}/example_runtime_setup/genes_score_v001.tsv" \
-params-file "${REPO_MODULE}/example_runtime_setup/params.yml"
```

Run Nextflow using LSF on a compute cluster. More on bgroups [here](https://www.ibm.com/support/knowledgecenter/SSETD4_9.1.3/lsf_config_ref/lsb.params.default_jobgroup.5.html).:
```bash
# Set the results directory.
RESULTS_DIR="/path/to/results/dir"
mkdir -p "${RESULTS_DIR}"

# Boot up tmux session.
tmux new -s nf

# Log into an interactive session.
# NOTE: Here we set the -G parameter due to our institute's LSF configuration.
bgadd "/${USER}/logins"
bsub -q normal -G team152 -g /${USER}/logins -Is -XF -M 8192 -R "select[mem>8192] rusage[mem=8192]" /bin/bash
# NOTE: If you are running over many cells, you may need to start an
# interactive session on a queue that allows long jobs
#bsub -q long -G team152 -g /${USER}/logins -Is -XF -M 18192 -R "select[mem>18192] rusage[mem=18192]" /bin/bash

# Activate the Conda environment (inherited by subsequent jobs).
conda activate sc_qc_cluster

# Set up a group to submit jobs to (export a default -g parameter).
bgadd -L 500 "/${USER}/nf"
export LSB_DEFAULT_JOBGROUP="/${USER}/nf"
# Depending on LSF setup, you may want to export a default -G parameter.
export LSB_DEFAULTGROUP="team152"
# NOTE: By setting the above flags, all of the nextflow LSF jobs will have
# these flags set.

# Settings for scanpy (see note above).
export QT_QPA_PLATFORM="offscreen"

# Change to a temporary runtime dir on the node. In this demo, we will change
# to the same execution directory.
cd ${RESULTS_DIR}

# Remove old logs and nextflow output (if one previously ran nextflow in this
# dir).
rm -r *html;
rm .nextflow.log*;

# NOTE: If you want to resume a previous workflow, add -resume to the flag.
# NOTE: If you do not want to filter any variable genes, pass an empty file to
# --genes_exclude_hvg. See previous local example.
# NOTE: --output_dir should be a full path - not relative.
nextflow run "${REPO_MODULE}/main.nf" \
-profile "lsf" \
--file_paths_10x "${REPO_MODULE}/example_runtime_setup/file_paths_10x.tsv" \
--file_metadata "${REPO_MODULE}/example_runtime_setup/file_metadata.tsv" \
--file_sample_qc "${REPO_MODULE}/example_runtime_setup/params.yml" \
--genes_exclude_hvg "${REPO_MODULE}/example_runtime_setup/genes_remove_hvg_v001.tsv" \
--genes_score "${REPO_MODULE}/example_runtime_setup/genes_score_v001.tsv" \
--output_dir "${RESULTS_DIR}" \
--run_multiplet \
-params-file "${REPO_MODULE}/example_runtime_setup/params.yml" \
-with-report "nf_report.html" \
-resume

# NOTE: If you would like to see the ongoing processes, look at the log files.
cat .nextflow.log
```

Example of how one may sync results:
```bash
NF_OUT_DIR="/path/to/out_dir"
rsync -am --include="*.png" --include="*/" --exclude="*" my_cluster_ssh:${NF_OUT_DIR} .
rsync -am --include="*.png" --include="*/" --exclude="*" my_cluster_ssh:${NF_OUT_DIR} .
```

# Notes

* On 10 April 2020, we found nextflow was writing some output into the `${HOME}` directory and had used up the alotted ~15Gb on the Sanger farm. This resulted in a Java error as soon as a nextflow command was executed. Based on file sizes within `${HOME}`, it seemed like the ouput was being written within the conda environment (following `du -h | sort -V -k 1`). By deleting and re-installing the coda environment, the problem was solved. The below flags may help prevent this from the future. In addition, setting the flag `export JAVA_OPTIONS=-Djava.io.tmpdir=/path/with/enough/space/` may also help.

```bash
# To be run from the execution dir, before the above nextflow command
# If you are running this on a cluster, make sure you log into an interactive
# session with >25Gb of RAM.
export NXF_OPTS="-Xms25G -Xmx25G"
export NXF_HOME=$(pwd)
export NXF_WORK="${NXF_HOME}/.nexflow_work"
export NXF_TEMP="${NXF_HOME}/.nexflow_temp"
export NXF_CONDA_CACHEDIR="${NXF_HOME}/.nexflow_conda"
export NXF_SINGULARITY_CACHEDIR="${NXF_HOME}/.nexflow_singularity"
```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/wtsi-hgi/nf_cellbender

Awesome Lists containing this project

README