https://github.com/AfshinLab/BLR
- Host: GitHub
- URL: https://github.com/AfshinLab/BLR
- Owner: AfshinLab
- License: mit
- Created: 2020-02-04T13:35:44.000Z (almost 5 years ago)
- Default Branch: main
- Last Pushed: 2023-12-08T12:56:50.000Z (about 1 year ago)
- Last Synced: 2024-07-31T20:28:39.178Z (6 months ago)
- Language: Python
- Size: 3.9 MB
- Stars: 5
- Watchers: 3
- Forks: 0
- Open Issues: 14
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
Awesome Lists containing this project
- awesome-linked-reads - BLR - End-to-end Snakemake workflow for whole genome haplotyping and structural variant calling from FASTQs from multiple linked-read technologies. ![GitHub last commit](https://img.shields.io/github/last-commit/AfshinLab/BLR?label=%20) (Tools)
README
![BLR logo](./doc/assets/logo_dark.png#gh-dark-mode-only)
![BLR logo](./doc/assets/logo.png#gh-light-mode-only)[![CI](https://github.com/AfshinLab/BLR/actions/workflows/ci.yml/badge.svg)](https://github.com/AfshinLab/BLR/actions/workflows/ci.yml) [![CI macos](https://github.com/AfshinLab/BLR/actions/workflows/ci_macos.yml/badge.svg)](https://github.com/AfshinLab/BLR/actions/workflows/ci_macos.yml)
- [About the pipeline](#About-the-pipeline)
- [Usage](#Usage)
- [Installation](#Installation)
- [Development](#development)
- [Citation](#citation)

## About the pipeline
The BLR pipeline is an end-to-end Snakemake workflow for whole genome haplotyping and structural variant calling from FASTQs, independent of LongRanger. The pipeline allows for input FASTQs from multiple linked-read technologies such as:
- [Droplet Barcode Sequencing (DBS)](doc/platforms.rst#dbs)
- [10x Genomics Chromium Genome](doc/platforms.rst#x-genomics)
- [Universal Sequencing TELL-seq](doc/platforms.rst#tell-seq)
- [MGI stLFR](doc/platforms.rst#stlfr)

Read more about the integrated linked-read platforms [here](doc/platforms.rst).
![BLR pipeline](./doc/assets/pipeline.png)
The BLR pipeline is designed to be flexible and modular, allowing for easy integration of new linked-read technologies and tools. The pipeline is designed to be run in a cluster environment, but can also be run locally.
Outlined below are the main processing steps. The tools in parentheses are those currently implemented for each step, with the preferred tool in *italic*.
- **FASTQ processing** (*tool depends on technology*): This initial step normalizes input FASTQ based on the linked-read technology used. This includes demultiplexing, barcode extraction and filtering as well as adaptor trimming.
- **Mapping** (*EMA*, BWA, minimap2, bowtie2, lariat): The reads are mapped to the reference genome using one of the available mappers.
- **BAM processing** (*BLR/Picard MarkDuplicates*): Collapse overlapping barcodes, mark duplicates, infer molecules (MI-tag) and filter reads.
- **Variant calling** (*DeepVariant*, GATK, FreeBayes, BCFtools): Call and filter short variants.
- **Variant phasing** (*HapCUT2*): Phase variants using the inferred molecules.
- **Haplotag alignments** (*WhatsHap*): Assign haplotype to reads (HP-tag).
- **Structural variant (SV) calling** (*NAIBR*): Call large structural variants (SV).

Statistics are collected using standard tools such as FastQC, Picard and mosdepth as well as custom scripts that are part of BLR. These are then compiled using [MultiQC](https://multiqc.info/) into a final HTML report.
## Usage
- [1. Setup analysis](#1-setup-an-analysis-folder)
- [2. Run analysis](#2-running-an-analysis)
- [3. Test files](#3-test-files)
- [4. Reference genome setup](#4-reference-genome-setup)
- [5. Merging different analysis runs](#5-merging-different-analysis-runs)
- [6. MultiQC plugin](#6-multiqc-plugin)

### 1. Setup an analysis folder
Activate your generated conda environment (see [Installation](#Installation)).
conda activate blr
Create the analysis directory using `blr init`. Choose a name for the analysis, `output_folder` in this example. Specify the library type using the `-l` flag; here we choose `dbs`.
blr init --reads1=path/to/sample.R1.fastq.gz -l dbs path/to/output_folder
Note that BLR expects paired-end reads. However, only the path to the R1 file needs to be provided. The R2 file will be found automatically.
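How BLR locates the R2 file is not spelled out here. As a pre-flight check, one could assume the common convention that the mate file differs only by `R1` versus `R2` in the file name and confirm that it exists before running `blr init`. The helper name `guess_r2` and the naming rule are illustrative assumptions, not part of BLR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive the expected R2 path from an R1 path.
# Assumption: the mate differs only by "R1" -> "R2" in the file name.
# This is NOT taken from BLR's source; it is a common naming convention.
guess_r2() {
    local r1="$1" dir base
    dir=$(dirname -- "$r1")
    base=$(basename -- "$r1")
    case "$base" in
        *R1*) ;;                       # file name contains "R1", continue
        *) echo "no 'R1' in file name: $base" >&2; return 1 ;;
    esac
    # Replace the last occurrence of "R1" in the file name with "R2".
    echo "$dir/${base%R1*}R2${base##*R1}"
}

# Example use (paths are placeholders):
#   r2=$(guess_r2 path/to/sample.R1.fastq.gz)
#   [ -f "$r2" ] || echo "missing mate file: $r2" >&2
```

If the check fails, it is worth confirming how your files are named before initializing the analysis folder.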
Move into your newly created analysis folder.
cd path/to/output_folder
Then, you may need to edit the configuration file `blr.yaml`, in particular
to enter the path to your indexed reference genome (see [Reference genome
setup](#4-reference-genome-setup) for more info).
blr config --set genome_reference path/to/GRCh38.fasta
To see what other configurations can be altered, read the documentation in
the `blr.yaml` file or run `blr config` to print the current configs to the
terminal. Some configurations are specific to the linked-read technology
used for generating the library; more information can be
found [here](doc/platforms.rst).

### 2. Running an analysis
Change the working directory to your analysis folder.
cd path/to/output_folder
The pipeline is launched using the `blr run` command. To run all steps automatically, run:
blr run
For more options, see the documentation.
blr run -h
### 3. Test files
For unit testing we use test files for different platforms. The latest version of these can be downloaded and unpacked using the following commands:
wget -nv https://export.uppmax.uu.se/uppstore2018173/blr-testdata-0.6.tar.gz
tar xf blr-testdata-0.6.tar.gz
ln -s blr-testdata-0.6 blr-testdata
Now unit testing can be run locally from within the BLR directory using:
bash tests/run.sh
This is useful if you want to test your changes locally before submitting them
as a PR.

### 4. Reference genome setup
To run the pipeline you need to provide a path to a FASTA with your reference
genome. The FASTA should be indexed depending on which mapper you wish to
use.

- `bowtie2` uses a `bowtie2`-indexed reference
bowtie2-build genome.fasta genome.fasta
- `bwa`, `minimap2`, `ema` and `lariat` use a `bwa`-indexed reference
bwa index genome.fasta
Additionally, you need to index your FASTA using `samtools faidx` to get the
`genome.fasta.fai` file.
samtools faidx genome.fasta
If using `gatk` for variant calling or doing base recalibration, you will
need to generate a sequence dictionary (`genome.dict` file), which can be done
using:
gatk CreateSequenceDictionary -R genome.fasta
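The indexing rules above can be collected into a small dry-run helper that prints the commands needed for a given mapper. This is only a sketch: `index_commands` and its `yes` switch for the GATK dictionary are hypothetical names, and the function prints commands rather than running them so that they can be reviewed first.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print (not run) the indexing commands for a mapper.
# Usage: index_commands <mapper> <fasta> [yes]
#   The optional third argument "yes" adds the GATK sequence dictionary step.
index_commands() {
    local mapper="$1" fasta="$2" with_dict="${3:-no}"
    case "$mapper" in
        bowtie2)
            echo "bowtie2-build $fasta $fasta" ;;
        bwa|minimap2|ema|lariat)
            echo "bwa index $fasta" ;;
        *)
            echo "unknown mapper: $mapper" >&2
            return 1 ;;
    esac
    # The .fai index is needed regardless of mapper.
    echo "samtools faidx $fasta"
    if [ "$with_dict" = "yes" ]; then
        echo "gatk CreateSequenceDictionary -R $fasta"
    fi
}
```

For example, `index_commands ema genome.fasta yes` lists the three commands for an EMA run with GATK; piping the output to `bash` would execute them.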
### 5. Merging different analysis runs
If you have two or more libraries run on the same sample, it is possible to
merge them in order to increase coverage. First, the analysis should be run
separately for each library. Make sure that a different `sample_nr` (set
using `blr config`) has been assigned to each library so that overlapping
barcodes are not mixed. The files that will be merged from each library are
the filtered BAM (`final.phased.cram`, `final.phased.bam` or `final.bam`), the
molecule stats TSV (`final.molecule_stats.filtered.tsv`) and, for DBS
and TELL-seq libraries, the clustered barcodes (`barcodes.clstr.gz`).

To merge the different runs we initialize a new analysis folder using `blr init`. In this example we have analysed two DBS library runs called `MySample_1` and `MySample_2`. Using the command below we can initialize a new folder called `MySample_merged`.
blr init -w /path/to/MySample_1 -w /path/to/MySample_2 --library-type dbs MySample_merged
Configs can then be updated as usual using `blr config`.
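The distinct `sample_nr` values are needed in the per-library folders, and they have to be in place before those libraries are analysed. A minimal sketch for generating the commands is shown here; `sample_nr_commands` is a hypothetical helper, and numbering the libraries 1, 2, 3, ... is an assumption (any distinct values should serve the same purpose). The commands are printed, not run.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one `blr config` command per library folder,
# giving each library its own sample_nr (1, 2, 3, ... is an assumed scheme).
sample_nr_commands() {
    local nr=1 dir
    for dir in "$@"; do
        # %q quotes the folder name so the printed command is safe to paste.
        printf '(cd %q && blr config --set sample_nr %d)\n' "$dir" "$nr"
        nr=$((nr + 1))
    done
}

# Example (paths are placeholders):
#   sample_nr_commands /path/to/MySample_1 /path/to/MySample_2
```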
In order to merge the files and run the analysis on the merged files, a special subscript needs to be run. This is done by running:
blr run --anew
Using this, the files will be merged and the workflow run from variant calling
onwards.

Note that this approach can also be used to rerun a single sample with
different configurations from variant calling onwards.

### 6. MultiQC plugin
There is a MultiQC plugin included in the BLR pipeline called MultiQC_BLR. If you wish to run MultiQC without this plugin, include `--disable-blr-plugin` in your multiqc command.
The plugin allows for comparison between different runs. In this case go to the directory containing the folders for the runs you wish to compare. Then run:
multiqc -d .
The `-d` option prepends the directory name to each sample allowing differentiation between the runs.
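Before comparing runs, it can help to list which subfolders actually look like BLR analysis folders. The sketch below assumes that each analysis folder contains a `blr.yaml` file, as created by `blr init`; `list_runs` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list subfolders of a directory that contain blr.yaml.
# Assumption: the presence of blr.yaml marks a BLR analysis folder.
list_runs() {
    local root="${1:-.}" d
    for d in "$root"/*/; do
        if [ -f "${d}blr.yaml" ]; then
            basename "$d"
        fi
    done
    return 0
}

# Example: show the runs that `multiqc -d .` would be comparing.
#   list_runs .
```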
## Installation
- [1. Setup conda](#1-setup-conda)
- [2. Create environment and install `blr`](#2-create-environment-and-install-blr)
- [3. Optional installations](#3-optional-installations)
- [4. Reusing conda environments](#4-reusing-conda-environments)

### 1. Setup conda
Install [miniconda](https://docs.conda.io/en/latest/miniconda.html) if not already installed. You could also try copy-pasting the following to your terminal. This will download miniconda and install it to your `$HOME` folder.
if [[ $OSTYPE = "linux-gnu" ]]; then
    wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
elif [[ $OSTYPE = "darwin"* ]]; then
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O miniconda.sh
fi
bash miniconda.sh -b -p $HOME/miniconda
source $HOME/miniconda/etc/profile.d/conda.sh

### 2. Create environment and install `blr`
Clone the BLR repository.
git clone https://github.com/AfshinLab/BLR.git
Install [conda-lock](https://github.com/conda-incubator/conda-lock#installation).
conda install -c conda-forge conda-lock
Create a conda environment, in which all dependencies will be installed. It
is recommended to use one of the OS locked files, i.e. `environment.linux-64.lock`
for linux or `environment.osx-64.lock` for mac, for reproducibility. One can also use the non-lock `environment.yml` file but this may introduce non-tested versions of software into the environment, so use with caution. For linux use the following to install and activate the environment.
conda create --name blr --file environment.linux-64.lock
conda activate blr
Install `blr` into the environment.
pip install .
For development it can be useful to install `blr` in editable mode; in this case use `pip install -e .`. This will install blr in such a way that you can still modify the source code and get any changes immediately without re-installing.
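Choosing the lock file by platform can be scripted. The sketch below maps bash's `$OSTYPE` to the two lock files named above; `lock_file_for` is a hypothetical helper, and using the same `conda create` command with the macOS lock file is an assumption based on the Linux example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: pick the conda lock file for the current platform.
# Takes an OSTYPE-style string; defaults to bash's own $OSTYPE.
lock_file_for() {
    local ostype="${1:-$OSTYPE}"
    case "$ostype" in
        linux*)  echo "environment.linux-64.lock" ;;
        darwin*) echo "environment.osx-64.lock" ;;
        *)
            echo "no lock file for platform: $ostype" >&2
            return 1 ;;
    esac
}

# Example, run from the cloned BLR directory:
#   conda create --name blr --file "$(lock_file_for)"
```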
### 3. Optional installations
Here are some optional installs that are required if a specific software is requested.
#### 3.1 DeepVariant
To enable [DeepVariant](https://github.com/google/deepvariant), install it separately to your environment. Note that it is currently only available for linux.
conda activate blr
conda install deepvariant
To use DeepVariant for variant calling in your analysis, run:
blr config --set variant_caller deepvariant
#### 3.2 Lariat aligner
To use [lariat](https://github.com/10XGenomics/lariat) for alignment you need to manually install it within your environment. For help on installation see [the following instructions](doc/lariat_install.rst). To enable mapping using lariat, run:
blr config --set read_mapper lariat
#### 3.3 NAIBR (older versions)
The latest version of the [NAIBR repo](https://github.com/raphael-group/NAIBR) will be downloaded and used automatically. If you want to use another version of NAIBR this can be set through:
blr config --set naibr_path /path/to/NAIBR/
### 4. Reusing conda environments
Snakemake will generate separate conda environments for certain tools, e.g.
NAIBR, when needed. These are by default generated in the `.snakemake/conda/`
folder within the analysis directory. To reuse the same
environment across different runs, it is possible to set the environment
variable `$CONDA_ENVS` to the path of a common directory where
environments can be reused or generated as needed. To set the environment
variable temporarily one can use:
export CONDA_ENVS=/path/to/common/conda-envs/
It is also possible to set this variable as a part of the main conda
environment (in this case `blr`) using the following command:
conda env config vars set CONDA_ENVS=/path/to/common/conda-envs/ -n blr
Deactivate and re-activate the environment for the change to take effect. To
remove this variable from the environment run:
conda env config vars unset CONDA_ENVS -n blr
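For the temporary variant, a small wrapper can make sure the shared directory exists before the variable is exported. This is a sketch only: `use_shared_conda_envs` is a hypothetical helper, and creating the directory up front is a convenience rather than something the pipeline is documented to require.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create the shared environment directory if needed
# and export CONDA_ENVS for the current shell session.
use_shared_conda_envs() {
    local dir="$1"
    if [ -z "$dir" ]; then
        echo "usage: use_shared_conda_envs <directory>" >&2
        return 1
    fi
    mkdir -p -- "$dir" || return 1
    export CONDA_ENVS="$dir"
}

# Example (path is a placeholder):
#   use_shared_conda_envs /path/to/common/conda-envs/
```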
## Development
For more information on development go [here](doc/develop.rst).
## Citation
The BLR pipeline is outlined in:
> Höjer, P., Frick, T., Siga, H. *et al.* BLR: a flexible pipeline for haplotype analysis of multiple linked-read technologies, *Nucleic Acids Research*, gkad1010 (2023). https://doi.org/10.1093/nar/gkad1010
This is the main citation for the BLR pipeline.
BLR was originally developed for the pre-processing of [Droplet Barcode Sequencing (DBS)](doc/platforms.rst#dbs) data for input into the 10x LongRanger pipeline; see the paper:
> Redin, D., Frick, T., Aghelpasand, H. *et al.* High throughput barcoding method for genome-scale phasing. *Sci Rep* 9, 18116 (2019). https://doi.org/10.1038/s41598-019-54446-x
It has since been heavily modified to run completely independent of LongRanger. To run the analysis described in [Redin et al. 2019][2], look at the [stable branch](https://github.com/AfshinLab/BLR/tree/stable) of this git repository. That version is also available at [OMICtools](https://omictools.com/blr-tool).
[1]: https://doi.org/10.1093/nar/gkad1010 "Höjer et al. 2023"
[2]: https://doi.org/10.1038/s41598-019-54446-x "Redin et al. 2019"