https://github.com/dridk/pacbio_rna_seq

Host: GitHub
URL: https://github.com/dridk/pacbio_rna_seq
Owner: dridk
Created: 2021-11-10T10:30:54.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2022-04-11T12:49:35.000Z (about 3 years ago)
Last Synced: 2025-01-31T17:52:30.776Z (3 months ago)
Language: Python
Size: 132 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: Readme.md

README

This pipeline was created as part of the [GOLD project](https://aviesan.fr/fr/aviesan/accueil/menu-header/instituts-thematiques-multi-organismes/genetique-genomique-et-bioinformatique/programme-transversal-gold).

## Installation

#### Dependencies 

* [python >= 3.9](https://www.python.org/downloads)

   - seaborn

   - pandas

   - matplotlib

* [seqkit](https://bioinf.shenwei.me/seqkit/)

* [lima](https://lima.how/)

* [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

* [minimap2](https://lh3.github.io/minimap2/)

* [samtools](http://www.htslib.org/)

* [bedtools](https://bedtools.readthedocs.io/en/latest/) 
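
If the tools are installed by hand rather than through conda, it can help to confirm that they are all on `PATH` before running the pipeline. The following is a minimal sketch and not part of the repository: the function name `missing_tools` is hypothetical, and the executable names are assumed to match the tool names listed above.

```python
import shutil

# Executable names are assumed to match the tool names listed above.
REQUIRED_TOOLS = ["seqkit", "lima", "fastqc", "minimap2", "samtools", "bedtools"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```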

#### Install the environment with conda

```bash

conda env create -n gold -f env.yaml

```

## Usage 

#### Clone the repository

```bash

git clone git@github.com:dridk/pacbio_rna_seq.git

```

#### Edit config.yaml

- ```FASTQ``` Path to the FASTQ file generated by PacBio sequencing

- ```BARCODE``` Path to the FASTA file describing the barcodes used by lima for demultiplexing (see the example in the repository)

- ```PRIMERS``` Path to the FASTA file describing the primers used for PacBio amplicon sequencing (see the example in the repository)

- ```REFERENCE``` Path to the FASTA reference file used by minimap2 for alignment (e.g. hg19.fa)
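
A quick sanity check of these four entries before launching the pipeline can catch a missing key or a mistyped path early. The sketch below is an illustration and not part of the repository: `check_config` is a hypothetical helper, and it assumes the configuration has already been loaded into a plain Python mapping (for example with a YAML parser).

```python
from pathlib import Path

# Keys described above; the validation itself is an illustrative assumption.
REQUIRED_KEYS = ("FASTQ", "BARCODE", "PRIMERS", "REFERENCE")


def check_config(config):
    """Return a list of human-readable problems found in a config mapping."""
    problems = []
    for key in REQUIRED_KEYS:
        value = config.get(key)
        if not value:
            # Key is absent, empty, or None.
            problems.append(f"{key} is not set")
        elif not Path(value).is_file():
            problems.append(f"{key} points to a missing file: {value}")
    return problems
```

An empty list means every key is set and every path points to an existing file.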

#### Run the pipeline 

Put the ```your_file.fastq``` file generated by PacBio in the same folder as *config.yaml* and run the following command.

Set the number of threads with the ```--cores``` option.

```bash

snakemake -Fp --cores 10 --configfile config.yaml 

```

## Output 

The pipeline generates one set of files per barcode and amplicon.

For instance, HBB.bc1022.bam contains the aligned reads for the HBB amplicon and the bc1022 barcode identifier.

- ```debarcoding.{barcode}--{barcode}.fastq``` : Demultiplexed reads 

- ```{amplicon}.{barcode}.fastq```  : Transcript reads

- ```{amplicon}.{barcode}.bam```  : Aligned transcript reads

- ```{amplicon}.{barcode}.bed```  : Transcript structures as a BED file

- ```{amplicon}.{barcode}.hash.bed```  : Transcript structures as a BED file, with a unique ID identifying each transcript

- ```{amplicon}.{barcode}.hash.png```  : Distribution plot of the transcripts

- ```cluster.{amplicon}.png```  : Transcript abundance heatmap
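
When post-processing results, the naming scheme above can be used to recover the amplicon and barcode from a file name. This is a minimal sketch under stated assumptions: `parse_output_name` is a hypothetical helper, and it assumes that neither the amplicon nor the barcode contains a dot. The `debarcoding.*` and `cluster.*` files are deliberately not matched.

```python
import re

# Matches names such as HBB.bc1022.bam or HBB.bc1022.hash.bed.
# Assumes neither the amplicon nor the barcode contains a dot.
OUTPUT_NAME = re.compile(
    r"^(?!debarcoding\.)"
    r"(?P<amplicon>[^.]+)\.(?P<barcode>[^.]+)\."
    r"(?P<kind>hash\.bed|hash\.png|fastq|bam|bed)$"
)


def parse_output_name(filename):
    """Split a per-sample output file name into amplicon, barcode and kind."""
    match = OUTPUT_NAME.match(filename)
    if match is None:
        return None
    return match.groupdict()
```

For example, `parse_output_name("HBB.bc1022.bam")` returns a dictionary with `amplicon` set to `HBB`, `barcode` set to `bc1022` and `kind` set to `bam`.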

For instance, the following heatmap shows transcript abundances for each barcode. 

Each transcript is identified by a hash number generated from the transcript structure bed file. 

This makes it possible to identify the same transcript across different samples.

![](https://github.com/dridk/pacbio_rna_seq/blob/5eadf2b089f1a6839397985baf873084898598b3/cluster.ACKR1.png)
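
The README does not say how the hash is computed. The sketch below shows one way a structure-based identifier could be derived, so that two reads with the same exon layout get the same ID regardless of read name or score. It is an assumption, not the pipeline's actual method: it supposes a BED12 record, and the function name `transcript_id` and the use of a truncated SHA-1 digest are illustrative choices.

```python
import hashlib


def transcript_id(bed12_line, length=8):
    """Derive a short, stable ID from the exon structure of a BED12 record."""
    fields = bed12_line.rstrip("\n").split("\t")
    chrom, start, strand = fields[0], int(fields[1]), fields[5]
    # blockSizes and blockStarts are comma-separated, often with a trailing comma.
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    offsets = [int(x) for x in fields[11].rstrip(",").split(",")]
    # Absolute (start, end) of each exon block.
    exons = [(start + off, start + off + size) for size, off in zip(sizes, offsets)]
    key = f"{chrom}:{strand}:" + ",".join(f"{s}-{e}" for s, e in exons)
    return hashlib.sha1(key.encode()).hexdigest()[:length]
```

Because only the chromosome, strand and exon coordinates enter the key, the ID stays the same for every barcode in which a given transcript structure appears.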
