https://github.com/tdayris/fair_genome_indexer
Download and index Ensembl sequences and annotations, remove non-canonical chromosimes, remove low TSL, index with multiple tools
https://github.com/tdayris/fair_genome_indexer
ensembl fair snakemake snakemake-workflow snakemake-wrappers
Last synced: 5 months ago
JSON representation
Download and index Ensembl sequences and annotations, remove non-canonical chromosimes, remove low TSL, index with multiple tools
- Host: GitHub
- URL: https://github.com/tdayris/fair_genome_indexer
- Owner: tdayris
- License: mit
- Created: 2023-11-15T12:54:12.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-07-01T08:14:58.000Z (11 months ago)
- Last Synced: 2025-09-05T06:30:03.789Z (9 months ago)
- Topics: ensembl, fair, snakemake, snakemake-workflow, snakemake-wrappers
- Language: Python
- Homepage:
- Size: 1.02 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Citation: CITATION.cff
Awesome Lists containing this project
README
[](https://snakemake.github.io)
[](https://github.com/tdayris/fair_genome_indexer/actions?query=branch%3Amain+workflow%3ATests)
Snakemake workflow used to deploy and perform basic indexes of genome sequence.
This is done for teaching purpose as an example of FAIR principles applied with
Snakemake.
## Usage
The usage of this workflow is described in the [Snakemake workflow catalog](https://snakemake.github.io/snakemake-workflow-catalog?usage=tdayris/fair_genome_indexer), it is also available [locally](https://github.com/tdayris/fair_genome_indexer/blob/main/workflow/reports/usage.rst) on a single page.
## Results
The expected results of this pipeline are described [here](https://github.com/tdayris/fair_genome_indexer/blob/main/workflow/reports/results.rst).
## Material and methods
The tools used in this pipeline are described [here](https://github.com/tdayris/fair_genome_indexer/blob/main/workflow/reports/workflow.rst) textually.
## Step by step
### Get DNA sequences
| Step | Commands |
| -------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| Download DNA Fasta from Ensembl | [ensembl-sequence](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/reference/ensembl-sequence.html) |
| Remove non-canonical chromosomes | [pyfaidx](https://github.com/mdshw5/pyfaidx) |
| Index DNA sequence | [samtools](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/samtools/faidx.html) |
| Creatse sequence Dictionary | [picard](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/picard/createsequencedictionary.html) |
```
┌────────────────────────────────────────┐
│Download Ensembl Sequence (wget + gzip) │
└──────────────────┬─────────────────────┘
│
│
┌──────────────────▼────────────────────────┐
│Remove non-canonical chromosomes (pyfaidx) │
└──────────────────┬──────────────────────┬─┘
│ │
│ │
┌──────────────────▼──────────┐ ┌─▼───────────────────────────────────┐
│Index DNA Sequence (samtools)│ │Create sequence dictionary (Picard) │
└─────────────────────────────┘ └─────────────────────────────────────┘
```
### Get genome annotation (GTF)
| Step | Commands |
| ---------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| Download GTF annotation | [ensembl-annotation](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/reference/ensembl-annotation.html) |
| Fix format errors | [Agat](https://agat.readthedocs.io/en/v4.5.0/tools/agat_convert_sp_gff2gtf.html) |
| Remove non-canonical chromosomes, based on above DNA Fasta | [Agat](https://agat.readthedocs.io/en/v4.5.0/tools/agat_sq_filter_feature_from_fasta.html) |
| Remove `` Transcript support levels | [Agat](https://agat.readthedocs.io/en/v4.5.0/tools/agat_sp_filter_feature_by_attribute_value.html) |
| Convert GTF to GenePred format | [gtf2genepred](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/ucsc/gtftogenepred.html) |
```
┌─────────────────────────────────────────┐
│Download Ensembl Annotation (wget + gzip)│
└─────────────┬───────────────────────────┘
│
│
┌─────────────▼─────────┐
│Fix format Error (Agat)│
└─────────────┬─────────┘
│
│
┌─────────────▼─────────────────────────┐ ┌────────────────────────────────────────┐
│Remove non-canonical chromosomes (Agat)◄───────────┤Fasta sequence index (see Get DNA Fasta)│
└─────────────┬─────────────────────────┘ └────────────────────────────────────────┘
│
│
┌─────────────▼───────────────────────┐
│Remove transcript levels (Agat) │
└─────────────┬───────────────────────┘
│
│
┌─────────────▼────────────────┐
│Convert GTF to GenePred (UCSC)│
└──────────────────────────────┘
```
### Get transcripts sequence
| Step | Commands |
| --------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- |
| Extract transcript sequences from above DNA Fasta and GTF | [gffread](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/gffread.html) |
| Index DNA sequence | [samtools](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/samtools/faidx.html) |
| Creatse sequence Dictionary | [picard](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/picard/createsequencedictionary.html) |
```
┌───────────────────────────────┐ ┌─────────────────────────────┐
│GTF (see get genome annotation)│ │DNA Fasta (See get dna fasta)│
└────────────────────┬──────────┘ └────────┬────────────────────┘
│ │
│ │
┌──────▼───────────────────────────▼─────┐
│Extract transcripts sequences (gffread) │
└──────┬───────────────────────────┬─────┘
│ │
│ │
┌────────────────────▼────┐ ┌────────▼───────────────────────────┐
│Index sequence (samtools)│ │Create sequence dictionary (Picard) │
└─────────────────────────┘ └────────────────────────────────────┘
```
### Get cDNA sequences
| Step | Commands |
| ----------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- |
| Extract coding transcripts from above GTF | [Agat](https://agat.readthedocs.io/en/v4.5.0/tools/agat_sp_filter_feature_by_attribute_value.html) |
| Extract coding sequences from above DNA Fasta and GTF | [gffread](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/gffread.html) |
| Index DNA sequence | [samtools](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/samtools/faidx.html) |
| Creatse sequence Dictionary | [picard](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/picard/createsequencedictionary.html) |
```
┌───────────────────────────────┐ ┌─────────────────────────────┐
│GTF (see get genome annotation)│ │DNA Fasta (See get dna fasta)│
└────────────────────┬──────────┘ └────────┬────────────────────┘
│ │
│ │
┌──────▼───────────────────────────▼─────┐
│Extract cDNA sequences (gffread) │
└──────┬───────────────────────────┬─────┘
│ │
│ │
┌────────────────────▼────┐ ┌────────▼───────────────────────────┐
│Index sequence (samtools)│ │Create sequence dictionary (Picard) │
└─────────────────────────┘ └────────────────────────────────────┘
```
### Get dbSNP variants
| Step | Commands |
| -------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| Download dbSNP variants | [ensembl-variation](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/reference/ensembl-variation.html) |
| Filter non-canonical chromosomes | [pyfaidx](https://github.com/mdshw5/pyfaidx) + [BCFTools](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/bcftools/filter.html) |
| Index variants | [tabix](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/tabix/index.html) |
```
┌──────────────────────────────────────────┐
│Download dbSNP variants (wget + bcftools) │
└──────────┬───────────────────────────────┘
│
│
┌──────────▼───────────────────────────────────────────┐
│Remove non-canonical chromosomes (bcftools + bedtools)│
└──────────┬───────────────────────────────────────────┘
│
│
┌──────────▼─────────────┐
│Index variants (tabix) │
└────────────────────────┘
```
### Get transcript_id, gene_id, and gene_name correspondancy
| Step | Commands |
| ----------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Extract gene_id <-> gene_name correspondancy | [pyroe](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/pyroe/idtoname.html) |
| Extract transcript_id <-> gene_id <-> gene_name | [Agat](https://agat.readthedocs.io/en/v4.5.0/tools/agat_convert_sp_gff2tsv.html) + [XSV](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/xsv.html) |
```
┌────────────────────────────────┐
│Genome annotation (see get GTF) ├──────────────────┐
└──────┬─────────────────────────┘ │
│ │
│ │
┌──────▼──────────────────────────────┐ ┌────────▼─────────────────────────────────────────────┐
│Extract gene_id <-> gene_name (pyroe)│ │Extract gene_id <-> gene_name <-> transcript_id (Agat)│
└──────┬──────────────────────────────┘ └────────┬─────────────────────────────────────────────┘
│ │
│ │
┌──────▼─────┐ ┌────────▼────┐
│Format (XSV)│ │Format (XSV) │
└────────────┘ └─────────────┘
```
### Get blacklisted regions
| Step | Commands |
| ---------------------------- | -------------------------------------------------------------------------------------------- |
| Download blacklisted regions | [Github source](https://github.com/Boyle-Lab/Blacklist/tree/master/lists) |
| Merge overlapping intervals | [bedtools](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/bedtools/merge.html) |
```
┌────────────────────────────────┐
│Download known blacklists (wget)│
└────────────┬───────────────────┘
│
│
┌────────────▼──────────────────────────┐
│Merge overlapping intervals (bedtools) │
└───────────────────────────────────────┘
```
# GenePred format
| Step | Commands |
| --------------- | -------------------------------------------------------------------------------------------------- |
| GTF to GenePred | [UCSC-tools](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/ucsc/gtfToGenePred.html) |
```
┌────────────────────────────────┐
│Genome annotation (see get GTF) │
└────────────┬───────────────────┘
│
│
┌────────────▼──────────────┐
│GTFtoGenePred (UCSC-tools) │
└───────────────────────────┘
```
# 2bit format
| Step | Commands |
| --------------- | -------------------------------------------------------------------------------------------------- |
| Fasta to 2bit | [UCSC-tools](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/ucsc/faToTwoBit.html) |
```
┌────────────────────────────────┐
│Genome sequence (see get Fasta) │
└────────────┬───────────────────┘
│
│
┌────────────▼──────────────┐
│FaToTwoBit (UCSC-tools) │
└───────────────────────────┘
```
# STAR index
| Step | Commands |
| --------------- | -------------------------------------------------------------------------------------------------- |
| STAR index | [STAR](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/star/index.html) |
```
┌────────────────────────────────┐
│Genome sequence (see get DNA) │
└────────────┬───────────────────┘
│
│
┌───────▼────┐
│ STAR index │
└────────────┘
```
# Bowtie2 index
| Step | Commands |
| --------------- | -------------------------------------------------------------------------------------------------- |
| Bowtie2 build | [Bowtie2 build](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/bowtie2/build.html) |
```
┌────────────────────────────────┐
│Genome sequence (see get DNA) │
└────────────┬───────────────────┘
│
│
┌───────▼────┐
│ STAR index │
└────────────┘
```
# Salmon decoy aware gentrome index
| Step | Commands |
| --------------- | -------------------------------------------------------------------------------------------------- |
| Generate decoy | [Bash](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/salmon/decoys.html) |
| Salmon index | [Salmon](https://snakemake-wrappers.readthedocs.io/en/v7.0.0/wrappers/salmon/index.html) |
```
┌─────────────────────────────┐ ┌─────────────────────────────────────┐
│Genome sequence (see get DNA)│ │Transcriptome sequence (see get cDNA)│
└──────────────────────────┬──┘ └─────┬───────────────────────────────┘
│ │
│ │
│ │
┌────▼─────────────────▼────┐
│Generate decoy and gentrome│
└─────────────┬─────────────┘
│
┌─────────────────┐ │ ┌───────────────┐
│Gentrome sequence◄────────────────┴─────►Decoy sequences│
└────────────┬────┘ └────┬──────────┘
│ │
│ │
│ ┌──────────────┐ │
└───────► Salmon index ◄─────────┘
└──────────────┘
```