https://github.com/alejandrogzi/bed2gff

cool BED-to-GFF3 converter that runs in parallel

![Crates.io](https://img.shields.io/crates/v/bed2gff?color=green)
![GitHub](https://img.shields.io/github/license/alejandrogzi/bed2gff?color=blue)
![Crates.io Total Downloads](https://img.shields.io/crates/d/bed2gff)
![Conda Platform](https://img.shields.io/conda/pn/bioconda/bed2gff)

# **bed2gff**

A Rust BED-to-GFF3 parallel translator.

translates
```
chr7 56766360 56805692 ENST00000581852.25 1000 + 56766360 56805692 0,0,200 3 3,135,81, 0,496,39251,
```
into
```
chr7 bed2gff gene 56399404 56805892 . + . ID=ENSG00000166960;gene_id=ENSG00000166960

chr7 bed2gff transcript 56766361 56805692 . + . ID=ENST00000581852.25;Parent=ENSG00000166960;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25

chr7 bed2gff exon 56766361 56766363 . + . ID=exon:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1

chr7 bed2gff CDS 56766361 56766363 . + 0 ID=CDS:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1

...

chr7 bed2gff start_codon 56766361 56766363 . + 0 ID=start_codon:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1

chr7 bed2gff stop_codon 56805690 56805692 . + 0 ID=stop_codon:ENST00000581852.25.3;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=3

...
```

in a few seconds.
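
The coordinates in the example can be reproduced by hand: BED starts are 0-based with exclusive ends, while GFF3 is 1-based and inclusive, so only the start shifts by one. The following is a minimal Python sketch of that arithmetic, not bed2gff's actual Rust code; the function name `bed12_exons` is made up for illustration, and the input is assumed to be a well-formed 12-column BED line.

```python
def bed12_exons(line):
    """Return 1-based, inclusive (start, end) pairs for each block of a BED12 line."""
    fields = line.split()
    chrom_start = int(fields[1])
    # blockSizes and blockStarts are comma-separated and may end with a comma
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]
    # BED start is 0-based, BED end is exclusive: add 1 to the start only
    return [(chrom_start + off + 1, chrom_start + off + size)
            for size, off in zip(sizes, offsets)]


line = ("chr7 56766360 56805692 ENST00000581852.25 1000 + 56766360 56805692 "
        "0,0,200 3 3,135,81, 0,496,39251,")
for start, end in bed12_exons(line):
    print(start, end)
```

The first pair printed is `56766361 56766363`, which matches the first exon line in the output above.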

Converts
- *Homo sapiens* GRCh38 GENCODE 44 (252,835 transcripts) in 4.16 seconds.
- *Mus musculus* GRCm39 GENCODE 44 (149,547 transcripts) in 2.15 seconds.
- *Canis lupus familiaris* ROS_Cfam_1.0 Ensembl 110 (55,335 transcripts) in 1.30 seconds.
- *Gallus gallus* bGalGal1 Ensembl 110 (72,689 transcripts) in 1.51 seconds.

> What's new in v0.1.5
>
> - Adds a `--no-gene` flag to run the conversion without an isoforms file.
> - Makes `-i` required unless `--no-gene` is set.
> - Refactors BedRecord.

## Usage
``` text
Usage:
a) bed2gff[EXE] --bed --isoforms --output
b) bed2gff[EXE] --bed --output --no-gene

Arguments:
-b, --bed : a .bed file
-i, --isoforms : a tab-delimited file
-o, --output : path to output file
-n, --no-gene : Flag to disable gene_id feature [default: false]

Options:
--help: print help
--version: print version
--threads/-t: number of threads (default: max cpus)
--gz: compress output .gff3
```

>[!WARNING]
>
>Every transcript in the .bed file must appear in the isoforms file.

#### crate: [https://crates.io/crates/bed2gff](https://crates.io/crates/bed2gff)


bed2gff just needs two files:

1. a .bed file

tab-delimited files with 3 required and 9 optional fields:

```
chrom chromStart chromEnd name ...
| | | |
chr20 50222035 50222038 ENST00000595977 ...
```

see [BED format](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) for more information

2. a tab-delimited .txt/.tsv/.csv/... file with gene IDs in the first column and isoform (transcript) IDs in the second (every transcript in the .bed file must appear in the isoforms file):

```
> cat isoforms.txt

ENSG00000198888 ENST00000361390
ENSG00000198763 ENST00000361453
ENSG00000198804 ENST00000361624
ENSG00000188868 ENST00000595977
```

you can build a custom file for your preferred species using [Ensembl BioMart](https://www.ensembl.org/biomart/martview).
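
Because every transcript in the .bed file must be present in the isoforms file, it can help to check the two files against each other before running the conversion. The sketch below is one way to do that in Python; it is not part of bed2gff, the helper name `missing_transcripts` is hypothetical, and it assumes the gene ID is in column 1 and the transcript ID in column 2, as in the example above.

```python
def missing_transcripts(bed_lines, isoform_lines):
    """Return BED transcript names that have no entry in the isoforms table."""
    tx_to_gene = {}
    for row in isoform_lines:
        parts = row.split()
        if len(parts) >= 2:
            # column 1: gene ID, column 2: transcript ID
            tx_to_gene[parts[1]] = parts[0]
    missing = []
    for row in bed_lines:
        fields = row.split()
        # the transcript name is the 4th BED column; skip shorter lines
        if len(fields) >= 4 and fields[3] not in tx_to_gene:
            missing.append(fields[3])
    return missing


isoforms = [
    "ENSG00000198888\tENST00000361390",
    "ENSG00000188868\tENST00000595977",
]
bed = ["chr20\t50222035\t50222038\tENST00000595977"]
print(missing_transcripts(bed, isoforms))
```

An empty list means every BED transcript can be mapped to a gene.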

## Installation
to install bed2gff on your system follow these steps:
1. get rust: `curl https://sh.rustup.rs -sSf | sh` on unix, or go [here](https://www.rust-lang.org/tools/install) for other options
2. run `cargo install bed2gff` (make sure `~/.cargo/bin` is in your `$PATH` before running it)
3. use `bed2gff` with the required arguments
4. enjoy!

## Build
to build bed2gff from this repo, do:

1. get rust (as described above)
2. run `git clone https://github.com/alejandrogzi/bed2gff.git && cd bed2gff`
3. run `cargo run --release -- -b -i -o `

## Container image
to build the development container image:
1. run `git clone https://github.com/alejandrogzi/bed2gff.git && cd bed2gff`
2. initialize docker with `start docker` or `systemctl start docker`
3. build the image `docker image build --tag bed2gff .`
4. run `docker run --rm -v "[dir_where_your_bed_is]:/dir" bed2gff -b /dir/ -i /dir/ -o /dir/`

## Conda
to use bed2gff through Conda just:
1. `conda install bed2gff -c bioconda` or `conda create -n bed2gff -c bioconda bed2gff`

## Output

bed2gff writes the result to the path given with `--output`, which can be in the same directory as the .bed file:

```
bed2gff --bed annotation.bed --isoforms isoforms.txt --output output.gff3

.
├── ...
├── isoforms.txt
├── annotation.bed
└── output.gff3
```
where `output.gff3` is the result.
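
When bed2gff is called from a pipeline script, the command line can be assembled programmatically. The following Python sketch uses only the flags listed in the Usage section (`--bed`, `--isoforms`, `--output`, `--no-gene`, `--threads`); the wrapper functions `bed2gff_command` and `run_bed2gff` are hypothetical helpers, not something shipped with the tool.

```python
import subprocess


def bed2gff_command(bed, output, isoforms=None, threads=None):
    """Build the argument list for a bed2gff call."""
    cmd = ["bed2gff", "--bed", bed, "--output", output]
    if isoforms is None:
        # without an isoforms file, fall back to the --no-gene mode
        cmd.append("--no-gene")
    else:
        cmd += ["--isoforms", isoforms]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


def run_bed2gff(*args, **kwargs):
    """Run bed2gff and raise if it exits with a non-zero status."""
    subprocess.run(bed2gff_command(*args, **kwargs), check=True)


print(" ".join(bed2gff_command("annotation.bed", "output.gff3",
                               isoforms="isoforms.txt")))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.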

## FAQ
### Why?

Converting between formats is a daily task in bioinformatics, especially when working with gene annotations, because tools differ in their input/output layouts. GTF, GFF, and BED are the most widely used formats for storing gene annotations, yet available software does not cover conversion needs well.

Many genomic tools accept only GTF/GFF3 files, forcing BED users to translate their files into another format. While some of these issues have already been addressed for GTF files (e.g. [bed2gtf](https://github.com/alejandrogzi/bed2gtf)), the GFF3 layout lacks stable conversion tools (1, 2).

bed2gff is presented as a straightforward option to convert BED files into ready-to-use GFF3 files, closing that gap.

### How?
bed2gff builds on the code base of [bed2gtf](https://github.com/alejandrogzi/bed2gtf), which is essentially a reimplementation of UCSC's C binaries merged into a single step (bedToGenePred + genePredToGtf). The tool evaluates the position of exons and other features (CDS, start/stop codons, UTRs), preserving reading frames and adjusting the coordinate indexing. The main approach is now a parallel algorithm that significantly reduces computation time.
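
To illustrate what "preserving reading frames" involves, the Python sketch below clips exons to the coding region (BED `thickStart`/`thickEnd`) and assigns a GFF3 phase to each CDS segment. It is a simplified illustration under stated assumptions, not bed2gff's implementation: the function name `cds_segments` is hypothetical, exons are assumed to be sorted by position and already in 1-based inclusive coordinates, and the stop codon is kept inside the CDS, which bed2gff may handle differently.

```python
def cds_segments(exons, thick_start, thick_end, strand):
    """Clip exons to the coding region and return (start, end, phase) tuples.

    exons: sorted 1-based inclusive (start, end) pairs.
    thick_start, thick_end: BED thickStart (0-based) and thickEnd (exclusive).
    """
    cds_lo, cds_hi = thick_start + 1, thick_end
    clipped = [(max(s, cds_lo), min(e, cds_hi)) for s, e in exons]
    clipped = [(s, e) for s, e in clipped if s <= e]  # drop pure-UTR exons
    # phases are counted in transcription order, so reverse on the minus strand
    ordered = clipped if strand == "+" else clipped[::-1]
    out, consumed = [], 0
    for s, e in ordered:
        # phase = bases to skip to reach the first base of the next codon
        phase = (3 - consumed % 3) % 3
        out.append((s, e, phase))
        consumed += e - s + 1
    return out


exons = [(56766361, 56766363), (56766857, 56766991), (56805612, 56805692)]
for seg in cds_segments(exons, 56766360, 56805692, "+"):
    print(seg)
```

For the example transcript the three segments are 3, 135 and 81 bases long, so every segment starts on a codon boundary and gets phase 0, consistent with the CDS line shown earlier.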

Following the rationale of [bed2gtf](https://github.com/alejandrogzi/bed2gtf), bed2gff produces a ready-to-use GFF3 file by using an isoforms file, which works like the refTable in the C binaries and maps each transcript to its gene.
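
One consequence of the transcript-to-gene mapping is that a `gene` line can span several transcripts. The Python sketch below shows one plausible way to derive such lines: take the lowest start and highest end over all isoforms of a gene. This is an assumption about the approach, not a description of bed2gff's internals; `gene_lines` and the toy identifiers (`tx1`, `tx2`, `geneA`) are made up for illustration.

```python
def gene_lines(transcripts, tx_to_gene, source="bed2gff"):
    """Build GFF3 gene lines from BED transcripts and a transcript->gene map.

    transcripts: iterable of (chrom, start0, end, name, strand) taken from BED.
    """
    spans = {}  # gene ID -> (chrom, strand, lowest start, highest end)
    for chrom, start0, end, name, strand in transcripts:
        gene = tx_to_gene[name]
        lo, hi = start0 + 1, end  # 0-based BED start -> 1-based GFF3 start
        if gene in spans:
            _, _, old_lo, old_hi = spans[gene]
            lo, hi = min(lo, old_lo), max(hi, old_hi)
        spans[gene] = (chrom, strand, lo, hi)
    lines = []
    for gene, (chrom, strand, lo, hi) in spans.items():
        attrs = f"ID={gene};gene_id={gene}"
        lines.append("\t".join(
            [chrom, source, "gene", str(lo), str(hi), ".", strand, ".", attrs]))
    return lines


# toy data, not real annotations
toy = [("chr1", 99, 200, "tx1", "+"), ("chr1", 149, 300, "tx2", "+")]
print(gene_lines(toy, {"tx1": "geneA", "tx2": "geneA"})[0])
```

With the toy input, the two overlapping isoforms collapse into a single gene line covering positions 100 to 300.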

## References

1. https://bioinformatics.stackexchange.com/questions/2242/how-to-convert-bed-to-gff3
2. https://www.biostars.org/p/2/