An open API service indexing awesome lists of open source software.

https://github.com/alejandrogzi/bed2gtf

high-performance BED-to-GTF converter written in Rust
https://github.com/alejandrogzi/bed2gtf

bed bioinformatics gene-annotation genes genome-annotation gtf

Last synced: 8 months ago
JSON representation

high-performance BED-to-GTF converter written in Rust

Awesome Lists containing this project

README

          



bed2gtf



Version Badge


Crates.io Version


GitHub License


Crates.io Total Downloads


Conda Platform


A high-performance bed-to-gtf converter written in Rust.

// translates

chr27 17266469 17281218 ENST00000541931.8 1000 + 17266469 17281218 0,0,200 2 103,74, 0,14675,

// into

chr27 bed2gtf gene 17266470 17285418 . + . gene_id "ENSG00000151743";

chr27 bed2gtf transcript 17266470 17281218 . + . gene_id "ENSG00000151743"; transcript_id "ENST00000541931.8";

chr27 bed2gtf exon 17266470 17266572 . + . gene_id "ENSG00000151743"; transcript_id "ENST00000541931.8"; exon_number "1"; exon_id "ENST00000541931.8.1";

...

Converts
- *Homo sapiens* GRCh38 GENCODE 44 (252,835 transcripts) in 3.25 seconds.
- *Mus musculus* GRCm39 GENCODE 44 (149,547 transcritps) in 1.99 seconds.
- *Canis lupus familiaris* ROS_Cfam_1.0 Ensembl 110 (55,335 transcripts) in 1.20 seconds.
- *Gallus galus* bGalGal1 Ensembl 110 (72,689 transcripts) in 1.36 seconds.

> What's new on v.1.9.3
>
> - Fixes a bug with .gz decoder
> - Implements reading .bed.gz files!
> - Fixes bug described in issue #11 with versioning

## Usage
``` rust
Usage: bed2gtf[EXE] --bed/-b --isoforms/-i --output/-o

Arguments:
-b, --bed : a .bed file
-i, --isoforms : a tab-delimited file
-o, --output : path to output file
-g, --gz[=] Compress output file [default: false] [possible values: true, false]
-n, --no-gene[=] Flag to disable gene_id feature [default: false] [possible values: true, false]

Options:
--help: print help
--version: print version
--threads/-t: number of threads (default: max ncpus)
--gz: compress output .gtf
```

> [!WARNING]
>
> All the transcripts in .bed file should appear in the isoforms file.

> [!TIP]
>
> Here are some commands to get you started:
> ```bash
> # convert a .bed file to .gtf (if you have an isoforms file [gene -> transcript names] and want gene_ids in the output .gtf)
> bed2gtf -b file.bed -i isoforms.txt -o file.gtf
>
> # convert a ,.bed file to .gtf without isoforms [same things as UCSC bedToGtf]
> bed2gtf -b file.bed -o file.gtf --no-gene
>
> # convert a .bed.gz file to a .gtf [with or without isoforms]
> bed2gtf -b file.bed.gz -i isoforms.txt -o file.gtf --gz
> bed2gtf -b file.bed.gz -o file.gtf --gz --no-gene
>
> # convert a .bed.gz to a .gtf.gz [with or without isoforms]
> bed2gtf -b file.bed.gz -i isoforms.txt -o file.gtf --gz
> bed2gtf -b file.bed.gz -o file.gtf --gz --no-gene
>

#### crate: [https://crates.io/crates/bed2gtf](https://crates.io/crates/bed2gtf)

click for detailed formats


bed2gtf just needs two files:

1. a .bed file

tab-delimited files with 3 required and 9 optional fields:

```
chrom chromStart chromEnd name ...
| | | |
chr20 50222035 50222038 ENST00000595977 ...
```

see [BED format](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) for more information

2. a tab-delimited .txt/.tsv/.csv/... file with genes/isoforms (all the transcripts in .bed file should appear in the isoforms file):

```
> cat isoforms.txt

ENSG00000198888 ENST00000361390
ENSG00000198763 ENST00000361453
ENSG00000198804 ENST00000361624
ENSG00000188868 ENST00000595977
```

you can build a custom file for your preferred species using [Ensembl BioMart](https://www.ensembl.org/biomart/martview).

## Installation
to install bed2gtf on your system follow this steps:
1. get rust: `curl https://sh.rustup.rs -sSf | sh` on unix, or go [here](https://www.rust-lang.org/tools/install) for other options
2. run `cargo install bed2gtf` (make sure `~/.cargo/bin` is in your `$PATH` before running it)
4. use `bed2gtf` with the required arguments
5. enjoy!

## Build
to build bed2gtf from this repo, do:

1. get rust (as described above)
2. run `git clone https://github.com/alejandrogzi/bed2gtf.git && cd bed2gtf`
3. run `cargo run --release -- -b -i -o `

## Container image
to build the development container image:
1. run `git clone https://github.com/alejandrogzi/bed2gtf.git && cd bed2gtf`
2. initialize docker with `start docker` or `systemctl start docker`
3. build the image `docker image build --tag bed2gtf .`
4. run `docker run --rm -v "[dir_where_your_gtf_is]:/dir" bed2gtf -b /dir/ -i /dir/ -o /dir/`

## Conda
to use bed2gtf through Conda just:
1. `conda install bed2gtf -c bioconda` or `conda create -n bed2gtf -c bioconda gtfsort`

## Output

bed2gtf will send the output directly to the same .bed file path if you specify so

```
bed2gtf annotation.bed isoforms.txt output.gtf

.
├── ...
├── isoforms.txt
├── annotation.bed
└── output.gtf
```
where `output.gtf` is the result.

## FAQ
### Why?
UCSC offers a fast way to convert BED into GTF files through KentUtils or specific binaries (1) + several other bioinformaticians have shared scripts trying to replicate a similar solution (2,3,4).

A GTF file is a 9-column tab-delimited file that holds gene annotation data for a specific assembly (5). The 9th column defines the attributes of each entry. This field is important, as some post-processing tools that handle GTF files need them to extract gene information (e.g. STAR, arriba, etc). An incomplete GTF attribute field would probably lead to annotation-related errors in these software.

Of the available tools/scripts mentioned above, none produce a fully functional attribute GTF file conversion. (1) uses a two-step approach (bedToGenePred | genePredToGtf) written in C, which is extremely fast. Since a .bed file does not preserve any gene-related information, this approach fails to a) include correct gene_id attributes (duplicated transcript_ids) if no refTable is included b) append 3rd column gene features.

This is an example:

```
chr27 stdin transcript 17266470 17281218 . + . gene_id "ENST00000541931.8"; transcript_id "ENST00000541931.8";

chr27 stdin exon 17266470 17266572 . + . gene_id "ENST00000541931.8"; transcript_id "ENST00000541931.8"; exon_number "1"; exon_id "ENST00000541931.8.1";
```

On the other hand, available scripts (2,3,4) fall into bad-formatted outputs unable to be used as input to other tools. Some of them show a very customed format, far from a complete GTF file (2):

```
chr20 ---- peak 50222035 50222038 . + . peak_id "chr20_50222035_50222038";

chr20 ---- peak 50188548 50189130 . + . peak_id "chr20_50188548_50189130";
```
and others (4) just provide exon-related information:

```
chr20 ensembl exon 50222035 50222038 . + . gene_id "ENST00000595977.1735"; transcript_id "ENST00000595977.1735"; exon_number "0

chr20 ensembl exon 50188548 50188930 . + . gene_id "ENST00000595977.3403"; transcript_id "ENST00000595977.3403"; exon_number "0
```

This is where bed2gtf comes in: a fast and memory efficient BED-to-GTF converter written in Rust. In ~4 seconds this tool produces a fully functional GTF converted file with all the needed features needed for post-processing tools.

### How?
bed2gtf is basically the reimplementation of C binaries merged in 1 step. This tool evaluates the position of k exons in j transcript, calculates start/stop/codon/UTR positions preserving reading frames and adjust the index + 1 (to be compatible with GTF convention). The isoforms file works as the refTable in C binaries to map each transcript to their respective gene; however, bed2gtf takes advantage of this and adds an additional "gene" line (to be compatible with other tools).

## References

1. http://hgdownload.soe.ucsc.edu/admin/exe/
2. https://github.com/pfurio/bed2gtf
3. https://rdrr.io/github/wyguo/RTDBox/src/R/gtf2bed.R
4. https://github.com/MikeDacre/mike_tools/blob/master/bin/bed2gtf.py
5. https://www.ensembl.org/info/website/upload/gff.html