https://github.com/alejandrogzi/bed2gff
cool BED-to-GFF3 converter that runs in parallel
https://github.com/alejandrogzi/bed2gff
bed bioinformatics gene-annotation genome-annotation gff3
Last synced: 3 months ago
JSON representation
cool BED-to-GFF3 converter that runs in parallel
- Host: GitHub
- URL: https://github.com/alejandrogzi/bed2gff
- Owner: alejandrogzi
- License: mit
- Created: 2023-10-02T20:21:24.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2025-04-17T08:39:11.000Z (9 months ago)
- Last Synced: 2025-09-17T13:28:12.522Z (4 months ago)
- Topics: bed, bioinformatics, gene-annotation, genome-annotation, gff3
- Language: Rust
- Homepage:
- Size: 110 KB
- Stars: 9
- Watchers: 1
- Forks: 2
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README




# **bed2gff**
A Rust BED-to-GFF3 parallel translator.
translates
```
chr7 56766360 56805692 ENST00000581852.25 1000 + 56766360 56805692 0,0,200 3 3,135,81, 0,496,39251,
```
into
```
chr7 bed2gff gene 56399404 56805892 . + . ID=ENSG00000166960;gene_id=ENSG00000166960
chr7 bed2gff transcript 56766361 56805692 . + . ID=ENST00000581852.25;Parent=ENSG00000166960;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25
chr7 bed2gff exon 56766361 56766363 . + . ID=exon:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1
chr7 bed2gff CDS 56766361 56766363 . + 0 ID=CDS:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1
...
chr7 bed2gff start_codon 56766361 56766363 . + 0 ID=start_codon:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1
chr7 bed2gff stop_codon 56805690 56805692 . + 0 ID=stop_codon:ENST00000581852.25.3;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=3
...
```
in a few seconds.
Converts
- *Homo sapiens* GRCh38 GENCODE 44 (252,835 transcripts) in 4.16 seconds.
- *Mus musculus* GRCm39 GENCODE 44 (149,547 transcritps) in 2.15 seconds.
- *Canis lupus* familiaris ROS_Cfam_1.0 Ensembl 110 (55,335 transcripts) in 1.30 seconds.
- *Gallus gallus* bGalGal1 Ensembl 110 (72,689 transcripts) in 1.51 seconds.
> What's new on v.0.1.5
>
> - Adds `--no-gene` flag to only perform conversion without isoforms!
> - Modifies `-i` to be required unless `--no-gene` mode is present.
> - Refactors BedRecord.
## Usage
``` text
Usage:
a) bed2gff[EXE] --bed --isoforms --output
b) bed2gff[EXE] --bed --output --no-gene
Arguments:
-b, --bed : a .bed file
-i, --isoforms : a tab-delimited file
-o, --output : path to output file
-n, --no-gene : Flag to disable gene_id feature [default: false]
Options:
--help: print help
--version: print version
--threads/-t: number of threads (default: max cpus)
--gz: compress output .gtf
```
>[!WARNING]
>
>All the transcripts in .bed file should appear in the isoforms file.
#### crate: [https://crates.io/crates/bed2gff](https://crates.io/crates/bed2gff)
click for detailed formats
bed2gff just needs two files:
1. a .bed file
tab-delimited files with 3 required and 9 optional fields:
```
chrom chromStart chromEnd name ...
| | | |
chr20 50222035 50222038 ENST00000595977 ...
```
see [BED format](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) for more information
2. a tab-delimited .txt/.tsv/.csv/... file with genes/isoforms (all the transcripts in .bed file should appear in the isoforms file):
```
> cat isoforms.txt
ENSG00000198888 ENST00000361390
ENSG00000198763 ENST00000361453
ENSG00000198804 ENST00000361624
ENSG00000188868 ENST00000595977
```
you can build a custom file for your preferred species using [Ensembl BioMart](https://www.ensembl.org/biomart/martview).
## Installation
to install bed2gff on your system follow this steps:
1. get rust: `curl https://sh.rustup.rs -sSf | sh` on unix, or go [here](https://www.rust-lang.org/tools/install) for other options
2. run `cargo install bed2gff` (make sure `~/.cargo/bin` is in your `$PATH` before running it)
4. use `bed2gff` with the required arguments
5. enjoy!
## Build
to build bed2gff from this repo, do:
1. get rust (as described above)
2. run `git clone https://github.com/alejandrogzi/bed2gff.git && cd bed2gff`
3. run `cargo run --release -- -b -i -o `
## Container image
to build the development container image:
1. run `git clone https://github.com/alejandrogzi/bed2gff.git && cd bed2gff`
2. initialize docker with `start docker` or `systemctl start docker`
3. build the image `docker image build --tag bed2gff .`
4. run `docker run --rm -v "[dir_where_your_gtf_is]:/dir" bed2gff -b /dir/ -i /dir/ -o /dir/`
## Conda
to use bed2gff through Conda just:
1. `conda install bed2gff -c bioconda` or `conda create -n bed2gff -c bioconda bed2gff`
## Output
bed2gff will send the output directly to the same .bed file path if you specify so
```
bed2gff annotation.bed isoforms.txt output.gff
.
├── ...
├── isoforms.txt
├── annotation.bed
└── output.gff3
```
where `output.gff3` is the result.
## FAQ
### Why?
Converting formats is a daily practice in bioinformatics. This is way more common while working with gene annotations as tools differ in input/output layouts. GTF/GFF/BED are the most used structures to store gene-related annotations and the conversion needs are not well covered by available software.
A considerable portion of genomic tools reduce the software space by accepting GTF/GFF3 files only, directing BED users to translate their files into different formats. While some of this issues have already been covered (e.g. [bed2gtf](https://github.com/alejandrogzi/bed2gtf)) with GTF files, the GFF3 layout lacks stable converting tools (1, 2).
bed2gff is presented as a straightforward option to convert BED files into ready-to-use GFF3 files, closing that gap.
### How?
bed2gff, takes the base code of [bed2gtf](https://github.com/alejandrogzi/bed2gtf), that basically is the reimplementation of UCSC's C binaries merged in 1 step (bedToGenePred + genePredToGtf). This tool evaluates the position of exons and other features (CDS, stop/start, UTRs), preserving reading frames and adjusting the indexing count. The main approach now is a parallel algorithm that significantly reduces computation times.
Following the rationale of [bed2gtf](https://github.com/alejandrogzi/bed2gtf), bed2gff is able to produce a ready-to-use gff3 file by using an isoforms file, that works as the refTable in C binaries to map each transcript to their respective gene.
## References
1. https://bioinformatics.stackexchange.com/questions/2242/how-to-convert-bed-to-gff3
2. https://www.biostars.org/p/2/