# GTF2CSV

Convert GTF/GFF2 to CSV for your convenience, e.g. to insert it into a database or
load it into a pandas DataFrame for slicing and dicing.

### Download

I have converted multiple versions of gtf files for the human genome, and the
gtf files across multiple species in Ensembl release 93 to csv files, which are
available at https://gitlab.com/zyxue/gtf2csv-csvs.

Example:

Here are the first few lines of converted [Homo_sapiens.GRCh38.93.csv.gz](./download/ensembl):

| index | seqname | source | feature | start | end | score | strand | frame | ccds_id | exon_id | exon_number | exon_version | gene_biotype | gene_id | gene_name | gene_source | gene_version | protein_id | protein_version | tag:CCDS | tag:basic | tag:cds_end_NF | tag:cds_start_NF | tag:mRNA_end_NF | tag:mRNA_start_NF | tag:seleno | transcript_biotype | transcript_id | transcript_name | transcript_source | transcript_support_level | transcript_version |
|-------|---------|--------|------------|-------|-------|-------|--------|-------|---------|-----------------|-------------|--------------|------------------------------------|-----------------|-----------|-------------|--------------|------------|-----------------|----------|-----------|----------------|------------------|-----------------|-------------------|------------|----------------------|-----------------|-----------------|-------------------|--------------------------|--------------------|
| 0 | 1 | havana | gene | 11869 | 14409 | . | + | . | | | | | transcribed_unprocessed_pseudogene | ENSG00000223972 | DDX11L1 | havana | 5 | | | | | | | | | | | | | | | |
| 1 | 1 | havana | transcript | 11869 | 14409 | . | + | . | | | | | transcribed_unprocessed_pseudogene | ENSG00000223972 | DDX11L1 | havana | 5 | | | | 1 | | | | | | processed_transcript | ENST00000456328 | DDX11L1-202 | havana | 1 | 2 |
| 2 | 1 | havana | exon | 11869 | 12227 | . | + | . | | ENSE00002234944 | 1 | 1 | transcribed_unprocessed_pseudogene | ENSG00000223972 | DDX11L1 | havana | 5 | | | | 1 | | | | | | processed_transcript | ENST00000456328 | DDX11L1-202 | havana | 1 | 2 |
| 3 | 1 | havana | exon | 12613 | 12721 | . | + | . | | ENSE00003582793 | 2 | 1 | transcribed_unprocessed_pseudogene | ENSG00000223972 | DDX11L1 | havana | 5 | | | | 1 | | | | | | processed_transcript | ENST00000456328 | DDX11L1-202 | havana | 1 | 2 |
| 4 | 1 | havana | exon | 13221 | 14409 | . | + | . | | ENSE00002312635 | 3 | 1 | transcribed_unprocessed_pseudogene | ENSG00000223972 | DDX11L1 | havana | 5 | | | | 1 | | | | | | processed_transcript | ENST00000456328 | DDX11L1-202 | havana | 1 | 2 |
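
As a rough illustration of the "slicing and dicing" a converted file allows, here is a minimal sketch that loads such a CSV with pandas. The in-memory text is a hand-made stand-in holding only a subset of the columns shown above; with a real download you would pass the file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for a converted CSV: a few rows and a subset of the columns above.
csv_text = """seqname,source,feature,start,end,strand,gene_id,gene_name,transcript_id,exon_number
1,havana,gene,11869,14409,+,ENSG00000223972,DDX11L1,,
1,havana,transcript,11869,14409,+,ENSG00000223972,DDX11L1,ENST00000456328,
1,havana,exon,11869,12227,+,ENSG00000223972,DDX11L1,ENST00000456328,1
1,havana,exon,12613,12721,+,ENSG00000223972,DDX11L1,ENST00000456328,2
1,havana,exon,13221,14409,+,ENSG00000223972,DDX11L1,ENST00000456328,3
"""

# Read seqname as a string, since sequence names mix numbers and letters (1, X, MT).
df = pd.read_csv(io.StringIO(csv_text), dtype={"seqname": str})

# Select the exons of one transcript; GTF coordinates are 1-based and inclusive,
# so the length of a feature is end - start + 1.
is_exon = df["feature"] == "exon"
is_target = df["transcript_id"] == "ENST00000456328"
exons = df[is_exon & is_target].copy()
exons["length"] = exons["end"] - exons["start"] + 1

print(exons[["start", "end", "length"]])
```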

### Install & Usage

Requires Python >= 3.6.

```
pip install git+https://github.com/zyxue/gtf2csv.git#egg=gtf2csv

gtf2csv --gtf [gtf file]
```

```
gtf2csv -h
usage: gtf2csv [-h] -f GTF [-c CARDINALITY_CUTOFF] [-o OUTPUT] [-m {csv,pkl}]
               [-t NUM_CPUS]

Convert GTF file to plain csv

optional arguments:
  -h, --help            show this help message and exit
  -f GTF, --gtf GTF     the GTF file to convert
  -c CARDINALITY_CUTOFF, --cardinality-cutoff CARDINALITY_CUTOFF
                        for a tag that may appear multiple times in the
                        attribute column (so-called multiplicity tag in this
                        program), if its cardinality, i.e. the number of
                        possibles values across all row, is lower than this
                        cutoff, then it's a low-caridnaltiy tag, and each of
                        its possible value would be transformed into a
                        separate binary column. Otherwise, it is a high-
                        cardinality tag and all of its values in one row would
                        be simply concatenated to avoid making too many
                        columns
  -o OUTPUT, --output OUTPUT
                        the output filename, if not specified, would just set
                        it to be the same as the input but with extension
                        replaced (gtf => csv)
  -m {csv,pkl}, --output-format {csv,pkl}
                        pkl means python pickle format, which would results in
                        much faster IO (recommended)
  -t NUM_CPUS, --num-cpus NUM_CPUS
                        number of cpus for parallel processing, default to 1
```

### Comparison of multiple human gtf versions

See this notebook
[Comparison-of-human-gtfs.ipynb](https://github.com/zyxue/gtf2csv/blob/master/notebooks/Comparison-of-human-gtfs.ipynb)
for details.

**Number of protein coding genes**

This number has been relatively stable at around 20k since the early releases.

Different colors indicate major genome updates, i.e. GRCh36/hg18 (blue),
GRCh37/hg19 (red), GRCh38/hg38 (yellow).

**Number of protein coding transcripts**

The current number is about 80k, so on average a gene has about 4 protein-coding
transcripts.
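
The ratio above can be recomputed from a converted file along these lines. This is only a sketch on toy rows; it assumes the Ensembl-style column names `feature`, `gene_biotype` and `transcript_biotype` from the example table, which may differ for other GTF sources.

```python
import pandas as pd

# Toy rows standing in for a converted GTF; only the columns used here are included.
df = pd.DataFrame(
    [
        ("gene", "protein_coding", None),
        ("transcript", "protein_coding", "protein_coding"),
        ("transcript", "protein_coding", "protein_coding"),
        ("transcript", "protein_coding", "retained_intron"),
        ("gene", "lincRNA", None),
        ("transcript", "lincRNA", "lincRNA"),
    ],
    columns=["feature", "gene_biotype", "transcript_biotype"],
)

# Count gene rows by gene_biotype and transcript rows by transcript_biotype.
is_pc_gene = (df["feature"] == "gene") & (df["gene_biotype"] == "protein_coding")
is_pc_transcript = (df["feature"] == "transcript") & (
    df["transcript_biotype"] == "protein_coding"
)

n_genes = int(is_pc_gene.sum())
n_transcripts = int(is_pc_transcript.sum())
print(n_genes, n_transcripts, n_transcripts / n_genes)
```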

**Number of lincRNA**

As seen, lincRNA was not annotated until around GRCh37.57 (2010-03 based on
https://www.gencodegenes.org/releases/).

For plots of other available transcript types, please see
[here](https://gitlab.com/zyxue/gtf2csv-csvs/tree/master/human/figs/transcripts).

### Comparison of gtf files across different species

Here is a scatter plot of the number of protein coding genes vs protein coding
transcripts for different species. Each dot is a species, but only the common
ones are labeled. For bar plots similar to those above, see
[here](https://gitlab.com/zyxue/gtf2csv-csvs/tree/master/ensembl-release-93/figs/transcripts).

Details of plot generation can be found at
[Comparison-of-gtfs-across-species.ipynb](https://github.com/zyxue/gtf2csv/blob/master/notebooks/Comparison-of-gtfs-across-species.ipynb).

### Conversion strategy

The parsing of GTF is based on the GTF/GFF2 format specified at
http://uswest.ensembl.org/info/website/upload/gff.html.

**The key transformation steps**:

1. Ignore all lines starting with `#`.
2. Convert all columns but the attribute column to csv.
3. Deal with the attribute column.

The first two steps are straightforward. Note that GTF is tab-separated, so it
is very similar to a csv file.
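
The first two steps can be sketched with the standard library alone. This is an illustration rather than the code gtf2csv itself runs; the column names follow the GTF/GFF2 specification, and `read_gtf_rows` is a hypothetical helper.

```python
import csv
import io

GTF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
]

# A two-line stand-in for a GTF file: one comment line and one record.
gtf_text = (
    "#!genome-build GRCh38.p12\n"
    '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; gene_version "5";\n'
)


def read_gtf_rows(handle):
    """Yield one dict per GTF record, skipping lines that start with '#'."""
    lines = (line for line in handle if not line.startswith("#"))
    # QUOTE_NONE keeps the double quotes inside the attribute column untouched.
    yield from csv.DictReader(
        lines, fieldnames=GTF_COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )


out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(GTF_COLUMNS[:-1])  # every column except "attribute"
for row in read_gtf_rows(io.StringIO(gtf_text)):
    writer.writerow([row[name] for name in GTF_COLUMNS[:-1]])

print(out.getvalue())
```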

The attribute column is a bit trickier to deal with. Each row of the
attribute column contains a list of tag-value pairs. In principle, every tag
could form its own column. However, some tags can appear multiple times within
one row. A few such tags observed so far include:

* `tag` tag as in [Ensembl human gtf files](ftp://ftp.ensembl.org/pub/release-93/gtf/homo_sapiens/)
* `ont` tag as in [GENCODE human gtf files](https://www.gencodegenes.org/releases/current.html)
* `ccds_id` as in [Ensembl for Mus_musculus related gtf files](ftp://ftp.ensembl.org/pub/release-93/gtf/mus_musculus_129s1svimj/)
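
A minimal sketch of how repeated tags could be detected: split each attribute string into tag-value pairs and flag any tag that occurs more than once within a row. The regular expression assumes every value is double-quoted, which holds for the examples in this README but not necessarily for every GTF source; both function names are hypothetical.

```python
import re
from collections import Counter

# Matches `tag "value";` pairs. Assumption: all values are double-quoted.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(attribute):
    """Return the (tag, value) pairs of one attribute string, in order."""
    return ATTR_RE.findall(attribute)


def find_repeated_tags(attributes):
    """Return the set of tags that occur more than once within any single row."""
    repeated = set()
    for attribute in attributes:
        counts = Counter(tag for tag, _ in parse_attributes(attribute))
        repeated.update(tag for tag, n in counts.items() if n > 1)
    return repeated


rows = [
    'gene_id "ENSG00000223972"; gene_name "DDX11L1";',
    'exon_id "ENSE00001637883"; tag "cds_end_NF"; tag "mRNA_end_NF";',
]
print(parse_attributes(rows[1]))
print(find_repeated_tags(rows))
```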

I call these tags multiplicity tags, and they are further classified into two
types depending on the number of possible unique values they have. For those
with a low number of possible values, thus low cardinality, each possible value
is transformed into its own binary column under the name ([tag]:[value]). For
example, consider the following `tag` tags:

```
... exon_id "ENSE00001637883"; tag "cds_end_NF"; tag "mRNA_end_NF";
```

They would be converted into values in two binary (1/0) columns with column names
`tag:cds_end_NF` and `tag:mRNA_end_NF`.
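
The expansion into binary columns might look like the sketch below. `binary_columns` is a hypothetical helper, and `possible_values` is a made-up list; in practice the values would be collected from the whole file. Absent values are written as 0 here, which is an assumption about the encoding.

```python
import re

# Matches `tag "value";` pairs. Assumption: all values are double-quoted.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def binary_columns(attribute, tag, possible_values):
    """Map one attribute string to {"tag:value": 0 or 1} for a low-cardinality tag."""
    present = {v for t, v in ATTR_RE.findall(attribute) if t == tag}
    return {f"{tag}:{value}": int(value in present) for value in possible_values}


# Hypothetical value list for illustration only.
possible_values = ["basic", "cds_end_NF", "mRNA_end_NF"]
attribute = 'exon_id "ENSE00001637883"; tag "cds_end_NF"; tag "mRNA_end_NF";'

columns = binary_columns(attribute, "tag", possible_values)
print(columns)
```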

For multiplicity tags with high cardinality (e.g. `ccds_id` with a cardinality
over 20k), converting each value into its own column would result in too many
columns and consume too much memory, so the values are simply concatenated. For
example, the following entry

```
... ccds_id "CCDS14805"; ccds_id "CCDS78538"; ccds_id "CCDS78539"; ...
```

would become `CCDS14805,CCDS78538,CCDS78539` under the `ccds_id` column.
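
A minimal sketch of the concatenation, again assuming double-quoted values; `concatenated_value` is a hypothetical name, and only the `ccds_id` part of the entry is reproduced.

```python
import re

# Matches `tag "value";` pairs. Assumption: all values are double-quoted.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def concatenated_value(attribute, tag, sep=","):
    """Join all values of `tag` in one attribute string, keeping their order."""
    values = [v for t, v in ATTR_RE.findall(attribute) if t == tag]
    return sep.join(values)


attribute = 'ccds_id "CCDS14805"; ccds_id "CCDS78538"; ccds_id "CCDS78539";'
joined = concatenated_value(attribute, "ccds_id")
print(joined)
```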

The cutoff between high-cardinality and low-cardinality tags can be specified
via the `-c/--cardinality-cutoff` parameter.
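
How the cutoff separates the two kinds of tags can be sketched as follows. `classify_tags` is a hypothetical helper, and the cutoff of 3 is chosen only to fit the toy data; a tag counts as low-cardinality when its number of distinct values is lower than the cutoff, as described in the help text.

```python
import re
from collections import defaultdict

# Matches `tag "value";` pairs. Assumption: all values are double-quoted.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def classify_tags(attributes, tags, cutoff):
    """Label each tag in `tags` as "low" or "high" cardinality.

    Cardinality is the number of distinct values seen across all rows.
    """
    values = defaultdict(set)
    for attribute in attributes:
        for tag, value in ATTR_RE.findall(attribute):
            if tag in tags:
                values[tag].add(value)
    return {
        tag: "low" if len(seen) < cutoff else "high"
        for tag, seen in values.items()
    }


rows = [
    'exon_id "ENSE00001637883"; tag "cds_end_NF"; tag "mRNA_end_NF";',
    'ccds_id "CCDS14805"; ccds_id "CCDS78538"; ccds_id "CCDS78539";',
]
labels = classify_tags(rows, {"tag", "ccds_id"}, cutoff=3)
print(labels)
```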

### Other resources

For a complete list of tags: https://www.gencodegenes.org/gencode_tags.html