{"id":17652302,"url":"https://github.com/zyxue/gtf2csv","last_synced_at":"2025-05-07T07:33:24.741Z","repository":{"id":141945506,"uuid":"87253083","full_name":"zyxue/gtf2csv","owner":"zyxue","description":"Convert genome annotation GTF file into plain CSV format","archived":false,"fork":false,"pushed_at":"2018-10-03T22:11:32.000Z","size":1647,"stargazers_count":16,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-31T07:41:27.758Z","etag":null,"topics":["annotation","annotation-processing","csv","genomics","gff","gtf"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zyxue.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELog.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-04-05T01:42:10.000Z","updated_at":"2025-03-28T04:11:06.000Z","dependencies_parsed_at":"2023-03-13T10:27:12.145Z","dependency_job_id":null,"html_url":"https://github.com/zyxue/gtf2csv","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zyxue%2Fgtf2csv","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zyxue%2Fgtf2csv/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zyxue%2Fgtf2csv/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zyxue%2Fgtf2csv/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zyxue","download_url":"https://codeload.github.com/z
yxue/gtf2csv/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252833648,"owners_count":21811227,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotation","annotation-processing","csv","genomics","gff","gtf"],"created_at":"2024-10-23T11:46:30.088Z","updated_at":"2025-05-07T07:33:24.727Z","avatar_url":"https://github.com/zyxue.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GTF2CSV\n\nConvert GTF/GFF2 to CSV for your convenience, e.g. insert it into a database or\nload it into pandas dataframe for slicing and dicing.\n\n### Download \n\nI have converted multiple versions of gtf files for the human genome, and the\ngtf files across multiple species in Ensembl release 93 to csv files, which are\navailable at https://gitlab.com/zyxue/gtf2csv-csvs.\n\nExample:\n\nHere are the first few lines of converted [Homo_sapiens.GRCh38.93.csv.gz](./download/ensembl):\n\n| index | seqname | source | feature    | start | end   | score | strand | frame | ccds_id | exon_id         | exon_number | exon_version | gene_biotype                       | gene_id         | gene_name | gene_source | gene_version | protein_id | protein_version | tag:CCDS | tag:basic | tag:cds_end_NF | tag:cds_start_NF | tag:mRNA_end_NF | tag:mRNA_start_NF | tag:seleno | transcript_biotype   | transcript_id   | transcript_name | transcript_source | transcript_support_level | transcript_version 
|\n|-------|---------|--------|------------|-------|-------|-------|--------|-------|---------|-----------------|-------------|--------------|------------------------------------|-----------------|-----------|-------------|--------------|------------|-----------------|----------|-----------|----------------|------------------|-----------------|-------------------|------------|----------------------|-----------------|-----------------|-------------------|--------------------------|--------------------|\n| 0     | 1       | havana | gene       | 11869 | 14409 | .     | +      | .     |         |                 |             |              | transcribed_unprocessed_pseudogene | ENSG00000223972 | DDX11L1   | havana      | 5            |            |                 |          |           |                |                  |                 |                   |            |                      |                 |                 |                   |                          |                    |\n| 1     | 1       | havana | transcript | 11869 | 14409 | .     | +      | .     |         |                 |             |              | transcribed_unprocessed_pseudogene | ENSG00000223972 | DDX11L1   | havana      | 5            |            |                 |          | 1         |                |                  |                 |                   |            | processed_transcript | ENST00000456328 | DDX11L1-202     | havana            | 1                        | 2                  |\n| 2     | 1       | havana | exon       | 11869 | 12227 | .     | +      | .     
|         | ENSE00002234944 | 1           | 1            | transcribed_unprocessed_pseudogene | ENSG00000223972 | DDX11L1   | havana      | 5            |            |                 |          | 1         |                |                  |                 |                   |            | processed_transcript | ENST00000456328 | DDX11L1-202     | havana            | 1                        | 2                  |\n| 3     | 1       | havana | exon       | 12613 | 12721 | .     | +      | .     |         | ENSE00003582793 | 2           | 1            | transcribed_unprocessed_pseudogene | ENSG00000223972 | DDX11L1   | havana      | 5            |            |                 |          | 1         |                |                  |                 |                   |            | processed_transcript | ENST00000456328 | DDX11L1-202     | havana            | 1                        | 2                  |\n| 4     | 1       | havana | exon       | 13221 | 14409 | .     | +      | .     
|         | ENSE00002312635 | 3           | 1            | transcribed_unprocessed_pseudogene | ENSG00000223972 | DDX11L1   | havana      | 5            |            |                 |          | 1         |                |                  |                 |                   |            | processed_transcript | ENST00000456328 | DDX11L1-202     | havana            | 1                        | 2                  |\n\n### Install \u0026 Usage\n\nrequire python\u003e=3.6\n\n```\npip install git+https://github.com/zyxue/gtf2csv.git#egg=gtf2csv\n\ngtf2csv --gtf [gtf file]\n```\n\n```\ngtf2csv -h\nusage: gtf2csv [-h] -f GTF [-c CARDINALITY_CUTOFF] [-o OUTPUT] [-m {csv,pkl}]\n               [-t NUM_CPUS]\n\nConvert GTF file to plain csv\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -f GTF, --gtf GTF     the GTF file to convert\n  -c CARDINALITY_CUTOFF, --cardinality-cutoff CARDINALITY_CUTOFF\n                        for a tag that may appear multiple times in the\n                        attribute column (so-called multiplicity tag in this\n                        program), if its cardinality, i.e. the number of\n                        possibles values across all row, is lower than this\n                        cutoff, then it's a low-caridnaltiy tag, and each of\n                        its possible value would be transformed into a\n                        separate binary column. 
Otherwise, it is a high-\n                        cardinality tag and all of its values in one row would\n                        be simply concatenated to avoid making too many\n                        columns\n  -o OUTPUT, --output OUTPUT\n                        the output filename, if not specified, would just set\n                        it to be the same as the input but with extension\n                        replaced (gtf =\u003e csv)\n  -m {csv,pkl}, --output-format {csv,pkl}\n                        pkl means python pickle format, which would results in\n                        much faster IO (recommended)\n  -t NUM_CPUS, --num-cpus NUM_CPUS\n                        number of cpus for parallel processing, default to 1\n```\n\n### Comparison of multiple human gtf versions\n\nSee this notebook\n[Comparison-of-human-gtfs.ipynb](https://github.com/zyxue/gtf2csv/blob/master/notebooks/Comparison-of-human-gtfs.ipynb)\nfor details.\n\n**Number of protein coding genes**\n\nThis number has been relatively stable around 20k since early days.\n\n\u003cimg src=\"https://gitlab.com/zyxue/gtf2csv-csvs/raw/master/human/figs/num_protein_coding_genes.jpg\" alt width=\"100%\"\u003e\n\n\nDifferent colors indicate major genome update, i.e. 
GRCh36/hg18 (blue),\nGRCh37/hg19 (red), GRCh38/hg38 (yellow).\n\n\n**Number of protein coding transcripts**\n\nThe current number is about 80k, so on average a gene has about 4 protein coding\ntranscripts.\n\n\u003cimg src=\"https://gitlab.com/zyxue/gtf2csv-csvs/raw/master/human/figs/transcripts/protein_coding_transcripts.jpg\" alt width=\"100%\"\u003e\n\n\n**Number of lincRNA**\n\n\u003cimg src=\"https://gitlab.com/zyxue/gtf2csv-csvs/raw/master/human/figs/transcripts/lincRNA_transcripts.jpg\" alt width=\"100%\"\u003e\n\nAs seen, lincRNA was not annotated until around GRCh37.57 (2010-03 based on\nhttps://www.gencodegenes.org/releases/).\n\nFor plots of other available transcript types, please see\n[here](https://gitlab.com/zyxue/gtf2csv-csvs/tree/master/human/figs/transcripts).\n\n\n### Comparison of gtf files across different species\n\nHere is a scatter plot of the number of protein coding genes vs protein coding\ntranscripts for different species. Each dot is a species, but only the common\nones are labeled. For bar plots similar to those above, see\n[here](https://gitlab.com/zyxue/gtf2csv-csvs/tree/master/ensembl-release-93/figs/transcripts).\n\n\u003cimg src=\"https://gitlab.com/zyxue/gtf2csv-csvs/raw/master/ensembl-release-93/figs/num_protein_coding_genes_vs_transcripts.jpg\" alt width=\"100%\"\u003e\n\nDetails of plot generation can be found at\n[Comparison-of-gtfs-across-species.ipynb](https://github.com/zyxue/gtf2csv/blob/master/notebooks/Comparison-of-gtfs-across-species.ipynb).\n\n\n### Conversion strategy\n\nThe parsing of GTF is based on the GTF/GFF2 format specified at\nhttp://uswest.ensembl.org/info/website/upload/gff.html.\n\n**The key transformation steps**:\n\n1. Ignore all lines starting with `#`.\n2. Convert all columns but the attribute column to CSV.\n3. Deal with the attribute column.\n\nThe first two steps are straightforward. 
Note that GTF is tab-separated, so it\nis already very similar to a CSV file.\n\nThe attribute column is trickier to deal with. Each row of the\nattribute column contains a list of tag-value pairs. In principle, every tag\ncould form its own column. However, some tags can appear multiple times within\none row. A few such tags observed so far include:\n\n* `tag` tag as in [Ensembl human gtf files](ftp://ftp.ensembl.org/pub/release-93/gtf/homo_sapiens/)\n* `ont` tag as in [GENCODE human gtf files](https://www.gencodegenes.org/releases/current.html)\n* `ccds_id` as in [Ensembl for Mus_musculus related gtf files](ftp://ftp.ensembl.org/pub/release-93/gtf/mus_musculus_129s1svimj/)\n\nI call these tags multiplicity tags, and they are further classified\ninto two types depending on the number of possible unique values they have. For\nthose with a low number of possible values, thus low cardinality, each of their\npossible values is transformed into its own binary column under the name\n([tag]:[value]). For example, the following `tag` tags\n\n```\n... exon_id \"ENSE00001637883\"; tag \"cds_end_NF\"; tag \"mRNA_end_NF\";\n```\n\nwould be converted into values in two binary (1/0) columns with column names\n`tag:cds_end_NF` and `tag:mRNA_end_NF`.\n\nFor multiplicity tags with high cardinality (e.g. `ccds_id` with a cardinality\nover 20k), converting each value into its own column would result in too many\ncolumns and consume too much memory, so the values in one row are simply\nconcatenated. For example, the following entry\n\n```\n... 
ccds_id \"CCDS14805\"; ccds_id \"CCDS78538\"; ccds_id \"CCDS78539\"; ...\n```\n\nwould become `CCDS14805,CCDS78538,CCDS78539` under the `ccds_id` column.\n\nThe cutoff between high-cardinality and low-cardinality tags could be specified\nvia `-c/--cardinality-cutoff` parameter.\n\n\n### Other resources\n\nFor a complete list of tags: https://www.gencodegenes.org/gencode_tags.html\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzyxue%2Fgtf2csv","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzyxue%2Fgtf2csv","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzyxue%2Fgtf2csv/lists"}