{"id":19427131,"url":"https://github.com/anergictcell/atg","last_synced_at":"2025-04-24T17:31:21.626Z","repository":{"id":38299516,"uuid":"408701505","full_name":"anergictcell/atg","owner":"anergictcell","description":"A Rust library and CLI tool to handle genomic transcripts","archived":false,"fork":false,"pushed_at":"2025-03-07T17:05:34.000Z","size":486,"stargazers_count":4,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-17T12:57:07.288Z","etag":null,"topics":["bioinformatics","bioinformatics-tool","cli-app","converter","genomics","gtf","transcriptomics"],"latest_commit_sha":null,"homepage":"https://crates.io/crates/atg","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/anergictcell.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-09-21T05:43:36.000Z","updated_at":"2024-05-22T18:09:53.000Z","dependencies_parsed_at":"2024-04-08T19:46:19.433Z","dependency_job_id":"a367570b-1e9f-48a1-a125-34f814c1a383","html_url":"https://github.com/anergictcell/atg","commit_stats":null,"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anergictcell%2Fatg","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anergictcell%2Fatg/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anergictcell%2Fatg/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anergictcell%2Fatg/manifests"
,"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/anergictcell","download_url":"https://codeload.github.com/anergictcell/atg/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250674304,"owners_count":21469195,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","bioinformatics-tool","cli-app","converter","genomics","gtf","transcriptomics"],"created_at":"2024-11-10T14:10:31.840Z","updated_at":"2025-04-24T17:31:21.309Z","avatar_url":"https://github.com/anergictcell.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ATG\n\nConvert your genomic reference data between formats with a single tool. _ATG_ handles the conversion from and to GTF, GenePred(ext) and RefGene. You can generate bed files, fasta sequences or custom feature sequences. A single tool for all your conversions.\n\n| File format | Can be used as source | Can be created |\n| ----------- | ------------- | -------- |\n| GTF | Yes | Yes |\n| GenePred (extended) | Yes | Yes |\n| RefGene | Yes | Yes |\n| GenePred (simple) | No | Yes |\n| Bed | No | Yes |\n| Fasta | No | Yes (multiple options) |\n| SpliceAI gene annotation | No | Yes |\n| Quality Checks | No | Yes |\n\n\n**Reasons to use _ATG_**\n* No need to maintain multiple tools for one-way conversions (`gtfToGenePred`, `genePredToGtf`, etc). 
_ATG_ handles many formats and can convert in both directions.\n* Speed: _ATG_ is really fast - almost twice as fast as `gtfToGenePred`.\n* Robust parser: It handles GTF and GenePred with all extras according to spec.\n* Low memory footprint: It also runs on machines with little RAM.\n* Extra features, such as quality control and correctness checks.\n* Open for contributions: Any help to improve _ATG_ or to add more functionality is welcome.\n* You can also use _ATG_ as a library for your own Rust projects.\n\n\n## ATG command line tool\n\n### Install\nThere are currently three ways to install _ATG_:\n\n##### cargo\nThe easiest way to install _ATG_ is to use `cargo` (if you have `cargo` and `rust` installed)\n```bash\ncargo install atg\n```\n\n##### Pre-built binaries\nYou can download pre-built binaries for Linux and Mac from [GitHub](https://github.com/anergictcell/atg/releases).\n\n##### From source\nYou can also build _ATG_ from source (if you have the Rust toolchain installed):\n\n```bash\ngit clone https://github.com/anergictcell/atg.git\ncd atg\ncargo build --release\n```\n\n\n### Usage\nThe main CLI arguments are:\n- `-f`, `--from`: Specify the file format of the source (e.g. `gtf`, `genepredext`, `refgene`)\n- `-t`, `--to`: Specify the target file format (e.g. `gtf`, `genepred`, `bed`, `fasta` etc)\n- `-i`, `--input`: Path to source file. (Use `/dev/stdin` if you are using _atg_ in a pipe)\n- `-o`, `--output`: Path to target file. Existing files will be overwritten. (Use `/dev/stdout` if you are using _atg_ in a pipe)\n- `-v`, `-vv`, `-vvv`: Verbosity (info, debug, trace)\n- `-h`, `--help`: Print the help dialog with detailed usage instructions.\n\nAdditional, optional arguments:\n- `-g`, `--gtf-source`: Specify the source for GTF output files. Defaults to `atg`\n- `-r`, `--reference`: Path of a reference genome fasta file. Required for fasta output\n- `-c`, `--genetic-code`: Specify which genetic code to use for translating the transcripts. 
Genetic codes can be specified per chromosome by specifying the chromosome and the code, separated by `:` (e.g. `-c chrM:vertebrate mitochondrial`). They can also be specified for all chromosomes by omitting the chromosome (e.g. `-c vertebrate mitochondrial`). The argument can be specified multiple times (e.g. `-c \"standard\" -c \"chrM:vertebrate mitochondrial\" -c \"chrAYN:alternative yeast nuclear\"`). The code names are based on the `name` field from the [NCBI specs](https://www.ncbi.nlm.nih.gov/IEB/ToolBox/C_DOC/lxr/source/data/gc.prt) but in all lowercase characters. Alternatively, you can also specify the amino acid lookup table directly: `-c \"chrM:FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG\"`. Defaults to `standard`.\n- `-q`, `--qc-check`: Specify QC-checks for removing transcripts from the output\n\n#### Examples:\n```bash\n## Convert a GTF file to a RefGene file\natg --from gtf --to refgene --input /path/to/input.gtf --output /path/to/output.refgene\n\n## Convert a GTF file to a GenePred file\natg --from gtf --to genepred --input /path/to/input.gtf --output /path/to/output.genepred\n\n## Convert a GTF file to a GenePredExt file\natg --from gtf --to genepredext --input /path/to/input.gtf --output /path/to/output.genepredext\n\n## Convert RefGene to GTF\natg --from refgene --to gtf --input /path/to/input.refgene --output /path/to/output.gtf\n\n## Convert RefGene to bed\natg --from refgene --to bed --input /path/to/input.refgene --output /path/to/output.bed\n\n\n## Convert a GTF file to a RefGene file, remove all transcripts without proper start and stop codons\natg --from gtf --to refgene --input /path/to/input.gtf --output /path/to/output.refgene --qc-check start --qc-check stop --reference /path/to/fasta.fa\n```\n\n### Supported `--output` formats\n\n#### gtf\nOutput in [GTF](http://genome.ucsc.edu/FAQ/FAQformat.html#format4) format.\n\n```text\nchr9    ncbiRefSeq.2021-05-17   transcript  74526555    74600974    .   +   .   
gene_id \"C9orf85\"; transcript_id \"NM_001365057.2\";\nchr9    ncbiRefSeq.2021-05-17   exon        74526555    74526752    .   +   .   gene_id \"C9orf85\"; transcript_id \"NM_001365057.2\";\nchr9    ncbiRefSeq.2021-05-17   5UTR        74526555    74526650    .   +   .   gene_id \"C9orf85\"; transcript_id \"NM_001365057.2\";\nchr9    ncbiRefSeq.2021-05-17   CDS         74526651    74526752    .   +   0   gene_id \"C9orf85\"; transcript_id \"NM_001365057.2\";\nchr9    ncbiRefSeq.2021-05-17   exon        74561922    74562028    .   +   .   gene_id \"C9orf85\"; transcript_id \"NM_001365057.2\";\nchr9    ncbiRefSeq.2021-05-17   CDS         74561922    74562026    .   +   0   gene_id \"C9orf85\"; transcript_id \"NM_001365057.2\";\n...\n```\n\nYou can specify the value of the `source` column manually using the `--gtf-source`/`-g` option. Defaults to `atg`\n\n#### refgene\nOutput in the [refGene](http://rohsdb.cmb.usc.edu/GBshape/cgi-bin/hgTables?hgsid=583_AkEae6dMkhjf5kd9BxNksFo9ySiK\u0026hgta_doSchemaDb=mm10\u0026hgta_doSchemaTable=refGene) format, as used by some UCSC and NCBI RefSeq services \n\n```text\n0   NM_001101.5     chr7    -   5566778     5570232     5567378     5569288    6   5566778,5567634,5567911,5568791,5569165,5570154,    5567522,5567816,5568350,5569031,5569294,5570232,    0   ACTB    cmpl    cmpl    0,1,0,0,0,-1,\n0   NM_001203247.2  chr7    -   148504474   148581383   148504737   148544390  20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543561,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,    0   EZH2    cmpl    cmpl    2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,\n0   NM_001203248.2  chr7    -   148504474   148581383   148504737   148544390  20  
148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543588,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,    0   EZH2    cmpl    cmpl    2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,\n0   NM_001354750.2  chr11   +   113930432   114127487   113934022   114121277  7   113930432,113933932,114027058,114057673,114112888,114117919,114121047,  113930864,113935290,114027156,114057760,114113059,114118087,114127487,  0   ZBTB16  cmpl    cmpl    -1,0,2,1,1,1,1,\n```\n\n#### genepred(ext)\nOutput in the [GenePred(Ext)](http://genome.ucsc.edu/FAQ/FAQformat#format9) format, as used by some UCSC and NCBI RefSeq services \n\n**GenePred:**\n```text\nNM_001101.5     chr7    -   5566778     5570232     5567378     5569288     6   5566778,5567634,5567911,5568791,5569165,5570154,    5567522,5567816,5568350,5569031,5569294,5570232,\nNM_001203247.2  chr7    -   148504474   148581383   148504737   148544390   20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543561,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,\nNM_001203248.2  chr7    -   148504474   148581383   148504737   148544390   20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543588,148544273,148581255,    
148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,\nNM_001354750.2  chr11   +   113930432   114127487   113934022   114121277   7   113930432,113933932,114027058,114057673,114112888,114117919,114121047,  113930864,113935290,114027156,114057760,114113059,114118087,114127487,\n```\n\n**GenePredExt**\n```text\nNM_001101.5     chr7    -   5566778     5570232     5567378     5569288     6   5566778,5567634,5567911,5568791,5569165,5570154,    5567522,5567816,5568350,5569031,5569294,5570232,    0   ACTB    cmpl    cmpl    0,1,0,0,0,-1,\nNM_001203247.2  chr7    -   148504474   148581383   148504737   148544390   20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543561,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,    0   EZH2    cmpl    cmpl    2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,\nNM_001203248.2  chr7    -   148504474   148581383   148504737   148544390   20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543588,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,    0   EZH2    cmpl    cmpl    2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,\nNM_001354750.2  chr11   +   113930432   114127487   113934022   114121277   7   113930432,113933932,114027058,114057673,114112888,114117919,114121047,  
113930864,113935290,114027156,114057760,114113059,114118087,114127487,  0   ZBTB16  cmpl    cmpl    -1,0,2,1,1,1,1,\n```\n\n#### bed\nOutput in [bed](http://genome.ucsc.edu/FAQ/FAQformat#format1) format.\n\n```text\nchr7    5566778     5570232     ACTB:NM_001101.5       -   5567378    5569288    212,16,48   6   744,182,439,240,129,78  0,856,1133,2013,2387,3376\nchr11   113930432   114127487   ZBTB16:NM_001354750.2  +   113934022  114121277  212,16,48   7   432,1358,98,87,171,168,6440 0,3500,96626,127241,182456,187487,190615\nchr17   40852292    40897058    EZH1:NM_001321082.2    -   40854549   40880959   212,16,48   20  2318,85,81,82,96,179,126,41,92,197,181,92,164,103,177,121,129,128,91,30 0,2602,3465,4327,4813,5732,7683,8601,9571,12014,12934,17701,18179,18830,19998,22520,27360,28550,30553,44736\n```\n\n#### fasta\nWrites the cDNA sequence of all transcripts into one file. Please note that the sequence is stranded.\n\nThis target format requires a reference genome fasta file that must be specified using `--reference`/`-r`.\n\n*This output allows different `--fasta-format` options:*\n- `transcript`: The full transcript sequence (from the genomic start to end position, including introns)\n- `exons`: The cDNA sequence of the processed transcript, i.e. 
the sequence of all exons, including non-coding exons.\n- `cds` (default): The CDS of the transcript\n\n```text\n\u003eNM_007298.3 BRCA1\nATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGC\nTATGCAGAAAATCTTAGAGTGTCCCATCTGTCTGGAGTTGATCAAGGAAC\nCTGTCTCCACAAAGTGTGACCACATATTTTGCAAATTTTGCATGCTGAAA\nCTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGA\nTATAACCAAAAGGAGCCTACAAGAAAGTACGAGATTTAGTCAACTTGTTG\n...\n\u003eNM_001365057.2 C9orf85\nATGAGCTCCCAGAAAGGCAACGTGGCTCGTTCCAGACCTCAGAAGCACCA\nGAATACGTTTAGCTTCAAAAATGACAAGTTCGATAAAAGTGTGCAGACCA\nAGAAAATTAATGCAAAACTTCATGATGGAGTATGTCAGCGCTGTAAAGAA\nGTTCTTGAGTGGCGTGTAAAATACAGCAAATACAAACCATTATCAAAACC\nTAAAAAGTGA\n...\n```\n\n#### fasta-split\nLike `fasta` above, but one file for each transcript. Instead of an output file, you must specify an output directory, _ATG_ will save each transcript as `\u003cTranscript_name\u003e.fasta`, e.g.: `NM_001365057.2.fasta`.\n\nThis target format requires a reference genome fasta file that must be specified using `--reference`/`-r`.\n\n*This output allows different `--fasta-format` options:*\n- `transcript`: The full transcript sequence (from the genomic start to end position, including introns)\n- `exons`: The cDNA sequence of the processed transcript, i.e. 
the sequence of all exons, including non-coding exons.\n- `cds` (default): The CDS of the transcript\n\n#### feature-sequence\ncDNA sequence of each feature (5' UTR, CDS, 3'UTR), each in a separate row.\n\nThis target format requires a reference genome fasta file that must be specified using `--reference`/`-r`.\n\n```text\nBRCA1   NM_007298.3     chr17   41196311    41197694    -   3UTR    CTGCAGCCAGCCAC...\nBRCA1   NM_007298.3     chr17   41197694    41197819    -   CDS     CAATTGGGCAGATGTGTG...\nBRCA1   NM_007298.3     chr17   41199659    41199720    -   CDS     GGTGTCCACCCAATTGTG...\nBRCA1   NM_007298.3     chr17   41201137    41201211    -   CDS     ATCAACTGGAATGGATGG...\nBRCA1   NM_007298.3     chr17   41203079    41203134    -   CDS     ATCTTCAGGGGGCTAGAA...\nBRCA1   NM_007298.3     chr17   41209068    41209152    -   CDS     CATGATTTTGAAGTCAGA...\nBRCA1   NM_007298.3     chr17   41215349    41215390    -   CDS     GGGTGACCCAGTCTATTA...\nBRCA1   NM_007298.3     chr17   41215890    41215968    -   CDS     ATGCTGAGTTTGTGTGTG...\nBRCA1   NM_007298.3     chr17   41219624    41219712    -   CDS     ATGCTCGTGTACAAGTTT...\nBRCA1   NM_007298.3     chr17   41222944    41223255    -   CDS     AGGGAACCCCTTACCTGG...\nC9orf85 NM_001365057.2  chr9    74526555    74526650    +   5UTR    ATTGACAGAA...\nC9orf85 NM_001365057.2  chr9    74526651    74526752    +   CDS     ATGAGCTCCCAGAA...\nC9orf85 NM_001365057.2  chr9    74561922    74562028    +   CDS     AAAATTAATGCAAA...\nC9orf85 NM_001365057.2  chr9    74597573    74597573    +   CDS     A\nC9orf85 NM_001365057.2  chr9    74597574    74600974    +   3UTR    TGGAGTCTCC...\n```\n\n#### spliceai\nThis is a custom format useful for [SpliceAI](https://github.com/Illumina/SpliceAI)\nsplice predictions. 
The repo lists [example files](https://github.com/Illumina/SpliceAI/tree/master/spliceai/annotations).\nThe output has one gene per row, each gene record contains a consensus transcript, created by merging overlapping exons.\n\n```text\n#NAME       CHROM   STRAND  TX_START    TX_END  EXON_START      EXON_END\nOR4F5       1       +       69090       70008   69090,          70008,\nAL627309.1  1       -       134900      139379  134900,137620,  135802,139379,\n```\n\n#### qc\nRuns some basic consistency checks on the transcripts:\n\n| QC check | Explanation | Non-Coding vs Coding | requires Fasta File |\n| --- | --- | --- | --- |\n| Exon | Contains at least one exon | all | no |\n| Correct CDS Length | The length of the CDS is divisible by 3 | Coding | no |\n| Correct Start Codon | The CDS starts with `ATG` | Coding | yes |\n| Correct Stop Codon | The CDS ends with a Stop codon `TAG`, `TAA`, or `TGA` | Coding | yes |\n| No upstream Start Codon | The 5'UTR does not contain another start codon `ATG` (This test does not make sense biologically. It is totally fine for a transcript to have upstream `ATG` start codons that are not utilized by the ribosome.) | Coding | yes |\n| No upstream Stop Codon | The CDS does not contain another in-frame stop-codon | Coding | yes |\n| No Start codon | The full exon sequence does not contain a start codon `ATG` (Biologically speaking, a non-coding transcript could have `ATG` start codons that are not utilized) | Non-Coding | yes |\n| Correct Coordinates | The transcript is within the coordinates of the reference genome | all | yes |\n\n**Test results:**\n- `NA` Test could not be performed (e.g. 
CDS-length for non-coding transcripts), so no conclusion could be drawn\n- `OK` The test succeeded with an OK result\n- `NOK` The test failed and gave a NOT OK result\n\n```text\nGene     transcript     Exon  CDS Length  Correct Start Codon  Correct Stop Codon  No upstream Start Codon  No upstream Stop Codon  Correct Coordinates\nFAM239A  NR_146581.1    OK    N/A         N/A                  N/A                 OK                       N/A                     OK\nOR5H2    NM_001005482.1 OK    OK          OK                   OK                  OK                       OK                      OK\nSNX20    NM_001144972.2 OK    OK          OK                   OK                  NOK                      OK                      OK\n```\n\n#### raw\nThis is mainly useful for debugging, as it gives a quick glimpse into the Exons and CDS coordinates of the transcripts.\n\n#### bin\nSave Transcripts in _ATG_ binary format for faster re-reading.\n\n\n## ATG as library\n_ATG_ uses the _atglib_ library, which is documented inline and available on [docs.rs](https://docs.rs/atglib)\n\n\n## Known issues\n### GTF parsing\n- [ ] NM_001371720.1 has two book-ended exons (155160639-155161619 || 155161620-155162101). During input parsing, book-ended features are merged into one exon\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanergictcell%2Fatg","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fanergictcell%2Fatg","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanergictcell%2Fatg/lists"}