{"id":22914883,"url":"https://github.com/cdcgov/clean-genes","last_synced_at":"2026-03-01T17:35:55.862Z","repository":{"id":266752347,"uuid":"899202998","full_name":"CDCgov/clean-genes","owner":"CDCgov","description":"A rust crate that automatically cleans up a gene alignment by trimming to ORF and identifying and/or removing problematic sequences.","archived":false,"fork":false,"pushed_at":"2025-02-26T14:59:45.000Z","size":190,"stargazers_count":1,"open_issues_count":5,"forks_count":2,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-10-12T00:58:13.620Z","etag":null,"topics":["bioinformatics","cdc-influenza-division","data-cleaning","data-normalisation","data-normalization","data-science","fasta","ncird","ncird-id","sequence-alignment","sequence-analysis","sequence-annotation"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CDCgov.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-12-05T20:09:48.000Z","updated_at":"2025-02-08T01:35:03.000Z","dependencies_parsed_at":"2025-02-06T15:26:52.229Z","dependency_job_id":"33f12e25-d5fb-4b0b-92cf-443d6452a555","html_url":"https://github.com/CDCgov/clean-genes","commit_stats":null,"previous_names":["cdcgov/clean-genes"],"tags_count":0,"template":false,"template_full_name":"CDCgov/template","purl":"pkg:github/CDCgov/clean-genes","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CDCgov%2Fclean-genes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/CDCgov%2Fclean-genes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CDCgov%2Fclean-genes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CDCgov%2Fclean-genes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CDCgov","download_url":"https://codeload.github.com/CDCgov/clean-genes/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CDCgov%2Fclean-genes/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29976279,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-01T16:35:47.903Z","status":"ssl_error","status_checked_at":"2026-03-01T16:35:44.899Z","response_time":124,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","cdc-influenza-division","data-cleaning","data-normalisation","data-normalization","data-science","fasta","ncird","ncird-id","sequence-alignment","sequence-analysis","sequence-annotation"],"created_at":"2024-12-14T05:17:31.086Z","updated_at":"2026-03-01T17:35:55.823Z","avatar_url":"https://github.com/CDCgov.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# clean-genes overview\n\n## Project goal\nTo automatically clean up a gene alignment by trimming to open reading frame (ORF) and 
identifying and/or removing problematic sequences. In short, it is an alignment normalization tool that expects genes as input and could be especially valuable for influenza genes. This is intended to be a Rust crate designed to do a lot of work as quickly as possible.\n\n## Project title\nCurrent proposal: clean-genes\n\nJustification:\n1. The name is available both for software in general and as a Rust crate on crates.io\n2. Cleaning is what the software does\n3. Genes are the expected input\n\n## Motivation\n\nI have done this work more or less manually before; it takes a lot of time, is hard to explain, and is very susceptible to human error. I did this work most often when I started working with influenza at USDA-ARS. Because influenza genomes are small and a deep study can involve a great number of viruses, this work was frequently necessary.\n\n## Project plan\n\nI intend to build the software in a modular fashion such that it can:\n1. Be built in smaller pieces to demonstrate value iteratively\n2. Be utilized by people with different needs from what I'm familiar with\n3. Run more efficiently when the user does not want or need all available functionality\n\n### Potential modules\n- ORF trimming: Determines the reading frame based on stop-codon frequency and conservation. Optionally trims the sequence using these determined reading frames.\n- Frameshift ID: Identifies sequences with frameshift mutations based on stop-codon frequency and conservation. Optionally removes these sequences and prompts the user to re-align and run this software again.\n- Poor Quality Filter: Identifies sequences with an abundance of missing data, overly short length, or significant gaps. Optionally removes these sequences.\n- Taxonomic Outlier ID: Identifies sequences that don't belong in the alignment taxonomically. Could potentially use GC content, k-mers, and/or sequence similarity. I am resistant to using reference sequences for this, but it's not impossible. 
Optionally removes these sequences.\n- Codon Optimized ID: Identifies codon-optimized sequences, such as those created for vaccines, based on codon distribution in the alignment and/or an understanding of what a normal codon distribution is. Optionally removes these sequences.\n- Gap Groups: Identifies groups of sequences with shared indel patterns. Also identifies outliers. Optionally outputs these sequences in separate FASTA files excluding the common gaps.\n\nWhen used together, the removal steps are no longer optional. For ID-only use, the modules must be run one at a time.\n\nAlignment would likely be done in MAFFT. MAFFT is fairly fast and consistently produces high-quality results using default parameters.\n\n### Phase 1\nBuild the basic user interface and the ORF trimming module.\n- ORF trimming\n\n### Phase 2\nAdd the Poor Quality Filter module.\n\nworkflow if all are included:\n1. Alignment (if needed)\n2. Poor Quality Filter\n3. Re-alignment\n4. ORF trimming\n\n### Phase 3\nAdd the Gap Groups module\n\nworkflow if all are included:\n1. Alignment (if needed)\n2. Poor Quality Filter\n3. Re-alignment\n4. Gap Groups\n5. ORF trimming\n\n### Phase 4\nAdd the Frameshift ID module\n\nworkflow if all are included:\n1. Alignment (if needed)\n2. Poor Quality Filter\n3. Re-alignment\n4. Gap Groups\n5. ORF trimming\n6. Frameshift ID\n7. Re-alignment\n\n### Phase 5\nAdd the Codon Optimized ID module\n\nworkflow if all are included:\n1. Alignment (if needed)\n2. Poor Quality Filter\n3. Re-alignment\n4. Gap Groups\n5. ORF trimming\n6. Frameshift ID\n7. Re-alignment\n8. Codon Optimized ID\n9. Re-alignment\n\n### Phase 6\nAdd the Taxonomic Outlier module\n\nworkflow if all are included:\n1. Alignment (if needed)\n2. Poor Quality Filter\n3. Re-alignment\n4. Gap Groups\n5. ORF trimming\n6. Frameshift ID\n7. Re-alignment\n8. Codon Optimized ID\n9. Re-alignment\n10. Taxonomic Outlier\n11. Re-alignment\n\n\n### Timeline\nThe time required for software development is very hard to predict. 
Things like VCM, holidays, unforeseen challenges, and conferences would delay the schedule. Given that I am not solely focused on this project and it is the holiday season, I believe I can complete Phase 1 by the end of December. Optimization for performance may or may not be done before Phase 2, but correctness will be ensured.\n\nPhase 2 will likely be a little harder than Phase 1, due to multiple tasks being completed in one module as well as integration of alignment, and could take a couple of months.\n\nPhase 3 should be similar in complexity to Phase 1, taking about a month.\n\nPhase 4 seems like a bit of a harder problem and could take a couple of months.\n\nPhases 5 and 6 will be very challenging and I am hesitant to put a timeline on them at all.\n\nUnit tests, regression tests, benchmarking, and fuzzing will all be used to ensure correctness and maximize performance.\n\nBetween each phase I intend to discuss my progress with my supervisor Brian Lee, my Rust mentor Sam Shepard, and other interested colleagues, including Allen Kim, who has previously expressed interest in contributing to this project.\n\n## Notices\n\n### Public Domain Standard Notice\nThis repository constitutes a work of the United States Government and is not\nsubject to domestic copyright protection under 17 USC § 105. This repository is in\nthe public domain within the United States, and copyright and related rights in\nthe work worldwide are waived through the [CC0 1.0 Universal public domain dedication](https://creativecommons.org/publicdomain/zero/1.0/).\nAll contributions to this repository will be released under the CC0 dedication. 
By\nsubmitting a pull request you are agreeing to comply with this waiver of\ncopyright interest.\n\n### License Standard Notice\nThe repository utilizes code licensed under the terms of the Apache Software\nLicense and therefore is licensed under ASL v2 or later.\n\nThis source code in this repository is free: you can redistribute it and/or modify it under\nthe terms of the Apache Software License version 2, or (at your option) any\nlater version.\n\nThis source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY\nWARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A\nPARTICULAR PURPOSE. See the Apache Software License for more details.\n\nYou should have received a copy of the Apache Software License along with this\nprogram. If not, see http://www.apache.org/licenses/LICENSE-2.0.html\n\nThe source code forked from other open source projects will inherit its license.\n\n### Privacy Standard Notice\nThis repository contains only non-sensitive, publicly available data and\ninformation. All material and community participation is covered by the\n[Disclaimer](DISCLAIMER.md)\nand [Code of Conduct](code-of-conduct.md).\nFor more information about CDC's privacy policy, please visit [http://www.cdc.gov/other/privacy.html](https://www.cdc.gov/other/privacy.html).\n\n### Contributing Standard Notice\nAnyone is encouraged to contribute to the repository by [forking](https://help.github.com/articles/fork-a-repo)\nand submitting a pull request. (If you are new to GitHub, you might start with a\n[basic tutorial](https://help.github.com/articles/set-up-git).) 
By contributing\nto this project, you grant a world-wide, royalty-free, perpetual, irrevocable,\nnon-exclusive, transferable license to all users under the terms of the\n[Apache Software License v2](http://www.apache.org/licenses/LICENSE-2.0.html) or\nlater.\n\nAll comments, messages, pull requests, and other submissions received through\nCDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at [http://www.cdc.gov/other/privacy.html](http://www.cdc.gov/other/privacy.html).\n\n### Records Management Standard Notice\nThis repository is not a source of government records, but is a copy to increase\ncollaboration and collaborative potential. All government records will be\npublished through the [CDC web site](http://www.cdc.gov).\n\n### Additional Standard Notices\nPlease refer to [CDC's Template Repository](https://github.com/CDCgov/template) for more information about [contributing to this repository](https://github.com/CDCgov/template/blob/main/CONTRIBUTING.md), [public domain notices and disclaimers](https://github.com/CDCgov/template/blob/main/DISCLAIMER.md), and [code of conduct](https://github.com/CDCgov/template/blob/main/code-of-conduct.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcdcgov%2Fclean-genes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcdcgov%2Fclean-genes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcdcgov%2Fclean-genes/lists"}