{"id":22612223,"url":"https://github.com/ivanbgd/bioinf_demo","last_synced_at":"2025-03-28T23:44:36.884Z","repository":{"id":228903627,"uuid":"775227424","full_name":"ivanbgd/bioinf_demo","owner":"ivanbgd","description":"A Bioinformatics demo in Python working with FASTQ files and using the Modin library","archived":false,"fork":false,"pushed_at":"2024-03-21T01:58:18.000Z","size":28,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-03T10:11:56.417Z","etag":null,"topics":["bioinformatics","biopython","computational-biology","fastq","larger-than-memory","modin","python","python3","trie"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ivanbgd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-03-21T01:44:53.000Z","updated_at":"2024-03-21T02:00:39.000Z","dependencies_parsed_at":"2024-03-21T02:51:06.165Z","dependency_job_id":"0bf53c66-656a-4e52-8f9c-0e82e8af74ad","html_url":"https://github.com/ivanbgd/bioinf_demo","commit_stats":null,"previous_names":["ivanbgd/bioinf_demo"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ivanbgd%2Fbioinf_demo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ivanbgd%2Fbioinf_demo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ivanbgd%2Fbioinf_demo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ivanbgd%2Fbioinf_demo/ma
nifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ivanbgd","download_url":"https://codeload.github.com/ivanbgd/bioinf_demo/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246117690,"owners_count":20726068,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","biopython","computational-biology","fastq","larger-than-memory","modin","python","python3","trie"],"created_at":"2024-12-08T17:11:28.718Z","updated_at":"2025-03-28T23:44:36.864Z","avatar_url":"https://github.com/ivanbgd.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# A Bioinformatics Demo in Python\n\n## Description\n\n### The Goal\nThe goal was to write a new program without using existing tools to solve the following problem.  \nWe have a FASTQ file in the GZIP format and an adapter sequence list as the inputs.  \nWe would like to implement some filtering on the FASTQ file and output clean results also in the GZIP format.  \nThe filtering steps include poly-X detection and adapter search.  \nBesides the FASTQ output, we'd also like to report a count for each filter, such as how many reads were filtered by polyX \nor by an adapter.  \nWe wish to have a high-performance design based on multiple threads for I/O and for the sequence search algorithm.\n\n### The Filtering Process (in order)\n1. Search the sequence for a polyX whose length is less than or equal to 15, allowing 1 mismatch.\nIf poly-X is found, the read will be discarded.\n2. 
If the poly-X is not found, then search for the adapter whose length is less than or equal to 32 bp\nin the sequence without any mismatch.  \nIf the adapter is found, which means it is identical to a part of the sequence, the record is to be discarded.\n3. If the adapter is not found either, the read is considered clean and is written to the\nresult file.\n4. Count how many records are filtered out by poly-X and by an adapter, respectively.\n\n### Test data\n- Inputs: a FASTQ file (input.fq.gz) and an adapter sequence list (adapter.list)  \n- Outputs: a FASTQ file (out.fq.gz) and filter statistics (out.stat.txt)  \n\n### Glossary\n- [FASTQ](http://maq.sourceforge.net/fastq.shtml): Stores sequences and Phred qualities in a single file.  \n- polyX: Includes polyA, polyC, polyG, and polyT; for example: AAAAAAAAAAAAAAA, CCCCCCCCCCCCCC.\n- Adapter: A short DNA sequence that includes A/C/G/T bases which should be trimmed or discarded from a read.\n\n## Activating Virtual Environment\nOn Windows:  \n`.\\venv\\Scripts\\activate`\n\n## Running the program\nRun the Python interpreter and the program from the main project directory, named *bioinf_demo*.  \nThe main \"executable\" file is located inside the *bin* subdirectory.  \n`python bin/bioinf_demo.py`\n\n## Implementation\n- Has both Naive and Prefix Trie Matching\n- Has both sequential and parallel implementations\n  - [*Modin*](https://modin.readthedocs.io/en/stable/)\n\nTranslating Pandas code to Modin is **very** easy.\nIt preserves the order of records in the output file. \nModin can be used on a single computer (shared-memory model), or in a cluster (distributed-memory model).\nIt can read and process larger-than-memory files.\n\n## Remarks \u0026 Notes\n- FASTQ files are mostly used to store short-read data from high-throughput sequencing experiments. 
\n  The sequence and quality scores are usually each placed on a single line, and indeed many tools assume that\n  each record in a FASTQ file is exactly four lines long, even though this isn’t guaranteed.\n  [Source](https://bioinformatics.stackexchange.com/questions/14/what-is-the-difference-between-fasta-fastq-and-sam-file-formats).\n\n## Potential Improvements\n- [*biopython*](https://biopython.org/) supports [*PyPy*](https://www.pypy.org/) as documented on its\n  [PyPI](https://pypi.org/project/biopython/) page.\n- [*pyfastx*](https://pyfastx.readthedocs.io/en/latest/) didn't support Python 3.10 at the time of experimenting;\n  it failed to install.\n\n## Run Tests\n\n### Unit Tests\n\nThe `unittest` runner will run both unit tests and doctests if they exist.\n\n`python -m unittest`\n\nFor verbose output:\n\n`python -m unittest -v`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fivanbgd%2Fbioinf_demo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fivanbgd%2Fbioinf_demo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fivanbgd%2Fbioinf_demo/lists"}