{"id":24745807,"url":"https://github.com/bibymaths/sars_genome_assembly","last_synced_at":"2025-03-23T00:22:05.404Z","repository":{"id":197806571,"uuid":"697021623","full_name":"bibymaths/sars_genome_assembly","owner":"bibymaths","description":"A complete pipeline for assembling and analyzing the SARS-CoV-2 genome from Illumina paired-end reads, featuring quality control, mapping, variant calling, consensus sequence generation, lineage annotation, and phylogenetic analysis. Designed for efficient and modular bioinformatics workflows.","archived":false,"fork":false,"pushed_at":"2025-01-18T20:58:44.000Z","size":6674,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-28T03:30:53.718Z","etag":null,"topics":["genome-assembly","illumina","sars-cov-2","variantcalling"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bibymaths.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-26T22:28:39.000Z","updated_at":"2025-01-18T21:11:41.000Z","dependencies_parsed_at":"2025-01-18T21:34:06.325Z","dependency_job_id":null,"html_url":"https://github.com/bibymaths/sars_genome_assembly","commit_stats":null,"previous_names":["bibymaths/sarscov2"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bibymaths%2Fsars_genome_assembly","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bibymaths%2Fsars_genome_assembly/t
ags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bibymaths%2Fsars_genome_assembly/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bibymaths%2Fsars_genome_assembly/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bibymaths","download_url":"https://codeload.github.com/bibymaths/sars_genome_assembly/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245037437,"owners_count":20550871,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["genome-assembly","illumina","sars-cov-2","variantcalling"],"created_at":"2025-01-28T03:29:47.987Z","updated_at":"2025-03-23T00:22:05.358Z","avatar_url":"https://github.com/bibymaths.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Overview\nThis repository provides a complete pipeline for assembling and analyzing the genome of SARS-CoV-2 using Illumina paired-end sequencing data. 
It includes steps for quality control, mapping, variant calling, primer clipping, consensus sequence generation, lineage annotation, and phylogenetic analysis.\n\n### Key Features\n- Automated environment setup using `mamba` and `conda`\n- Comprehensive quality control with `fastqc` and `fastp`\n- Mapping and visualization using `minimap2`, `samtools`, and `IGV`\n- Primer sequence clipping for clean alignments\n- Variant calling with `freebayes` and VCF filtering with `vcfR`\n- Consensus sequence generation and lineage assignment with `pangolin`\n- Phylogenetic analysis and multiple sequence alignment with `mafft` and `iqtree`\n- Clear documentation and modular structure\n\n## System Requirements\n- **Operating System**: Linux (tested on Fedora 38)\n- **Processor**: Intel i5 or equivalent, with multithreading support\n- **Memory**: Minimum 8 GB\n- **Software**: Anaconda/Miniconda, mamba, R, and the listed bioinformatics tools\n\n## Dependencies\nThe pipeline requires the following tools, managed via `mamba`:\n- QC: `fastqc`, `fastp`, `multiqc`\n- Mapping: `minimap2`, `samtools`, `bamclipper`\n- Variant Calling: `freebayes`, `vcftools`, `bcftools`\n- Sequence Analysis: `vcfR`, `mafft`, `iqtree`, `pangolin`\n- Visualization: `gnuplot`, `IGV`, `jalview`\n\n## Installation\n1. Clone the repository:\n   ```bash\n   git clone https://github.com/bibymaths/sars_genome_assembly.git\n   cd sars_genome_assembly\n   ```\n2. Install `mamba` and create the environment:\n   ```bash\n   wget \"https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh\"\n   bash Mambaforge-Linux-x86_64.sh\n   conda update -y conda\n   mamba env create -p ./envs/projectSARS --file environment.yaml\n   mamba activate ./envs/projectSARS\n   ```\n\n## Pipeline Workflow\n1. **Environment Setup**: Install dependencies and configure the environment.\n2. **Data Preparation**: Download input datasets and reference genomes.\n3. 
**Quality Control**: Evaluate and preprocess raw sequencing reads.\n4. **Mapping**: Align reads to the reference genome.\n5. **Primer Clipping**: Remove primer sequences from alignments.\n6. **Variant Calling**: Identify variants in the genome.\n7. **Filtering \u0026 Masking**: Use an R script for QC and filtering of VCF files.\n8. **Consensus Generation**: Generate consensus sequences from filtered variants.\n9. **Lineage Annotation**: Assign SARS-CoV-2 lineages using `pangolin`.\n10. **Phylogenetic Analysis**: Perform multiple sequence alignment and build phylogenetic trees.\n\n## Input Data\n- Illumina paired-end sequencing data\n- SARS-CoV-2 reference genome (NCBI accession: NC_045512.2)\n\n## Output\n- Quality control reports (`.html`, `.json`)\n- Aligned sequences in BAM and VCF formats\n- Consensus sequences in FASTA format\n- Lineage annotations\n- Phylogenetic trees and visualizations\n\n## Usage\n1. Edit the `config.sh` file to specify input data paths and parameters.\n2. Run the pipeline:\n   ```bash\n   bash scripts/run_pipeline.sh\n   ```\n3. View results in the `results/` directory.\n\n## References\nThis pipeline builds on the work of numerous bioinformatics tools and methodologies. Key references include:\n- Heng Li, Minimap2: pairwise alignment for nucleotide sequences, Bioinformatics, 2018\n- Petr Danecek, James K Bonfield, et al., SAMtools and BCFtools, GigaScience, 2021\n- Áine O'Toole, Anthony Underwood, et al., Pangolin tool, Virus Evolution, 2021\n- Katoh, K., \u0026 Standley, D. M., MAFFT multiple sequence alignment software, Molecular Biology and Evolution, 2013\n\n## Acknowledgments\nThis project was developed as part of the SARS-2 Bioinformatics \u0026 Data Science course by the Freie Universität Berlin and the Robert Koch Institute. Special thanks to Max von Kleist and Martin Hölzer for their guidance.\n\n## License\nThis project is licensed under the MIT License. 
See the `LICENSE` file for details.\n\n## Contact\nFor questions or issues, please contact:\n- **Abhinav Mishra**\n- Email: mishraabhinav36@gmail.com\n\n![alt_text](methodSARS.svg) \n\n### Data \u0026 File description  \n![alt_text](filedesc.png) \n\n## References \n\n1. bamstats.pl script, 2012-2014 Genome Research Ltd (Author: Petr Danecek \u003cpd3@sanger.ac.uk\u003e)\n2. Anon, 2020. Anaconda Software Distribution, Anaconda Inc. Available at: https://docs.anaconda.com/.\n3. Philip Ewels, Måns Magnusson, Sverker Lundin, Max Käller, MultiQC: summarize analysis results for\nmultiple tools and samples in a single report, Bioinformatics, Volume 32, Issue 19, October 2016, Pages\n3047–3048, https://doi.org/10.1093/bioinformatics/btw354\n4. Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].\n5. Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu, fastp: an ultra-fast all-in-one FASTQ preprocessor,\nBioinformatics, Volume 34, Issue 17, September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560\n6. Heng Li, Minimap2: pairwise alignment for nucleotide sequences, Bioinformatics, Volume 34,\nIssue 18, September 2018, Pages 3094–3100, https://doi.org/10.1093/bioinformatics/bty191\n7. Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O\nPollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li,\nTwelve years of SAMtools and BCFtools, GigaScience, Volume 10, Issue 2, February 2021,\ngiab008, https://doi.org/10.1093/gigascience/giab008\n8. Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G., \u0026\nMesirov, J. P. (2011). Integrative genomics viewer. Nature biotechnology, 29(1), 24–26. https://doi.org/10.1038/nbt.1754\n9. O'Toole Á, Hill V, Pybus OG et al. Tracking the international spread of SARS-CoV-2 lineages B.1.1.7 and\nB.1.351/501Y-V2 [version 1; peer review: 3 approved]. 
Wellcome Open Res 2021, 6:121 (https://doi.org/10.12688/wellcomeopenres.16661.1)\n10. Áine O’Toole, Emily Scher, Anthony Underwood, Ben Jackson, Verity Hill, John T McCrone, Rachel\nColquhoun, Chris Ruis, Khalil Abu-Dahab, Ben Taylor, Corin Yeats, Louis du Plessis, Daniel Maloney, Nathan\nMedd, Stephen W Attwood, David M Aanensen, Edward C Holmes, Oliver G Pybus, Andrew Rambaut,\nAssignment of epidemiological lineages in an emerging pandemic using the pangolin tool, Virus Evolution,\nVolume 7, Issue 2, December 2021, veab064, https://doi.org/10.1093/ve/veab064\n11. Rambaut, A., Holmes, E.C., O’Toole, Á. et al. A dynamic nomenclature proposal for SARS-CoV-2 lineages to\nassist genomic epidemiology. Nat Microbiol 5, 1403–1407 (2020). https://doi.org/10.1038/s41564-020-0770-5\n12. Tool for QC with consensus sequences https://github.com/rki-mf1/president\n13. Au, C., Ho, D., Kwong, A. et al. BAMClipper: removing primers from alignments to\nminimize false-negative mutations in amplicon next-generation sequencing. Sci Rep 7, 1567\n(2017). https://doi.org/10.1038/s41598-017-01703-6\n14. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing.\narXiv preprint arXiv:1207.3907 [q-bio.GN] 2012\n15. Vcflib and tools for processing the VCF variant call format. Erik Garrison, Zev N.\nKronenberg, Eric T. Dawson, Brent S. Pedersen, Pjotr Prins. bioRxiv 2021.05.21.445151; doi:\nhttps://doi.org/10.1101/2021.05.21.445151\n16. Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert E. Handsaker,\nGerton Lunter, Gabor T. Marth, Stephen T. Sherry, Gilean McVean, Richard Durbin, 1000 Genomes Project Analysis\nGroup, The variant call format and VCFtools, Bioinformatics, Volume 27, Issue 15, August 2011, Pages 2156–2158, https://doi.org/10.1093/bioinformatics/btr330\n17. 
Waterhouse, A.M., Procter, J.B., Martin, D.M.A, Clamp, M., Barton, G.J (2009), \"Jalview version 2: A Multiple Sequence\nAlignment and Analysis Workbench,\" Bioinformatics 25 (9) 1189-1191 doi: 10.1093/bioinformatics/btp033\n18. Nguyen, L. T., Schmidt, H. A., von Haeseler, A., \u0026 Minh, B. Q. (2015). IQ-TREE: a fast and effective stochastic algorithm\nfor estimating maximum-likelihood phylogenies. Molecular biology and evolution, 32(1), 268–274. https://doi.org/10.1093/molbev/msu300\n19. Bui Quang Minh, Heiko A Schmidt, Olga Chernomor, Dominik Schrempf, Michael D Woodhams, Arndt von Haeseler,\nRobert Lanfear, IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era, Molecular\nBiology and Evolution, Volume 37, Issue 5, May 2020, Pages 1530–1534, https://doi.org/10.1093/molbev/msaa015\n20. Aaron R. Quinlan, Ira M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features,\nBioinformatics, Volume 26, Issue 6, March 2010, Pages 841–842, https://doi.org/10.1093/bioinformatics/btq033\n21. Katoh, K., \u0026 Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in\nperformance and usability. Molecular biology and evolution, 30(4), 772–780. https://doi.org/10.1093/molbev/mst010\n22. R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical\nComputing, Vienna, Austria. URL https://www.R-project.org/.\n23. Knaus, B.J. and Grünwald, N.J. (2017), vcfr: a package to manipulate and visualize variant call format data in R.\nMol Ecol Resour, 17: 44-53. https://doi.org/10.1111/1755-0998.12549\n24. National Center for Biotechnology Information (NCBI)[Internet]. Bethesda (MD): National Library of\nMedicine (US), National Center for Biotechnology Information; [1988] – [cited 2023 Sep 29]. 
Available from:\nhttps://www.ncbi.nlm.nih.gov/\n\n### Device Info \n\nThe analysis was run and the results were generated on:\n\n**OS**          Fedora Linux 38 \u003cbr\u003e\n**Kernel**      Linux 6.4.15-200.fc38.x86_64 \u003cbr\u003e\n**Processor**   Intel i5-8250U (8 slots), with CUDA support \u003cbr\u003e\n**Graphics**    UHD 620 (KBL GT2) \u003cbr\u003e\n**Memory**      8 GB \n\n# SARS-CoV-2 genome assembly from Illumina reads \nCourse: [SARS-2 Bioinformatics \u0026 Data Science](https://github.com/rki-mf1/2023-SC2-Data-Science) \u003cbr\u003e\nInstructors: Max von Kleist, Martin Hölzer \u003cbr\u003e\nInstitution: Freie Universität Berlin, Robert Koch Institute \u003cbr\u003e \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbibymaths%2Fsars_genome_assembly","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbibymaths%2Fsars_genome_assembly","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbibymaths%2Fsars_genome_assembly/lists"}