{"id":18400636,"url":"https://github.com/lenaschimmel/sc2rf","last_synced_at":"2025-04-07T06:33:35.188Z","repository":{"id":40936171,"uuid":"466841248","full_name":"lenaschimmel/sc2rf","owner":"lenaschimmel","description":"SARS-Cov-2 Recombinant Finder for fasta sequences","archived":false,"fork":false,"pushed_at":"2022-06-22T16:10:48.000Z","size":810,"stargazers_count":49,"open_issues_count":22,"forks_count":13,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-03-22T14:11:13.301Z","etag":null,"topics":["covid","genetic","mutations","recombinants","sars-cov-2"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lenaschimmel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-03-06T19:55:44.000Z","updated_at":"2025-03-20T13:21:00.000Z","dependencies_parsed_at":"2022-09-03T17:10:53.233Z","dependency_job_id":null,"html_url":"https://github.com/lenaschimmel/sc2rf","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lenaschimmel%2Fsc2rf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lenaschimmel%2Fsc2rf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lenaschimmel%2Fsc2rf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lenaschimmel%2Fsc2rf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lenaschimmel","download_url":"https://codeload.github.com/lenaschimmel/sc2rf/tar.gz/refs/heads/main","host":{"name":"GitHub","
url":"https://github.com","kind":"github","repositories_count":247607764,"owners_count":20965945,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["covid","genetic","mutations","recombinants","sars-cov-2"],"created_at":"2024-11-06T02:35:33.420Z","updated_at":"2025-04-07T06:33:32.736Z","avatar_url":"https://github.com/lenaschimmel.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Sc2rf - SARS-Cov-2 Recombinant Finder\n_Pronounced: Scarf_\n\n## What's this?\nSc2rf can search genome sequences of SARS-CoV-2 for potential recombinants - new virus lineages that have (partial) genes from more than one parent lineage.\n\n## Is it already usable? \n**This is a very young project, started on March 5th, 2022. As such, proceed with care. Results may be wrong or misleading, and with every update, anything can still change a lot.**\n\nAnyway, I'm happy that scientists are already seeing benefits from Sc2rf and using it to prepare lineage proposals for [cov-lineages/pango-designation](https://github.com/cov-lineages/pango-designation/issues).\n\nThough I already have a lot of ideas and plans for Sc2rf (see the bottom of this document), I'm very open to suggestions and feature requests. 
Please write an [issue](https://github.com/lenaschimmel/sarscov2recombinants/issues), start a [discussion](https://github.com/lenaschimmel/sarscov2recombinants/discussions) or get in touch via [mail](mailto:mail@lenaschimmel.de) or [twitter](https://twitter.com/LenaSchimmel)!\n\n## Example output\n![Screenshot of the terminal output of Sc2rf](screenshot-no-deletions.png)\n\n## Requirements and Installation\nYou need at least Python 3.6 and you need to install the requirements first. You might use something like `python3 -m pip install -r requirements.txt` to do that. There's a `setup.py` which you should probably ignore, since it's a work in progress and does not work as intended yet.\n\nAlso, you need a terminal which supports ANSI control sequences to display colored text. On Linux, MacOS, etc. it should probably work. \n\nOn Windows, color support is tricky. On a recent version of Windows 10, it should work, but if it doesn't, install Windows Terminal from [GitHub](https://github.com/Microsoft/Terminal) or [Microsoft Store](https://www.microsoft.com/de-de/p/windows-terminal/9n0dx20hk701?rtc=1\u0026activetab=pivot:overviewtab) and run it from there.\n\n## Basic Usage\nStart with a `.fasta` file with one or more sequences which might contain recombinants. Your sequences have to be aligned to the `reference.fasta`. If they are not, you will get an error message like:\n\n\u003e Sequence hCoV-19/Phantasialand/EFWEFWD not properly aligned, length is 29718 instead of 29903.\n\n_(For historical reasons, I always used [Nextclade](https://docs.nextstrain.org/projects/nextclade/en/stable/user/nextclade-cli.html) to get aligned sequences, but you might also use [Nextalign](https://docs.nextstrain.org/projects/nextclade/en/stable/user/nextalign-cli.html) or any other tool. Installing them is easy on Linux or MacOS, but not on Windows. 
You can also use a web-based tool like [MAFFT](https://mafft.cbrc.jp/alignment/software/closelyrelatedviralgenomes.html).)_\n\nThen call:\n\n```\nsc2rf.py \u003cyour_filename.fasta\u003e\n```\n\nIf you just need some fasta files for testing, you can search the [pango-lineage proposals](https://github.com/cov-lineages/pango-designation/issues) for recombinant issues with fasta-files, or take some files from [my shared-sequences repository](https://github.com/lenaschimmel/shared-sequences), which might not contain any actual recombinants, but does contain hundreds of sequences that look like recombinants!\n\n## No output / some sequences not shown\nBy default, a lot of filters are active to show only the likely recombinants, so that you can input 10000s of sequences and just get output for the interesting ones. If you want, you can disable all filters like this, which is only recommended for small input files with fewer than 100 sequences:\n\n```\nsc2rf.py --parents 1-35 --breakpoints 0-100 \\\n--unique 1 --max-ambiguous 10000 \u003cyour_filename.fasta\u003e\n```\n\nor even\n\n```\nsc2rf.py --parents 1-35 --breakpoints 0-100 \\\n--unique 1 --max-ambiguous 10000 --force-all-parents \\\n--clades all \u003cyour_filename.fasta\u003e\n```\n\nThe meaning of these parameters is described below.\n\n## Advanced Usage\nYou can execute `sc2rf.py -h` to get exactly this help message:\n\n\u003c!-- BEGIN_MARKER --\u003e\n```\nusage: sc2rf.py [-h] [--primers [PRIMER ...]]\n                [--primer-intervals [INTERVAL ...]]\n                [--parents INTERVAL] [--breakpoints INTERVAL]\n                [--clades [CLADES ...]] [--unique NUM]\n                [--max-intermission-length NUM]\n                [--max-intermission-count NUM]\n                [--max-name-length NUM] [--max-ambiguous NUM]\n                [--force-all-parents]\n                [--select-sequences INTERVAL]\n                [--enable-deletions] [--show-private-mutations]\n                [--rebuild-examples] 
[--mutation-threshold NUM]\n                [--add-spaces [NUM]] [--sort-by-id [NUM]]\n                [--verbose] [--ansi] [--hide-progress]\n                [--csvfile CSVFILE]\n                [input ...]\n\nAnalyse SARS-CoV-2 sequences for potential, unknown recombinant\nvariants.\n\npositional arguments:\n  input                 input sequence(s) to test, as aligned\n                        .fasta file(s) (default: None)\n\noptional arguments:\n  -h, --help            show this help message and exit\n\n  --primers [PRIMER ...]\n                        Filenames of primer set(s) to visualize.\n                        The .bed formats for ARTIC and EasySeq\n                        are recognized and supported. (default:\n                        None)\n\n  --primer-intervals [INTERVAL ...]\n                        Coordinate intervals in which to\n                        visualize primers. (default: None)\n\n  --parents INTERVAL, -p INTERVAL\n                        Allowed number of potential parents of a\n                        recombinant. (default: 2-4)\n\n  --breakpoints INTERVAL, -b INTERVAL\n                        Allowed number of breakpoints in a\n                        recombinant. (default: 1-4)\n\n  --clades [CLADES ...], -c [CLADES ...]\n                        List of variants which are considered as\n                        potential parents. Use Nextstrain clades\n                        (like \"21B\"), or Pango Lineages (like\n                        \"B.1.617.1\") or both. Also accepts \"all\".\n                        (default: ['20I', '20H', '20J', '21I',\n                        '21J', 'BA.1', 'BA.2', 'BA.3'])\n\n  --unique NUM, -u NUM  Minimum of substitutions in a sample\n                        which are unique to a potential parent\n                        clade, so that the clade will be\n                        considered. 
(default: 2)\n\n  --max-intermission-length NUM, -l NUM\n                        The maximum length of an intermission in\n                        consecutive substitutions. Intermissions\n                        are stretches to be ignored when counting\n                        breakpoints. (default: 2)\n\n  --max-intermission-count NUM, -i NUM\n                        The maximum number of intermissions which\n                        will be ignored. Surplus intermissions\n                        count towards the number of breakpoints.\n                        (default: 8)\n\n  --max-name-length NUM, -n NUM\n                        Only show up to NUM characters of sample\n                        names. (default: 30)\n\n  --max-ambiguous NUM, -a NUM\n                        Maximum number of ambiguous nucs in a\n                        sample before it gets ignored. (default:\n                        50)\n\n  --force-all-parents, -f\n                        Force to consider all clades as potential\n                        parents for all sequences. Only useful\n                        for debugging.\n\n  --select-sequences INTERVAL, -s INTERVAL\n                        Use only a specific range of input\n                        sequences. DOES NOT YET WORK WITH\n                        MULTIPLE INPUT FILES. 
(default: 0-999999)\n\n  --enable-deletions, -d\n                        Include deletions in lineage comparision.\n\n  --show-private-mutations\n                        Display mutations which are not in any of\n                        the potential parental clades.\n\n  --rebuild-examples, -r\n                        Rebuild the mutations in examples by\n                        querying cov-spectrum.org.\n\n  --mutation-threshold NUM, -t NUM\n                        Consider mutations with a prevalence of\n                        at least NUM as mandatory for a clade\n                        (range 0.05 - 1.0, default: 0.75).\n\n  --add-spaces [NUM]    Add spaces between every N colums, which\n                        makes it easier to keep your eye at a\n                        fixed place. (default without flag: 0,\n                        default with flag: 5)\n\n  --sort-by-id [NUM]    Sort the input sequences by the ID. If\n                        you provide NUM, only the first NUM\n                        characters are considered. Useful if this\n                        correlates with meaning full meta\n                        information, e.g. the sequencing lab.\n                        (default without flag: 0, default with\n                        flag: 999)\n\n  --verbose, -v         Print some more information, mostly\n                        useful for debugging.\n\n  --ansi                Use only ASCII characters to be\n                        compatible with ansilove.\n\n  --hide-progress       Don't show progress bars during long\n                        task.\n\n  --csvfile CSVFILE     Path to write results in CSV format.\n                        (default: None)\n\nAn Interval can be a single number (\"3\"), a closed interval\n(\"2-5\" ) or an open one (\"4-\" or \"-7\"). 
The limits are inclusive.\nOnly positive numbers are supported.\n\n```\n\u003c!-- END_MARKER --\u003e\n\n\n\n## Interpreting the output\n_To be written..._\n\nThere is already a short [Twitter thread](https://twitter.com/LenaSchimmel/status/1506768971931996162) which explains the basics.\n\n## Source material attribution\n * `virus_properties.json` contains data from [LAPIS / cov-spectrum](https://lapis.cov-spectrum.org/) which uses data from [NCBI GenBank](https://www.ncbi.nlm.nih.gov/genbank/), prepared and hosted by Nextstrain, see [blog post](https://nextstrain.org/blog/2021-07-08-ncov-open-announcement).\n * `reference.fasta` is taken from Nextstrain's [nextclade_data](https://github.com/nextstrain/nextclade_data/tree/master/data/datasets/sars-cov-2/references/MN908947/versions/2022-03-04T12:00:00Z/files), see [NCBI](https://www.ncbi.nlm.nih.gov/nuccore/MN908947) for attribution. \n * `mapping.csv` is a modified version of the table on the [covariants homepage](https://covariants.org/) by Nextstrain.\n * Example output / screenshot based on sequences published by the [German Robert-Koch-Institut](https://github.com/robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland).\n * Primers:\n   * [ARTIC primers](https://github.com/artic-network/artic-ncov2019) CC-BY-4.0 by the ARTICnetwork project\n   * ~~[EasySeq primers](https://github.com/JordyCoolen/easyseq_covid19) by Coolen, J. P., Wolters, F., Tostmann, A., van Groningen, L. F., Bleeker-Rovers, C. P., Tan, E. C., ... \u0026 Melchers, W. J.~~ Removed until I understand the format of the `.bed` file. 
There will be an issue soon.\n   * [midnight primers](https://zenodo.org/record/3897530#.Xuk7oGpLjep) CC-BY-4.0 by Silander, Olin K, Massey University\n\nThe initial version of this program was written in cooperation with [@flauschzelle](https://github.com/flauschzelle).\n\n## TODO / IDEAS / PLANS\n * [ ] Move these TODOs into actual issues\n * [x] add disclaimer and link to pango-designation\n * [ ] provide a sample file (maybe both `.fasta` and `.csv`, as long as the csv step is still needed)\n * [x] accept aligned fasta \n   * [x] as input file\n   * [ ] as piped stream\n * [ ] If we still accept csv/ssv input, autodetect the delimiter either by file name or by analysing the first line\n * [ ] find a way to handle already designated recombinant lineages\n * [ ] Output structured results\n   * [ ] csv\n   * [ ] html?\n   * [ ] fasta of all sequences that match the criteria, which enables efficient multi-pass strategies\n * [ ] filter sequences\n   * [ ] by ID\n   * [ ] by metadata\n * [ ] take metadata csv\n * [ ] document the output in README\n * [ ] check / fix `--enable-deletions`\n * [x] adjustable threshold for mutation prevalence\n * [ ] new color mode (with background color and monochrome text on top)\n * [ ] new bar mode (with colored lines beneath each sequence, one for each example sequence, and \"intermissions\" shown in the color of the \"surrounding\" lineage, but not as bright)\n * [ ] interactive mode, for filtering, reordering, etc.\n * [x] sort sequences within each block\n * [ ] re-think this whole \"intermission\" concept\n * [ ] select a single sequence and let the tool refine the choice of parental sequences, not just focusing on commonly known lineages (going up and down in the tree)\n * [ ] use more common terms to describe things (needs feedback from people with actual experience in the 
field)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flenaschimmel%2Fsc2rf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flenaschimmel%2Fsc2rf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flenaschimmel%2Fsc2rf/lists"}