{"id":19592848,"url":"https://github.com/soedinglab/spacepharer","last_synced_at":"2025-04-27T14:33:57.247Z","repository":{"id":94860386,"uuid":"202176754","full_name":"soedinglab/spacepharer","owner":"soedinglab","description":"SpacePHARER CRISPR Spacer Phage-Host pAiRs findER","archived":false,"fork":false,"pushed_at":"2024-05-08T13:45:17.000Z","size":12201,"stargazers_count":40,"open_issues_count":5,"forks_count":4,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-04-05T00:51:16.539Z","etag":null,"topics":["bioinformatics","crispr","host-pathogen","sequence-analysis"],"latest_commit_sha":null,"homepage":"https://spacepharer.soedinglab.org","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/soedinglab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-08-13T15:47:55.000Z","updated_at":"2025-03-31T10:04:03.000Z","dependencies_parsed_at":"2023-04-06T16:16:45.299Z","dependency_job_id":"9d284b1e-8c0d-4929-8beb-ce6029458f50","html_url":"https://github.com/soedinglab/spacepharer","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soedinglab%2Fspacepharer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soedinglab%2Fspacepharer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soedinglab%2Fspacepharer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soedinglab%2Fspacepharer/manifests","owner_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/soedinglab","download_url":"https://codeload.github.com/soedinglab/spacepharer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251154823,"owners_count":21544563,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","crispr","host-pathogen","sequence-analysis"],"created_at":"2024-11-11T08:37:11.367Z","updated_at":"2025-04-27T14:33:52.215Z","avatar_url":"https://github.com/soedinglab.png","language":"C","funding_links":[],"categories":["CRISPR Analysis"],"sub_categories":["Viral Orthologous Groups"],"readme":"# SpacePHARER: CRISPR Spacer Phage-Host pAiRs findER\n\nSpacePHARER is a modular toolkit for sensitive phage-host interaction identification using CRISPR spacers. SpacePHARER adapts the fast homology search capabilities of [MMseqs2](https://github.com/soedinglab/MMseqs2) to sensitively query short spacer sequences. It introduces a novel approach of aggregating sets of spacer-based hits to discover phage-host matches. SpacePHARER is GPLv3-licensed open source software implemented in C++ and available for Linux and macOS. The software is designed to run efficiently on multiple cores.\n\nSpacePHARER is also available as Google Colab notebook.\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/soedinglab/spacepharer/blob/master/examples/SpacePHARER.ipynb)\n\n[Zhang, R., Mirdita, M., Levy Karin, E., Norroy, C., Galiez, C., \u0026 Söding, J.  
SpacePHARER: Sensitive identification of phages from CRISPR spacers in prokaryotic hosts. Bioinformatics, doi: 10.1093/bioinformatics/btab222 (2021).](https://doi.org/10.1093/bioinformatics/btab222)\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"https://github.com/soedinglab/spacepharer/blob/master/.github/SpacePHARER.png\" height=\"250\"/\u003e\u003c/p\u003e\n\n## Installation\n\nSpacePHARER can be used by compiling from source (see below) or downloading a statically compiled version. It requires a 64-bit system. We recommend using a system with at least the SSE4.1 instruction set (check by executing `cat /proc/cpuinfo | grep sse4_1` on Linux).\n\n    # install from bioconda\n    conda install -c conda-forge -c bioconda spacepharer\n    # pull docker container\n    docker pull soedinglab/spacepharer\n    # static Linux AVX2 build\n    wget https://mmseqs.com/spacepharer/spacepharer-linux-avx2.tar.gz; tar xvzf spacepharer-linux-avx2.tar.gz; export PATH=$(pwd)/spacepharer/bin/:$PATH\n    # static Linux SSE4.1 build\n    wget https://mmseqs.com/spacepharer/spacepharer-linux-sse41.tar.gz; tar xvzf spacepharer-linux-sse41.tar.gz; export PATH=$(pwd)/spacepharer/bin/:$PATH\n    # static macOS build (universal binary with SSE4.1/AVX2/M1 NEON)\n    wget https://mmseqs.com/spacepharer/spacepharer-osx-universal.tar.gz; tar xvzf spacepharer-osx-universal.tar.gz; export PATH=$(pwd)/spacepharer/bin/:$PATH\n\nPrecompiled binaries for other architectures (ARM64, PPC64LE) and very old AMD/Intel CPUs (SSE2 only) are available at [https://mmseqs.com/spacepharer](https://mmseqs.com/spacepharer).\n\n### Compile from source\n\nCompiling SpacePHARER from source has the advantage of system-specific optimizations, which should improve its performance. To compile SpacePHARER `git`, `g++` (4.9 or higher) and `cmake` (3.0 or higher) are required. 
Afterwards, the SpacePHARER binary will be located in the `build/bin` directory.\n\n    git clone https://github.com/soedinglab/spacepharer.git\n    cd spacepharer\n    mkdir build\n    cd build\n    cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..\n    make -j\n    make install\n    export PATH=$(pwd)/bin/:$PATH\n\n:exclamation: If you want to compile SpacePHARER on macOS, please install and use `gcc` from Homebrew. The default macOS `clang` compiler does not support OpenMP, so SpacePHARER will not be able to run multithreaded. Adjust the `cmake` call above to:\n\n    CC=\"$(brew --prefix)/bin/gcc-10\" CXX=\"$(brew --prefix)/bin/g++-10\" cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..\n\n## Input\n\nSpacePHARER conducts a similarity search between six-frame translated CRISPR spacer sequences and sets of phage **ORFs** (open reading frames), combines multiple pieces of evidence (**hits**) found between the two sets and predicts prokaryote-phage pairs (**matches**) with strictly controlled **FDR** (false discovery rate). The starting point is a set of FASTA files of nucleotide sequences (`.fasta` or `.fasta.gz`). Spacers should be provided in multiple FASTA files, each containing spacers from one genome. For spacers, SpacePHARER also accepts output files from the following common CRISPR array analysis tools: [PILER-CR](https://www.drive5.com/pilercr/), [CRT](http://www.room220.com/crt/), [MinCED](https://github.com/ctSkennerton/minced) (derived from CRT format) and [CRISPRDetect](http://crispr.otago.ac.nz/CRISPRDetect/predict_crispr_array.html). 
Phage genomes are supplied as separate FASTA files (one genome per file), or can be downloaded with `downloaddb` (see below).\n\n## Running SpacePHARER\n\n### Main Modules\n\n* `easy-predict`      Predict phage-host matches from multiFASTA and common spacer files (PILER-CR, CRISPRDetect and CRT)\n* `downloaddb`        Download spacers or GenBank phage genomes and create sequence database\n* `createsetdb`       Create sequence database from FASTA input\n* `predictmatch`      Predict host-phage matches\n* `parsespacer`       Parse a file containing a CRISPR array in supported formats (CRT, PILER-CR and CRISPRDetect)\n\n### Important parameters\n\n    --reverse-fragments      reverse AA fragments (ORFs) to generate control setDB\n    --fdr                    false discovery rate cut-off to determine S_comb threshold of predictions\n    --fmt                    output format for predictmatch. (0: short (only matches); 1: long (matches and hits); 2: long with nucleotide alignment)\n\n\n### Quick start\n\nTo start, you need to create a database of the phage genomes `targetSetDB` and a database of control sequences `targetSetDB_rev`, against which no true match is expected. Here we create control sequences by reversing the extracted ORFs.\n\n    spacepharer createsetdb examples/GCA*.fna.gz targetSetDB tmpFolder\n    spacepharer createsetdb examples/GCA*.fna.gz targetSetDB_rev tmpFolder --reverse-fragments 1\n\nAlternatively, you can use `downloaddb` to download a list of phage genomes. 
Here, the reversed control sequences are automatically created.\n\n    # GenBank_phage_2018_09 is a set of nearly 8000 phage genomes\n    spacepharer downloaddb GenBank_phage_2018_09 targetSetDB tmpFolder\n      \n    # Alternatively you can pass a list of URLs to downloaddb\n    spacepharer downloaddb examples/genome_list.tsv targetSetDB tmpFolder\n\nThe `easy-predict` workflow directly returns a tab-separated (`.tsv`) file containing phage-host predictions from (multiple) FASTA files\n\n    spacepharer easy-predict examples/*.fas targetSetDB predictions.tsv tmpFolder\n\nor from supported CRISPR array files (CRT/MinCED, PILER-CR or CRISPRDetect) used as queries.\n \n    spacepharer easy-predict examples/crisprdetect_test examples/pilercr_test targetSetDB predictions.tsv tmpFolder\n \n### Creating databases\n\nBefore searching, query and target sequences contained in FASTA files need to be converted to database format by calling `createsetdb`. This command first creates a sequence DB, then extracts and translates all putative protein fragments (ORFs), and finally generates associated metadata. For spacer sequences, setting the parameter `--extractorf-spacer 1` is important to properly extract putative protein fragments from the spacers, each of which is usually a short partial ORF, not necessarily in frame.\n\n    spacepharer createsetdb Query1.fasta [...QueryN.fasta] querySetDB tmpFolder --extractorf-spacer 1\n    spacepharer createsetdb Target1.fasta [...TargetN.fasta] targetSetDB tmpFolder\n\nYou will also need to generate a control target set DB to allow SpacePHARER to calibrate the cutoff for reporting matches. 
SpacePHARER can generate such a control set by reversing the protein fragments of your provided target DB using the parameter `--reverse-fragments 1`:\n\n    spacepharer createsetdb Target1.fasta [...TargetN.fasta] controlSetDB tmpFolder --reverse-fragments 1\n      \n      \n#### Downloading query CRISPR spacer sets\n\nAs an alternative to creating a query setDB, you can use `downloaddb` to download a comprehensive set of CRISPR spacers. The query setDB will be automatically created in the provided path.\n\n    # spacers_shmakov_et_al_2017 is a set of more than 30000 CRISPR spacer sets (Shmakov et al., 2017)\n    spacepharer downloaddb spacers_shmakov_et_al_2017 querySetDB tmpFolder\n    \n    # spacers_dion_et_al_2021 is a set of more than 490000 CRISPR spacer sets (Dion et al., 2021)\n    spacepharer downloaddb spacers_dion_et_al_2021 querySetDB tmpFolder\n\n#### Downloading target genomes\n\nAs an alternative to creating the target and control setDBs yourself, the `downloaddb` module can download either a provided list of URLs to phage genomes or a predefined list of phage genomes and create a target setDB in the provided path.\n\n    spacepharer downloaddb GenBank_phage_2018_09 targetSetDB tmpFolder\n      \n    # Generating control sequences can be disabled if a different set will be used\n    spacepharer downloaddb GenBank_phage_2018_09 targetSetDB tmpFolder --reverse-setdb 0\n\nA list of predefined spacer or phage catalogues can be shown by executing `downloaddb` without additional parameters:\n\n    spacepharer downloaddb\n\nA file containing URLs can also be supplied to `downloaddb`:\n\n    spacepharer downloaddb urls.txt targetSetDB tmpFolder\n    # urls.txt content\n    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/836/905/GCA_000836905.1_ViralProj14035/GCA_000836905.1_ViralProj14035_genomic.fna.gz\n    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/845/445/GCA_000845445.1_ViralProj14409/GCA_000845445.1_ViralProj14409_genomic.fna.gz\n\n#### Adding taxonomic 
labels\n\nIf input spacers are supplied together with taxonomic identifiers, the lowest common ancestor (LCA) of the phages is computed for each spacer. If input genomes are supplied together with taxonomic identifiers, the LCA of the hosts is computed for each phage.\nAlso, the [SpacePHARER output](#the-spacepharer-output) will contain taxonomic information for each match.\n\nDatabases downloaded from the predefined entries in `downloaddb` come with taxonomic information already included. For custom databases, extra steps have to be taken:\n\n#### Taxonomic labels in `createsetdb`\n\nWhen calling `createsetdb`, you can supply a tab-separated file that maps file names to NCBI taxonomy identifiers with the `--tax-mapping-file` parameter:\n\n    # create phage database with taxonomic labels\n    spacepharer createsetdb Target1.fasta [...TargetN.fasta] targetSetDB tmpFolder --tax-mapping-file targets.tsv\n    # targets.tsv content (\\t is a tab character)\n    Target1.fasta\\t10665\n    ...\n    TargetN.fasta\\t31754\n\n#### Taxonomic labels in `downloaddb URLFILE`\n\nIf a file containing URLs is supplied to `downloaddb`, taxonomic labels can be provided in the second column of the URL file:\n\n    spacepharer downloaddb urls.txt targetSetDB tmpFolder\n    # urls.txt content (\\t is a tab character)\n    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/836/905/GCA_000836905.1_ViralProj14035/GCA_000836905.1_ViralProj14035_genomic.fna.gz\\t10679\n    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/845/445/GCA_000845445.1_ViralProj14409/GCA_000845445.1_ViralProj14409_genomic.fna.gz\\t244310\n\n### Parsing spacer files\n\nIf you wish to provide spacer files (CRT/MinCED, PILER-CR or CRISPRDetect) as queries, `parsespacer` will extract spacer sequences from each file and create a sequence DB. 
Then use `createsetdb` to generate associated metadata.\n\n    spacepharer parsespacer spacerFile1.txt [...spacerFileN.txt] queryDB \n    spacepharer createsetdb queryDB querySetDB tmpFolder --extractorf-spacer 1\n\n#### Sample commands for running spacer extraction tools\n\nBelow are sample commands for common spacer extraction tools whose output format is accepted by SpacePHARER. If you use any of these tools, it is advisable to check their user manuals for updates to these commands.\n\nPILER-CR: Use of `-noinfo` is mandatory. Otherwise, the format cannot be read.\n\n    pilercr -noinfo -quiet -in prok.fasta -out prok.txt\n\nCRT:\n\n    java -cp CRT1.2-CLI.jar crt prok.fasta prok.txt\n\nMinCED:\n\n    minced prok.fasta prok.txt\n\nCRISPRDetect is only available as a [web server](http://crispr.otago.ac.nz/CRISPRDetect/predict_crispr_array.html).\n\n### Searching and predicting matches\n\nThe `predictmatch` workflow gives more control over the execution of the prediction. Here, a separate control sequence set DB `controlSetDB` can be used. For example, we can assume that any spacer hit against a eukaryote-targeting virus is a false positive:\n\n    spacepharer downloaddb GenBank_phage_2018_09 targetSetDB tmp --reverse-setdb 0\n    spacepharer downloaddb GenBank_eukvir_2018_09 controlSetDB tmp --reverse-setdb 0\n    spacepharer predictmatch querySetDB targetSetDB controlSetDB outputFileName.tsv tmpFolder\n\n### The SpacePHARER output\n\nUpon completion, SpacePHARER outputs a tab-separated text file (`.tsv`). 
Each prokaryote-phage match spans two or more lines:\n\n    #prok_acc  phage_acc   S_comb      num_hits\n    \u003espacer_acc      phage_acc   p_bh    spacer_start      spacer_end  phage_start phage_end   5'_PAM|3'_PAM    5'_PAM|3'_PAM(reverse strand)\n    *NUCL_SEQ_ALN_SPACER*\n    *NUCL_SEQ_ALN_PHAGE*\n\nThe first line of a match starts with `#` and lists the prokaryotic accession, phage accession, combined score and number of hits in the match.\n\nEach following line describes an individual hit: spacer accession, phage accession, best-hit p-value (`p_bh`), spacer start and end, phage start and end, possible 5’ PAM|3’ PAM, possible 5’ PAM|3’ PAM on the reverse strand.\n\nOptionally, the aligned spacer and phage sequences can be printed in two additional lines following each hit line, using `--fmt 2`.\n\n`--fmt 0` produces the short format, which contains only the match lines.\n\nIf the phage database was created with taxonomic labels, a result file with the suffix `_lca.tsv` is also created. The first column of this file contains the spacer accession and the remaining columns are described in the [MMseqs2 wiki](https://github.com/soedinglab/MMseqs2/wiki#taxonomy-output-and-tsv).\n\nIf the spacer database was created with taxonomic labels, a result file with the suffix `_lca_per_target.tsv` is also created. The first column of this file contains the phage genome file name and the remaining columns are described in the [same MMseqs2 wiki entry](https://github.com/soedinglab/MMseqs2/wiki#taxonomy-output-and-tsv). Additionally, each match in the base `.tsv` will also contain these columns.\n\n### Removing temporary files\n\nDuring workflow execution, SpacePHARER keeps all intermediate outputs in `tmpFolder`. Passing the `--remove-tmp-files` parameter will clear out `tmpFolder` after the workflows have finished.\n\n## Hardware requirements\n\nSpacePHARER will scale its memory consumption based on the available main memory of the machine. 
SpacePHARER needs a 64-bit CPU with at least the SSE2 instruction set to run.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoedinglab%2Fspacepharer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsoedinglab%2Fspacepharer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoedinglab%2Fspacepharer/lists"}