{"id":32111533,"url":"https://github.com/crisprverse/crisprdesign","last_synced_at":"2025-10-20T14:37:03.761Z","repository":{"id":56777276,"uuid":"523800050","full_name":"crisprVerse/crisprDesign","owner":"crisprVerse","description":"Comprehensive design of CRISPR gRNAs for nucleases and base editors","archived":false,"fork":false,"pushed_at":"2025-03-05T18:46:42.000Z","size":6213,"stargazers_count":24,"open_issues_count":10,"forks_count":5,"subscribers_count":2,"default_branch":"devel","last_synced_at":"2025-10-13T04:37:50.060Z","etag":null,"topics":["bioconductor","bioconductor-package","crispr","crispr-cas9","crispr-design","crispr-target","genomics-analysis","grna","grna-sequence","grna-sequences","sgrna","sgrna-design"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/crisprVerse.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2022-08-11T16:48:52.000Z","updated_at":"2025-10-05T13:25:38.000Z","dependencies_parsed_at":"2024-01-13T03:46:30.110Z","dependency_job_id":"a14394e9-1421-4ecb-be12-5a3b124c6e9a","html_url":"https://github.com/crisprVerse/crisprDesign","commit_stats":{"total_commits":350,"total_committers":6,"mean_commits":"58.333333333333336","dds":0.4314285714285714,"last_synced_commit":"98724dbafe87863723bf829f9300cafa5852c130"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/crisprVerse/crisprDesign","repository_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crisprVerse%2FcrisprDesign","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crisprVerse%2FcrisprDesign/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crisprVerse%2FcrisprDesign/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crisprVerse%2FcrisprDesign/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/crisprVerse","download_url":"https://codeload.github.com/crisprVerse/crisprDesign/tar.gz/refs/heads/devel","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crisprVerse%2FcrisprDesign/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280106114,"owners_count":26273104,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-20T02:00:06.978Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioconductor","bioconductor-package","crispr","crispr-cas9","crispr-design","crispr-target","genomics-analysis","grna","grna-sequence","grna-sequences","sgrna","sgrna-design"],"created_at":"2025-10-20T14:36:59.204Z","updated_at":"2025-10-20T14:37:03.742Z","avatar_url":"https://github.com/crisprVerse.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\ntitle: \"Introduction to crisprDesign\"\noutput: \n  
github_document:\n    toc: true\nbibliography: vignettes/references.bib\n---\n\n```{r, echo=FALSE, results=\"hide\"}\noptions(\"knitr.graphics.auto_pdf\"=TRUE)\n```\n\nAuthors: Jean-Philippe Fortin, Aaron Lun, Luke Hoberecht\n\nDate: July 1, 2022\n\n\n# Introduction\n\n`crisprDesign` is the core package of the\n[crisprVerse](https://github.com/crisprVerse) ecosystem,\nand plays the role of a \none-stop shop for designing and annotating\nCRISPR guide RNA (gRNA) sequences. This includes the characterization of \non-targets and off-targets using different aligners, on- and off-target\nscoring, gene context annotation, SNP annotation, sequence feature\ncharacterization, repeat annotation, and many more.  \nThe software was developed to be as applicable and generalizable as\npossible. \n\nIt currently supports five types of \nCRISPR modalities (modes of perturbations): CRISPR knockout (CRISPRko), CRISPR\nactivation (CRISPRa), CRISPR interference (CRISPRi), CRISPR base editing\n(CRISPRbe), and CRISPR knockdown (CRISPRkd) (see @crispracrisprireview for a review of CRISPR modalities). \n\nIt utilizes the `crisprBase` package to enable gRNA design for any\nCRISPR nuclease and base editor via the `CrisprNuclease` and `BaseEditor`\nclasses, respectively. Nucleases that are commonly used in the field are \nprovided, including DNA-targeting nucleases (e.g. SpCas9, AsCas12a) and \nRNA-targeting nucleases (e.g. 
CasRx (RfxCas13d)).\n\n`crisprDesign` is fully developed to work with the genome of any organism, and\ncan also be used to design gRNAs targeting custom DNA sequences.\n\nFinally, more specialized gRNA design functionalities are also available,\nincluding design for optical pooled screening (OPS), paired gRNA design, \nand gRNA filtering and ranking functionalities.\n \nThis vignette is meant to be an overview of the main features included in\nthe package, using toy examples for the sake of time (the vignette has to\ncompile within a few minutes, as required by Bioconductor). For detailed\nand comprehensive tutorials, please visit our [crisprVerse tutorials page](https://github.com/crisprVerse/Tutorials). \n\n# Installation\n\n`crisprDesign` can be installed from the Bioconductor devel branch\nusing the following commands in a fresh R session:\n\n```{r, eval=FALSE}\nif (!require(\"BiocManager\", quietly = TRUE))\n    install.packages(\"BiocManager\")\n\nBiocManager::install(version=\"devel\")\nBiocManager::install(\"crisprDesign\")\n```\n\nUsers interested in contributing to `crisprDesign` might want to look at the \nfollowing CRISPR-related package dependencies:\n\n- [crisprBase](https://github.com/crisprVerse/crisprBase): core CRISPR functions and S4 objects\n- [crisprBowtie](https://github.com/crisprVerse/crisprBowtie): aligns gRNA spacers to genomes using the ungapped \naligner `bowtie`\n- [crisprBwa](https://github.com/crisprVerse/crisprBWa): aligns gRNA spacers to genomes using the ungapped \naligner `BWA`\n- [crisprScore](https://github.com/crisprVerse/crisprScore): implements state-of-the-art on- and off-target scoring \nalgorithms\n- [crisprViz](https://github.com/crisprVerse/crisprViz): gRNA visualization using genomic tracks\n\nYou can contribute to the package by submitting pull requests to our [GitHub repo](https://github.com/crisprVerse/crisprDesign). 
\n\nThe complete documentation for the package can be found [here](https://bioconductor.org/packages/devel/bioc/manuals/crisprDesign/man/crisprDesign.pdf).\n\n\n# Terminology\n\nCRISPR nucleases are examples of RNA-guided endonucleases. They require two\nbinding components for cleavage. First, the nuclease needs to recognize a\nconstant nucleotide motif in the target DNA called the protospacer adjacent\nmotif (PAM) sequence. Second, the gRNA, which guides the nuclease to the target\nsequence, needs to bind to a complementary sequence adjacent to the PAM\nsequence, called the **protospacer** sequence. The latter can be thought of as a\nvariable binding motif that can be specified by designing corresponding gRNA\nsequences.\n\nThe **spacer** sequence is used in the gRNA construct to guide\nthe CRISPR nuclease to the target **protospacer** sequence in the host genome.\n\nFor DNA-targeting nucleases, the nucleotide sequence of the spacer and protospacer are identical. For RNA-targeting nucleases, they are the reverse complement of each other. \n\nWhile a gRNA spacer sequence may not always uniquely target the host genome\n(i.e. it  may map to multiple protospacers in the host genome),\nwe can, for a given reference genome, uniquely identify a protospacer \nsequence with a combination of 3 attributes: \n\n- `chr`: chromosome name \n- `strand`: forward (+) or reverse (-)\n- `pam_site`: genomic coordinate of the first nucleotide of the \nnuclease-specific PAM sequence (e.g. for SpCas9, the \"N\" in the NGG PAM \nsequence; for AsCas12a, the first \"T\" of the TTTV PAM sequence)\n\nFor CRISPRko, we use an additional genomic coordinate, called `cut_site`, \nto represent where the double-stranded break (DSB) occurs. For SpCas9, the cut\nsite (blunt-ended dsDNA break) is located 4nt upstream of the pam_site\n(PAM-proximal editing). 
For AsCas12a, the 5nt 5' overhang dsDNA break will\ncause a cut 19nt after the PAM sequence on the targeted strand, and 23nt after\nthe PAM sequence on the opposite strand (PAM-distal editing).\n\n\n\n# CRISPRko design\n\nWe will illustrate the main functionalities of `crisprDesign` by \nperforming a common task: designing gRNAs to knock out a coding gene. In our\nexample, we will design gRNAs for the wildtype SpCas9 nuclease, with spacers\nhaving a length of 20nt. \n\n\n```{r, message=FALSE, warning=FALSE,results='hide' }\nlibrary(crisprDesign)\n```\n\n##  Nuclease specification\n\nThe `crisprBase` package provides functionalities to create objects that store\ninformation about CRISPR nucleases, and functions to interact with those\nobjects (see the `crisprBase` vignette). It also provides commonly-used CRISPR\nnucleases. Let's look at the `SpCas9` nuclease object:\n\n```{r}\nlibrary(crisprBase)\ndata(SpCas9, package=\"crisprBase\")\nSpCas9\n```\n\nThe three motifs (NGG, NAG and NGA) represent the recognized PAM sequences by\nSpCas9, and the weights indicate a recognition score. The canonical PAM\nsequence NGG is fully recognized (weight of 1), while the two non-canonical\nPAM sequences NAG and NGA are much less tolerated. \n\nThe spacer sequence is located on the 5-prime end with respect to the PAM\nsequence, and the default spacer sequence length is 20 nucleotides.\nIf necessary, we can change the spacer length using the function\n`crisprBase::spacerLength`. 
Let's see what the protospacer\nconstruct looks like by using `prototypeSequence`:\n\n```{r}\nprototypeSequence(SpCas9)\n```\n\n\n## Target DNA specification\n\nAs an example, we will design gRNAs that knockout the human gene IQSEC3 by\nfinding all protospacer sequences located in the coding region (CDS) \nof IQSEC3.\n\nTo do so, we need to create a `GRanges` object that defines the genomic\ncoordinates of the CDS of IQSEC3 in a reference genome.\n\n\nThe toy dataset `grListExample` object in `crisprDesign` contains gene \ncoordinates in hg38 for exons of all human IQSEC3 isoforms, and was\nobtained by converting an Ensembl `TxDb` object into a `GRangesList`\nobject using the `TxDb2GRangesList` convenience function in `crisprDesign`. \n\n```{r}\ndata(grListExample, package=\"crisprDesign\")\n```\n\nThe `queryTxObject` function allows us to query such objects for a specific\ngene and feature. Here, we obtain a `GRanges` object containing the CDS\ncoordinates of IQSEC3:\n\n\n```{r echo=TRUE, results='hide', warning=FALSE, message=FALSE}\ngr \u003c- queryTxObject(txObject=grListExample,\n                    featureType=\"cds\",\n                    queryColumn=\"gene_symbol\",\n                    queryValue=\"IQSEC3\")\n```\n\nWe will only consider the first exon to speed up design:\n\n```{r}\ngr \u003c- gr[1]\n```\n\n\n\n## Designing spacer sequences\n\n`findSpacers` is the main function to obtain a list of all\npossible spacer sequences targeting protospacers located in the target\nDNA sequence(s). 
If a `GRanges` object is provided as input, a `BSgenome`\nobject (object containing sequences of a reference genome) will need to be\nprovided as well:\n\n```{r, warning=FALSE, message=FALSE}\nlibrary(BSgenome.Hsapiens.UCSC.hg38)\nbsgenome \u003c- BSgenome.Hsapiens.UCSC.hg38\nguideSet \u003c- findSpacers(gr,\n                        bsgenome=bsgenome,\n                        crisprNuclease=SpCas9)\nguideSet\n```\n\nThis returns a `GuideSet` object that stores genomic coordinates for all spacer\nsequences found in the regions provided by `gr`. The `GuideSet` object is an\nextension of a `GenomicRanges` object that stores additional information about\ngRNAs. \n\nFor the subsequent sections, we will only work with a random subset of 20 \nspacer sequences:\n\n```{r}\nset.seed(10)\nguideSet \u003c- guideSet[sample(seq_along((guideSet)),20)]\n```\n\nSeveral accessor functions are provided to extract information about the\nspacer sequences:\n\n\n```{r}\nspacers(guideSet)\nprotospacers(guideSet)\npams(guideSet)\nhead(pamSites(guideSet))\nhead(cutSites(guideSet))\n```\n\nThe genomic locations stored in the IRanges represent the PAM site locations in the reference genome. 
\n\n\n## Sequence features characterization\n\nThere are specific spacer sequence features, independent of the genomic\ncontext of the protospacer sequence, that can reduce or even eliminate gRNA\nactivity:\n\n- **Poly-T stretches**: four or more consecutive T nucleotides in the \nspacer sequence may act as a transcriptional termination signal for \nthe U6 promoter.\n- **Self-complementarity**: complementary sites with the gRNA backbone \ncan compete with the targeted genomic sequence.\n- **Percent GC**: gRNAs with GC content between 20% and 80% are preferred.\n\nUse the function `addSequenceFeatures` to add these spacer sequence\ncharacteristics to the `GuideSet` object:\n\n\n```{r, eval=TRUE, warning=FALSE, message=FALSE}\nguideSet \u003c- addSequenceFeatures(guideSet)\nhead(guideSet)\n```\n\n\n## Off-target search\n\n\nIn order to select gRNAs that are most specific to our target \nof interest, it is important to avoid gRNAs that target additional \nloci in the genome with either perfect sequence complementarity \n(multiple on-targets), or imperfect complementarity through \ntolerated mismatches (off-targets). \n\nFor instance, both the SpCas9 and AsCas12a nucleases can be tolerant\nto mismatches between the gRNA spacer sequence (RNA) and the protospacer\nsequence (DNA), thereby making it critical to characterize off-targets to\nminimize the introduction of double-stranded breaks (DSBs) beyond\nour intended target. \n\n\nThe `addSpacerAlignments` function appends a list of putative on-\nand off-targets to a `GuideSet` object using one of three methods. The first \nmethod uses the fast aligner\n[bowtie](http://bowtie-bio.sourceforge.net/index.shtml)\n[@langmead2009bowtie] via the `crisprBowtie` package to map spacer sequences\nto a specified reference genome. 
This can be done by specifying\n`aligner=\"bowtie\"` in `addSpacerAlignments`.\n\nThe second method uses the fast aligner\n[BWA](https://github.com/lh3/bwa) via the `crisprBwa` package to map \nspacer sequences to a specified reference genome. \nThis can be done by specifying\n`aligner=\"bwa\"` in `addSpacerAlignments`. Note that this is not available\nfor Windows machines.\n\nThe third method uses the package `Biostrings` to search for similar sequences\nin a set of DNA sequences, usually provided through a `BSgenome` \nobject. This can be done by specifying\n`aligner=\"biostrings\"` in `addSpacerAlignments`. This is extremely slow,\nbut can be useful when searching for off-targets in custom short DNA\nsequences. \n\n\nWe can control the alignment parameters and output using several \nfunction arguments. `n_mismatches` sets the maximum number of permitted \ngRNA:DNA mismatches (up to 3 mismatches). `n_max_alignments` specifies the \nmaximum number of alignments for a given gRNA spacer sequence \n(1000 by default). The `n_max_alignments` parameter may be overruled by \nsetting `all_Possible_alignments=TRUE`, which returns all possible \nalignments. `canonical=TRUE` filters out protospacer sequences\nthat do not have a canonical PAM sequence.\n\n\nFinally, the `txObject` argument in `addSpacerAlignments`\nallows users to provide a `TxDb` object, or a `TxDb` object\nconverted into a `GRangesList` using the `TxDb2GRangesList` function, to \nannotate genomic alignments with a gene model annotation. This is useful\nto understand whether or not off-targets are located in the CDS of\nanother gene, for instance. \n\nFor the sake of time, we will search here for on- and off-targets located\nin the beginning of the human chr12 where the gene IQSEC3 is located.\nWe will use the bowtie method, with a maximum of 1 mismatch.\n\nFirst, we need to build a bowtie index sequence using the fasta file provided\nin `crisprDesign`. 
We use the `Rbowtie` package to build the index:\n\n```{r}\nlibrary(Rbowtie)\nfasta \u003c- system.file(package=\"crisprDesign\", \"fasta/chr12.fa\")\noutdir \u003c- tempdir()\nRbowtie::bowtie_build(fasta,\n                      outdir=outdir,\n                      force=TRUE,\n                      prefix=\"chr12\")\nbowtie_index \u003c- file.path(outdir, \"chr12\")\n```\n\nFor genome-wide off-target search, users will need to create a bowtie\nindex on the whole genome. This is explained \nin [this tutorial](https://github.com/crisprVerse/Tutorials/tree/master/Building_Genome_Indices).\n\nFinally, we also need to specify a `BSgenome` object storing DNA sequences\nof the human reference genome:\n\n\n```{r, results='hide', warning=FALSE}\nlibrary(BSgenome.Hsapiens.UCSC.hg38)\nbsgenome \u003c- BSgenome.Hsapiens.UCSC.hg38\n```\n\nWe are now ready to search for on- and off-targets:\n\n```{r, results='hide', warning=FALSE}\nguideSet \u003c- addSpacerAlignments(guideSet,\n                                txObject=grListExample,\n                                aligner_index=bowtie_index,\n                                bsgenome=bsgenome,\n                                n_mismatches=1)\n```\n\n\nLet's look at what was added to the `GuideSet`:\n\n```{r}\nguideSet\n```\n\nA few columns were added to the `GuideSet` object to summarize the number of\non- and off-targets for each spacer sequence, taking into account genomic\ncontext:\n\n- **n0, n1, n2, n3**: specify number of alignments with 0, 1, 2 and 3\nmismatches, respectively.\n- **n0_c, n1_c, n2_c, n3_c**: specify number of alignments in a coding region,\nwith 0, 1, 2 and 3 mismatches, respectively.\n- **n0_p, n1_p, n2_p, n3_p**: specify number of alignments in a promoter region\nof a coding gene, with 0, 1, 2 and 3 mismatches, respectively.\n\nTo look at the individual on- and off-targets and their context, use the\n`alignments` function to retrieve a table of all genomic alignments stored in\nthe `GuideSet` 
object:\n\n```{r}\nalignments(guideSet)\n```\n\nThe functions `onTargets` and `offTargets` will return on-target alignments\n(no mismatch) and off-target alignments (with at least one mismatch),\nrespectively. See `?addSpacerAlignments` for more details about the \ndifferent options.\n\n\n\n### Iterative spacer alignments\n\ngRNAs that align to hundreds of different locations are highly unspecific\nand undesirable. This can also cause `addSpacerAlignments` to be slow. \nTo mitigate this, we provide `addSpacerAlignmentsIterative`, an iterative\nversion of `addSpacerAlignments` that curtails alignment searches \nfor gRNAs having more hits than the user-defined \nthreshold (see `?addSpacerAlignmentsIterative`).\n\n### Faster alignment by removing repeat elements\n\nTo remove protospacer sequences located in repeats or low-complexity\nDNA sequences (regions identified by RepeatMasker), which are usually \nnot of interest due to their low specificity, we provide the convenience \nfunction `removeRepeats`:\n\n```{r, eval=TRUE}\ndata(grRepeatsExample, package=\"crisprDesign\")\nguideSet \u003c- removeRepeats(guideSet,\n                          gr.repeats=grRepeatsExample)\n```\n\n\n## Off-target scoring\n\nAfter retrieving a list of putative off-targets and on-targets for\na given spacer sequence, we can use `addOffTargetScores` to \npredict the likelihood of the nuclease to cut at the off-targets based\non mismatch tolerance. Currently, off-target scoring is only available for the SpCas9\nnuclease (MIT and CFD algorithms):\n\n```{r, eval=TRUE, warning=FALSE, message=FALSE}\nguideSet \u003c- addOffTargetScores(guideSet)\nguideSet\n```\n\nNote that this will only work after calling `addSpacerAlignments`,\nas it requires a list of off-targets for each gRNA entry. The returned\n`GuideSet` object now has the additional columns `score_mit` and `score_cfd`\nrepresenting the gRNA-level aggregated off-target specificity scores. 
The \noff-target table also contains a cutting likelihood score for each gRNA \nand off-target pair:\n\n```{r}\nhead(alignments(guideSet))\n```\n\n## On-target scoring\n\n`addOnTargetScores` adds scores from all on-target efficiency \nalgorithms available in the R package `crisprScore` and \nappends them to the `GuideSet`. By default, scores for all available methods\nfor a given nuclease will be computed. Here, for the sake of time,\nlet's add only the CRISPRater score:\n\n```{r, eval=TRUE, warning=FALSE, message=FALSE}\nguideSet \u003c- addOnTargetScores(guideSet, methods=\"crisprater\")\nhead(guideSet)\n```\n\nSee the `crisprScore` vignette for a full description of the different scores. \n\n\n\n## Restriction enzymes\n\nRestriction enzymes are usually involved in the gRNA library synthesis process.\nRemoving gRNAs that contain specific restriction sites is often necessary.\nWe provide the function `addRestrictionEnzymes` to indicate whether or not\ngRNAs contain restriction sites for a user-defined set of enzymes:\n\n```{r, eval=TRUE, warning=FALSE, message=FALSE, results='hide'}\nguideSet \u003c- addRestrictionEnzymes(guideSet)\n```\n\nWhen no enzymes are specified, the function adds annotation for the following\ndefault enzymes: EcoRI, KpnI, BsmBI, BsaI, BbsI, PacI, ISceI and MluI. 
The\nfunction also has two additional arguments, `flanking5` and `flanking3`, to\nspecify nucleotide sequences flanking the spacer sequence (5' and 3',\nrespectively) in the lentiviral cassette that will be used for gRNA delivery.\nThe function will effectively search for restriction sites in the full sequence\n`[flanking5][spacer][flanking3]`.\n\nThe `enzymeAnnotation` function can be used to retrieve the added annotation:\n\n```{r}\nhead(enzymeAnnotation(guideSet))\n```\n\n\n## Gene annotation\n\nThe function `addGeneAnnotation` adds transcript- and gene-level \ncontextual information to gRNAs from a `TxDb`-like object:\n\n```{r, eval=TRUE,warning=FALSE, message=FALSE, results='hide'} \nguideSet \u003c- addGeneAnnotation(guideSet,\n                              txObject=grListExample)\n``` \n\nThe gene annotation can be retrieved using the function `geneAnnotation`:\n\n```{r}\ngeneAnnotation(guideSet)\n```\n\nIt contains a lot of information that contextualizes\nthe genomic location of the protospacer sequences.\n\nThe ID columns (`tx_id`, `gene_id`, `protein_id`, `exon_id`) give Ensembl IDs.\nThe `exon_rank` gives the order of the exon for the transcript, for example \"2\"\nindicates it is the second exon (from the 5' end) in the mature transcript. \n\nThe columns `cut_cds`, `cut_fiveUTRs`, `cut_threeUTRs` and `cut_introns` \nindicate whether the guide sequence overlaps with CDS, 5' UTR, 3' UTR,\nor an intron, respectively. \n\n`percentCDS` gives the location of the `cut_site` within the transcript as a\npercent from the 5' end to the 3' end. `aminoAcidIndex` gives the number of the\nspecific amino acid in the protein where the cut is predicted to occur.\n`downstreamATG` shows how many in-frame ATGs are downstream of the `cut_site`\n(and upstream from the defined percent transcript cutoff, `met_cutoff`),\nindicating a potential alternative translation initiation site that may\npreserve protein function. 
\n\nFor more information about the other columns, type `?addGeneAnnotation`.\n\n\n## TSS annotation\n\nSimilarly, one might want to know which protospacer sequences are located\nwithin promoter regions of known genes: \n\n```{r}\ndata(tssObjectExample, package=\"crisprDesign\")\nguideSet \u003c- addTssAnnotation(guideSet,\n                             tssObject=tssObjectExample)\ntssAnnotation(guideSet)\n```\n\nFor more information, type `?addTssAnnotation`.\n\n\n\n\n## SNP information\n\nCommon single-nucleotide polymorphisms (SNPs) can change the on-target and\noff-target properties of gRNAs by altering the binding.\nThe function `addSNPAnnotation` annotates gRNAs with respect to a\nreference database of SNPs (stored in a VCF file), specified by the `vcf`\nargument. \n\nVCF files for common SNPs (dbSNPs) can be downloaded from NCBI on the [dbSNP website](https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/).\nWe include in this package an example VCF file for common SNPs located in the\nproximity of human gene IQSEC3. This file was obtained by subsetting the dbSNP151 RefSNP\ndatabase around IQSEC3.\n\n\n```{r, eval=TRUE,warning=FALSE, message=FALSE}\nvcf \u003c- system.file(\"extdata\",\n                   file=\"common_snps_dbsnp151_example.vcf.gz\",\n                   package=\"crisprDesign\")\nguideSet \u003c- addSNPAnnotation(guideSet, vcf=vcf)\nsnps(guideSet)\n```\n\n\nThe `rs_site_rel` gives the relative position of the SNP with respect \nto the `pam_site`. `allele_ref` and `allele_minor` report the nucleotide of\nthe reference and minor alleles, respectively. `MAF_1000G` and `MAF_TOPMED`\nreport the minor allele frequency (MAF) in the 1000Genomes and TOPMED \npopulations. 
\n\n\n## Filtering and ranking gRNAs\n\nOnce gRNAs are fully annotated, it is easy to filter out any unwanted gRNAs\nsince `GuideSet` objects can be subsetted like regular vectors in R.\n\nAs an example, suppose that we only want to keep gRNAs that have percent\nGC between 20% and 80% and that do not contain a polyT stretch.\nThis can be achieved using the following lines:\n\n```{r, eval=FALSE}\nguideSet \u003c- guideSet[guideSet$percentGC\u003e=20]\nguideSet \u003c- guideSet[guideSet$percentGC\u003c=80]\nguideSet \u003c- guideSet[!guideSet$polyT]\n```\n\nSimilarly, it is easy to rank gRNAs based on a set of criteria \nusing the regular `order` function.\n\nFor instance, let's sort gRNAs by the CRISPRater on-target score:\n\n```{r, eval=TRUE}\n# Creating an ordering index based on the CRISPRater score:\n# Using the negative values to make sure higher scores are ranked first:\no \u003c- order(-guideSet$score_crisprater) \n# Ordering the GuideSet:\nguideSet \u003c- guideSet[o]\nhead(guideSet)\n```\n\nOne can also sort gRNAs using several annotation columns.\nFor instance, let's sort gRNAs using the CRISPRater score, but also by \nprioritizing first gRNAs that have no 1-mismatch off-targets:\n\n```{r, eval=TRUE}\no \u003c- order(guideSet$n1, -guideSet$score_crisprater) \n# Ordering the GuideSet:\nguideSet \u003c- guideSet[o]\nhead(guideSet)\n```\n\n\nThe `rankSpacers` function is a convenience function that implements \nour recommended rankings for the SpCas9, enAsCas12a and CasRx nucleases.\nFor a detailed description of our recommended rankings, see the\ndocumentation of `rankSpacers` by typing\n`?rankSpacers`.\n\nIf an Ensembl transcript ID is provided, the ranking function will also\ntake into account the position of the gRNA within the target CDS of \nthe transcript ID in the ranking procedure. Our recommendation is to specify\nthe Ensembl canonical transcript as the representative\ntranscript for the gene. 
In our example, ENST00000538872 is the canonical\ntranscript for IQSEC3:\n\n```{r, eval=FALSE}\ntx_id \u003c- \"ENST00000538872\"\nguideSet \u003c- rankSpacers(guideSet,\n                        tx_id=tx_id)\n```\n\n\n# CRISPRa/CRISPRi design\n\nFor CRISPRa and CRISPRi applications, the CRISPR nuclease is engineered to \nlose its endonuclease activity, and therefore should not introduce double-stranded\nbreaks (DSBs). We will use the dead SpCas9 (dSpCas9) nuclease as an example \nhere. Note that users don't have to distinguish between dSpCas9 and SpCas9\nwhen specifying the nuclease in `crisprDesign` and `crisprBase` as they do \nnot differ in terms of the characteristics stored in the `CrisprNuclease`\nobject.\n\n*CRISPRi*: Fusing dSpCas9 with a Krüppel-associated box (KRAB) domain has been\nshown to be effective at repressing transcription in mammalian cells\n[@crispri]. The dSpCas9-KRAB fused protein is a commonly-used construct to\nconduct CRISPR inhibition (CRISPRi) experiments. To achieve optimal inhibition,\ngRNAs are usually designed to target the region directly downstream of the gene\ntranscription start site (TSS).\n\n*CRISPRa*: dSpCas9 can also be used to activate gene expression\nby coupling the dead nuclease with activation factors.\nThe technology is termed CRISPR activation (CRISPRa), and\nseveral CRISPRa systems have been developed \n(see @crispracrisprireview for a review). For optimal activation, gRNAs are\nusually designed to target the region \ndirectly upstream of the gene TSS.  \n\n`crisprDesign` provides functionalities that take into account\ndesign rules that are specific to CRISPRa and CRISPRi applications. The\n`queryTss` function allows users to specify genomic coordinates of promoter\nregions. The `addTssAnnotation` function annotates gRNAs for known TSSs, and includes\na column named `dist_to_tss` that indicates the distance between the TSS\nposition and the PAM site of the gRNA. 
For CRISPRi, we recommend targeting \nthe 25-75bp region downstream of the TSS for optimal inhibition. \nFor CRISPRa, we recommend targeting the region 75-150bp upstream of the\nTSS for optimal activation; see [@sanson2018optimized] for more information.\n\nFor more information, please see the following two tutorials:\n\n- [CRISPR activation (CRISPRa) design](https://github.com/crisprVerse/Tutorials/tree/master/Design_CRISPRa)\n- [CRISPR interference (CRISPRi) design](https://github.com/crisprVerse/Tutorials/tree/master/Design_CRISPRi)\n\n# CRISPR base editing with BE4max\n\n\nWe illustrate the CRISPR base editing (CRISPRbe) functionalities \nof `crisprDesign` by designing and characterizing gRNAs targeting\nIQSEC3 using the cytidine base editor BE4max [@koblan2018improving]. \n\nWe first load the BE4max `BaseEditor` object available in `crisprBase`:\n\n```{r}\ndata(BE4max, package=\"crisprBase\")\nBE4max\n```\n\nThe editing probabilities of the base editor BE4max are stored in a matrix \nwhere rows correspond to the different nucleotide substitutions, and columns\ncorrespond to the genomic coordinate relative to the PAM site. \nThe `editingWeights` function from `crisprBase` retrieves \nthose probabilities. 
One can see that C to T editing is optimal \naround 15 nucleotides upstream of the PAM site for the BE4max base editor:\n\n```{r}\ncrisprBase::editingWeights(BE4max)[\"C2T\",]\n```\n\nWe obtain a `GuideSet` object using the first exon of the IQSEC3 \ngene and retain only the first 2 gRNAs for the sake of time:\n\n```{r}\ngr \u003c- queryTxObject(txObject=grListExample,\n                    featureType=\"cds\",\n                    queryColumn=\"gene_symbol\",\n                    queryValue=\"IQSEC3\")\ngs \u003c- findSpacers(gr[1],\n                  bsgenome=bsgenome,\n                  crisprNuclease=BE4max)\ngs \u003c- gs[1:2]\n```\n\nThe function `addEditedAlleles` finds, characterizes, and scores predicted\nedited alleles for each gRNA, for a chosen transcript. It requires a \ntranscript-specific annotation that can be obtained using the \nfunction `getTxInfoDataFrame`. Here, we will perform the\nanalysis using the main isoform of IQSEC3 (transcript id ENST00000538872).\n\n\nWe first get the transcript table for ENST00000538872, \n\n```{r}\ntxid \u003c- \"ENST00000538872\"\ntxTable \u003c- getTxInfoDataFrame(tx_id=txid,\n                              txObject=grListExample,\n                              bsgenome=bsgenome)\nhead(txTable)\n```\n\nand then add the edited alleles annotation to the `GuideSet`:\n\n```{r}\neditingWindow \u003c- c(-20,-8)\ngs \u003c- addEditedAlleles(gs,\n                       baseEditor=BE4max,\n                       txTable=txTable,\n                       editingWindow=editingWindow)\n```\n\nThe `editingWindow` argument specifies the window of editing that\nwe are interested in. When not provided, it uses the default window\nprovided in the `BaseEditor` object. 
Note that large windows \ncan greatly increase computing time, as the number of possible \nalleles grows exponentially.\n\nLet's retrieve the edited alleles for the \nfirst gRNA:\n\n```{r}\nalleles \u003c- editedAlleles(gs)[[1]]\n```\n\nIt is a `DataFrame` object that contains useful metadata:\n\n\n```{r}\nmetadata(alleles)\n```\n\nThe `wildtypeAllele` reports the unedited nucleotide sequence of the\nregion specified by the editing window (with respect to the gRNA PAM site).\nIt is always reported in the 5' to 3' direction on the strand corresponding \nto the gRNA strand. The `start` and `end` specify the corresponding \ncoordinates on the transcript. \n\nLet's look at the edited alleles:\n\n```{r}\nhead(alleles)\n```\n\nThe `DataFrame` is ordered so that the top predicted alleles \n(based on the `score` column) are shown first. The `score` \nrepresents the likelihood that the edited allele occurs relative\nto all possible edited alleles, and is calculated using the editing\nweights stored in the `BE4max` object. The `seq` column represents \nthe edited nucleotide sequences. Like the `wildtypeAllele` above, \nthey are always reported in the 5' to 3' direction on the strand \ncorresponding to the gRNA strand. The `variant` column indicates the \nfunctional consequence of the editing event (silent, nonsense or\nmissense mutation). If an edited allele leads to multiple \nediting events, the most detrimental mutation (nonsense over missense,\nmissense over silent) is reported. The `aa` column reports the resulting\nedited amino acid sequence. \n\n\n\nNote that several gRNA-level aggregate scores have also been added \nto the `GuideSet` object when calling `addEditedAlleles`:\n\n```{r}\nhead(gs)\n```\n\nThe `score_missense`, `score_nonsense` and `score_silent` columns \nrepresent aggregated scores for each mutation type. 
They were\nobtained by summing all scores for a given mutation type \nacross the set of edited alleles for a given gRNA. The `maxVariant`\ncolumn indicates the most likely mutation type for a given \ngRNA, and is based on the maximum aggregated score, which is stored \nin `maxVariantScore`. For instance, for spacer_1, the highest score \nis `score_missense`, and therefore `maxVariant` is set to missense.  \n\n\nFor more information, please see the following tutorial:\n\n- [CRISPR base editing (CRISPRbe) design](https://github.com/crisprVerse/Tutorials/tree/master/Design_CRISPRbe)\n\n\n\n# CRISPR knockdown with Cas13d\n\n\nIt is also possible to design gRNAs for RNA-targeting nucleases using \n`crisprDesign`. In contrast to DNA-targeting nucleases, the target spacer \nis composed of mRNA sequences instead of genomic DNA sequences. \n\nWe illustrate the functionalities of `crisprDesign` for RNA-targeting \nnucleases by designing gRNAs targeting IQSEC3 using the CasRx (RfxCas13d) nuclease [@cas13d]. \n\n\nWe first load the CasRx `CrisprNuclease` object from `crisprBase`:\n\n```{r}\ndata(CasRx, package=\"crisprBase\")\nCasRx\n```\n\nThe PFS sequence (the equivalent of a PAM sequence for RNA-targeting \nnucleases) for CasRx is `N`, meaning that CasRx has no specific PFS sequence preference. 
\n\n\nWe will now design CasRx gRNAs for the transcript ENST00000538872 of IQSEC3.\n\nLet's first extract all mRNA sequences for IQSEC3:\n\n\n```{r}\ntxid \u003c- c(\"ENST00000538872\",\"ENST00000382841\")\nmrnas \u003c- getMrnaSequences(txid=txid,\n                          bsgenome=bsgenome,\n                          txObject=grListExample)\nmrnas\n```\n\n\nWe can use the usual function `findSpacers` to design gRNAs, and we\nonly consider a subset of about 100 gRNAs for the sake of time:\n\n```{r}\ngs \u003c- findSpacers(mrnas[[\"ENST00000538872\"]],\n                  crisprNuclease=CasRx)\ngs \u003c- gs[1000:1100]\nhead(gs)\n```\n\nNote that all protospacer sequences are located on the original strand \nof the mRNA sequence. For RNA-targeting nucleases, the spacer and \nprotospacer sequences are the reverse complement of each other:\n\n\n```{r}\nhead(spacers(gs))\nhead(protospacers(gs))\n```\n\nThe `addSpacerAlignments` function can be used to perform an off-target search \nacross all mRNA sequences using the argument `custom_seq`. Here, for \nthe sake of time, we only search against the 2 \nisoforms of IQSEC3 stored in the `mrnas` object:\n\n```{r}\ngs \u003c- addSpacerAlignments(gs,\n                          aligner=\"biostrings\",\n                          txObject=grListExample,\n                          n_mismatches=1,\n                          custom_seq=mrnas)\ntail(gs)\n```\n\nThe columns `n0_gene` and `n0_tx` report the number of on-targets at \nthe gene- and transcript-level, respectively. For instance, `spacer_1095` \nmaps to the two isoforms of IQSEC3, as `n0_tx` is equal to 2:\n\n\n```{r}\nonTargets(gs[\"spacer_1095\"])\n```\n\n\nNote that one can also use the `bowtie` aligner to perform an off-target \nsearch against a set of mRNA sequences. This requires building a transcriptome\nbowtie index first instead of building a genome index. \nSee the `crisprBowtie` vignette for more details. 
\n\n\nFor more information, please see the following tutorial:\n\n- [CRISPR knockdown (CRISPRkd) design with CasRx](https://github.com/crisprVerse/Tutorials/tree/master/Design_CRISPRkd_CasRx)\n\n\n\n\n# Design for optical pooled screening (OPS)\n\n\nOptical pooled screening (OPS) combines image-based sequencing \n(in situ sequencing) of gRNAs and optical phenotyping on the \nsame physical wells [@ops]. In such experiments, gRNA spacer \nsequences are partially sequenced from the 5' end. From a\ngRNA design perspective, additional gRNA design constraints are\nneeded to ensure sufficient dissimilarity of the truncated spacer \nsequences. The length of the truncated sequences, which corresponds\nto the number of sequencing cycles, is fixed and chosen by the experimentalist.\n\n\nTo illustrate the functionalities of `crisprDesign` for designing OPS\nlibraries, we use the `guideSetExample`.\nWe will design an OPS library with 8 cycles. \n\n```{r}\nn_cycles=8\n```\n\n\nWe add the 8nt OPS barcodes to the GuideSet using the `addOpsBarcodes` function:\n\n```{r}\ndata(guideSetExample, package=\"crisprDesign\")\nguideSetExample \u003c- addOpsBarcodes(guideSetExample,\n                                  n_cycles=n_cycles)\nhead(guideSetExample$opsBarcode)\n```\n\nThe function `getBarcodeDistanceMatrix` calculates the nucleotide distance \nbetween a set of query barcodes and a set of target barcodes. The type of \ndistance (hamming or levenshtein) can be specified using the `dist_method` \nargument. The Hamming distance (default) only considers substitutions when\ncalculating distances, while the Levenshtein distance allows insertions and \ndeletions. 
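\n\nAs a minimal sketch of the difference between the two distances, the chunk below compares two made-up 8nt barcodes using base R only. The barcodes are hypothetical, and the calculation is independent of `crisprDesign`; it is not meant to show how `getBarcodeDistanceMatrix` is implemented.\n\n```{r}\n# Two hypothetical 8nt barcodes; the second is the first shifted by one base\nbarcode1 \u003c- \"ACGTACGT\"\nbarcode2 \u003c- \"CGTACGTA\"\n\n# Hamming distance: number of positions that differ (substitutions only)\nsum(strsplit(barcode1, \"\")[[1]] != strsplit(barcode2, \"\")[[1]])\n\n# Levenshtein distance: substitutions, insertions and deletions are allowed\ndrop(utils::adist(barcode1, barcode2))\n```\n\nThe shifted barcode differs at every position under the Hamming distance, but is only two edits away (one deletion and one insertion) under the Levenshtein distance. 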
\n\nWhen the argument `binnarize` is set to `FALSE`, the returned object is a \nmatrix of pairwise distances between query and target barcodes:\n\n\n```{r}\nbarcodes \u003c- guideSetExample$opsBarcode\ndist \u003c- getBarcodeDistanceMatrix(barcodes[1:5],\n                                 barcodes[6:10],\n                                 binnarize=FALSE)\nprint(dist)\n```\n\n\nWhen `binnarize` is set to `TRUE` (default), the matrix of distances is\nbinarized so that 1 indicates similar barcodes, and 0 indicates \ndissimilar barcodes. The `min_dist_edit` argument specifies the minimal\ndistance between two barcodes to be considered dissimilar:\n\n```{r}\ndist \u003c- getBarcodeDistanceMatrix(barcodes[1:5],\n                                 barcodes[6:10],\n                                 binnarize=TRUE,\n                                 min_dist_edit=4)\nprint(dist)\n```\n\nThe `designOpsLibrary` function allows users to perform a complete end-to-end \nlibrary design; see `?designOpsLibrary` for documentation. \n\n\nFor more information, please see the following tutorial:\n\n- [Design for OPS](https://github.com/crisprVerse/Tutorials/tree/master/Design_OPS)\n\n\n\n\n# Design of gRNA pairs with the `PairedGuideSet` object\n\nThe `findSpacerPairs` function in `crisprDesign` enables the design of\npairs of gRNAs and works similarly to `findSpacers`. 
As an example, we\nwill design candidate pairs of gRNAs that target a small locus located\non chr12 in the human genome:\n\n```{r}\nlibrary(GenomicRanges)\nlibrary(BSgenome.Hsapiens.UCSC.hg38)\nlibrary(crisprBase)\nbsgenome \u003c- BSgenome.Hsapiens.UCSC.hg38\n```\n\n\nWe first specify the genomic locus:\n\n```{r}\ngr \u003c- GRanges(c(\"chr12\"),\n              IRanges(start=22224014, end=22225007))\n```\n\nand find all pairs using the function `findSpacerPairs`:\n\n```{r}\npairs \u003c- findSpacerPairs(gr, gr, bsgenome=bsgenome)\n```\n\nThe first and second arguments of the function specify which \ngenomic region the first and second gRNA should target, respectively.\nIn our case, we are targeting the same region with both gRNAs. The \nother arguments of the function are similar to those of the `findSpacers` \nfunction described above. \n\nThe output object is a `PairedGuideSet`, which can be thought of as a \nlist of two `GuideSet` objects:\n\n```{r}\npairs\n```\n\nThe first and second `GuideSet` store information about gRNAs at position\n1 and position 2, respectively. They can be accessed using the `first`\nand `second` functions:\n\n```{r}\ngrnas1 \u003c- first(pairs)\ngrnas2 \u003c- second(pairs)\ngrnas1\ngrnas2\n```\n\nThe `pamOrientation` function returns the PAM orientation of the pairs:\n\n```{r}\nhead(pamOrientation(pairs))\n```\n\nand takes one of 4 values: `in` (for PAM-in configuration), `out` \n(for PAM-out configuration), `fwd` (both gRNAs target the forward strand)\nand `rev` (both gRNAs target the reverse strand). \n\nThe function `pamDistance` returns the distance between the PAM sites of\nthe two gRNAs. The function `cutLength` returns the distance between the\ncut sites of the two gRNAs. 
The function `spacerDistance` returns the \ndistance between the two spacer sequences of the gRNAs.\n\n\nFor more information, please see the following tutorial:\n\n- [Paired gRNA design](https://github.com/crisprVerse/Tutorials/tree/master/Design_PairedGuides)\n\n\n# Miscellaneous design use cases\n\n## Design with custom sequences\n\n`crisprDesign` also allows gRNA design for DNA sequences without\ngenomic context (such as a synthesized DNA construct). See `?findSpacers`\nfor more information, and here's an example:\n\n```{r}\nseqs \u003c- c(seq1=\"AGGCGGAGGCCCGACCCGGGCGCGGGGCGGCGC\",\n          seq2=\"AGGCGGAGGCCCGACCCGGGCGCGGGAAAAAAAGGC\")\ngs \u003c- findSpacers(seqs)\nhead(gs)\n```\n\n## Off-target search in custom sequences\n\nOne can also search for off-targets in a custom sequence as follows:\n\n\n```{r}\nontarget \u003c- \"AAGACCCGGGCGCGGGGCGGGGG\"\nofftarget \u003c- \"TTGACCCGGGCGCGGGGCGGGGG\"\ngs \u003c- findSpacers(ontarget)\ngs \u003c- addSpacerAlignments(gs,\n                          aligner=\"biostrings\",\n                          n_mismatches=2,\n                          custom_seq=offtarget)\nhead(alignments(gs))\n```\n\n\nFor more information, please see the following tutorial:\n\n- [Working with custom DNA sequences](https://github.com/crisprVerse/Tutorials/tree/master/Design_Custom_Sequence)\n\n\n\n\n\n# Session Info\n\n```{r}\nsessionInfo()\n```\n\n# References\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrisprverse%2Fcrisprdesign","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcrisprverse%2Fcrisprdesign","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrisprverse%2Fcrisprdesign/lists"}