{"id":28461838,"url":"https://github.com/pinellolab/crispr-millipede-target","last_synced_at":"2025-10-09T13:03:39.624Z","repository":{"id":256239501,"uuid":"739071789","full_name":"pinellolab/CRISPR-millipede-target","owner":"pinellolab","description":"Calculate the enrichment scores of CRISPR alleles and variants from direct target amplicon-sequencing data using Bayesian linear regression model \"millipede\".","archived":false,"fork":false,"pushed_at":"2024-11-27T20:28:48.000Z","size":7558,"stargazers_count":1,"open_issues_count":2,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-07-03T14:46:58.227Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pinellolab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-04T17:49:51.000Z","updated_at":"2024-11-27T20:28:52.000Z","dependencies_parsed_at":"2024-09-09T20:04:14.450Z","dependency_job_id":"6238d887-5c23-4313-964d-a1a10c8ecdea","html_url":"https://github.com/pinellolab/CRISPR-millipede-target","commit_stats":null,"previous_names":["pinellolab/crispr-millipede-target"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/pinellolab/CRISPR-millipede-target","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pinellolab%2FCRISPR-millipede-target","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pinellolab%2FCRISPR-millipede-target/tags","releases_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/pinellolab%2FCRISPR-millipede-target/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pinellolab%2FCRISPR-millipede-target/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pinellolab","download_url":"https://codeload.github.com/pinellolab/CRISPR-millipede-target/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pinellolab%2FCRISPR-millipede-target/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279001535,"owners_count":26083102,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-09T02:00:07.460Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-07T04:07:40.318Z","updated_at":"2025-10-09T13:03:39.618Z","avatar_url":"https://github.com/pinellolab.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ![PyPI - Version](https://img.shields.io/pypi/v/crispr-millipede) CRISPR-Millipede User Documentation\r\n\u003cimg src=\"https://github.com/user-attachments/assets/d54fbe8f-2c0e-4354-a209-eab031c3bd64\" alt=\"CRISPR-Millipede logo\" width=\"500\"\u003e\u003c/img\u003e \r\n\r\n# \r\n \r\n**CRISPR-Millipede** was developed by the *Pinello Lab* as an easy-to-use Python package for \u003cins\u003e processing 
targeted amplicon-sequencing of tiled sequences from base-editing tiling screens to identify functional nucleotides\u003c/ins\u003e. By providing amplicon-sequencing of installed alleles from multiple phenotypic populations, CRISPR-Millipede identifies the single variants that contribute to differences in phenotype. See [this preprint](https://www.biorxiv.org/content/10.1101/2024.09.09.612085v1) for more information on this method! It is expected that you are familiar with Python, command-line tools, and CRISPR screens to follow this guide.\r\n\r\n \r\n  \r\n**Sections**\r\n- [Notes on Experimental Design and Expected Inputs](#notes-on-experimental-design-and-expected-inputs)\r\n- [Installation](#installation)\r\n- [System Requirements](#system-requirements)\r\n- [Instructions](#instructions)\r\n  -  [STEP 1: Run CRISPResso2 to generate allele tables](#step-1-run-crispresso2-to-generate-allele-tables)\r\n  -  [STEP 2: Encode the CRISPResso2 outputs into matrices](#step-2-encode-the-crispresso2-outputs-into-matrices)\r\n  -  [STEP 3: Perform modelling of the encoded dataset](#step-3-perform-modelling-of-the-encoded-dataset)\r\n  -  [STEP 4: Generate Board Plots](#step-4-generate-board-plots)\r\n  -  [STEP 5: PyDESeq2 based analysis](#step-5-pydeseq2-based-analysis)\r\n  \r\n### Notes on Experimental Design and Expected Inputs\r\n*Skip this and scroll further down if interested in the tool usage*\r\n- This tool is best used for pooled CRISPR saturation mutagenesis screens of a single focused region. \r\n- The length of the mutagenized region depends on the desired sequencing read length (e.g. paired-end 150bp sequencing has a maximum mutagenesis length of 300bp, although as much overlap between the paired ends as possible is desirable to maximize sequencing quality). You will perform targeted amplicon-sequencing of your intended mutagenized region. 
Ensure that no editing occurs at the primer binding sites, and ensure that the primers are tested and optimized beforehand (i.e. difficult-to-amplify or difficult-to-sequence regions may not be suitable for this method, therefore it is essential that this is tested prior to screening).\r\n- It is suggested that you also perform the standard sequencing of the guide RNA to calculate guide RNA enrichment scores in tandem. Therefore, you will split your genomic DNA into two different library preparation approaches: guide RNA sequencing and the aforementioned direct target sequencing.\r\n- The type of mutagenesis is best suited to single-nucleotide mutagenesis (i.e. base-editing and prime-editing). The method has not been extensively tested on in-del mutagenesis. \r\n- This model was developed and tested on FACS-sorted based screens rather than proliferation screens, however the model may still work for proliferation screens by comparing samples between two separate timepoints.\r\n- Ensure that you have sufficient cell coverage for sequencing, especially if doing both guide RNA and direct target sequencing. You should preferably have roughly 1000 cells * number of guide RNAs in your library for EACH guide RNA and direct sequencing approach (therefore 2000 cells * number of guide RNAs if doing both sequencing approaches) for EACH sample. The cell coverage depends on the editing efficiency and the expected effect sizes. 
Typically, the sorted population with the phenotypic change from the baseline after perturbation will have the lowest coverage, therefore you should ensure that you have sufficient cell counts in all populations prior to sequencing (by modifying your FACS gates while still maintaining separation between your negative and positive control gRNAs or by simply increasing input cell amount at the expense of longer sort time).\r\n- Ensure that you have sufficient biological replicates (at least 3 replicates).\r\n- It is not necessary to haploidize your region to have single-copy alleles, though this may reduce the noise of the phenotypic scores for each sequenced allele due to certain homozygosity of the sequenced allele. \r\n- While this method is robust to biases in different editing efficiencies among your guide RNAs since alleles are directly sequenced, ensuring high editing efficiency will increase the per-allele coverage in your samples thereby reducing the necessary cell coverage and increasing statistical power.\r\n\r\nSee **Figure a** below for a schematic of the experimental design:\r\n\r\n  \u003cimg src=\"https://github.com/user-attachments/assets/6ec0a352-aeb2-453b-81d4-ab812c88399b\" alt=\"CRISPR-CLEAR framework\" width=\"300\"\u003e\u003c/img\u003e\r\n    \r\n  \u003cem\u003e**Figure a:** The workflow illustrates the key steps from guide RNA design to data analysis. First, cells stably expressing a base editor are transduced with a library of guide RNAs tiling the regulatory sequence. After editing, cells are FACS-sorted based on the expression of the target protein. Genomic DNA is extracted from sorted cells. Next-generation libraries are prepared to quantify sgRNA counts and to measure the distribution of edits at the endogenous sequence in the sorted population of cells. The left pathway shows the standard approach using sgRNA count-based readout and the CRISPR-SURF pipeline for deconvolution of functional regions. 
The right pathway depicts the CRISPR-CLEAR approach using direct allele-based readout and the CRISPR-Millipede pipeline, enabling precise genotype-to-phenotype linkage through per-allele and per-nucleotide analysis.\u003c/em\u003e\r\n\r\nAfter performing the screen, you should have targeted amplicon-sequencing FASTQs for each of your phenotypic populations (i.e. different FACS gates along with the pre-sort sample) for multiple biological replicates. The pipeline consists of four steps: 1) quality-control the reads using FASTQC to ensure sufficient read quality of all samples, 2) run all the samples through CRISPResso2 to characterize the introduced alleles in your samples, 3) encode the alleles in a numerical representation for Millipede modelling, and 4) perform the Millipede modelling to obtain your results. See **Figure b** below for a schematic of the pipeline steps:\r\n\r\n\u003cimg src=\"https://github.com/user-attachments/assets/0cbb44c8-e073-44c3-be54-fa6239871895\" alt=\"CRISPR-CLEAR framework\" width=\"300\"\u003e\u003c/img\u003e\r\n\r\n\u003cem\u003e**Figure b:** Schematic of CRISPR-Millipede workflow.\u003c/em\u003e\r\n\r\n### Installation\r\n\r\nCRISPResso2 (a *Pinello Lab* tool) is required for the first step, to prepare the input for CRISPR-Millipede. See the [CRISPResso2 repository](https://github.com/pinellolab/CRISPResso2) for installation instructions. You can install it in a separate conda environment from CRISPR-Millipede (preferred). If you want it in the same environment, install CRISPResso2 before CRISPR-Millipede. \r\n\r\nCRISPR-Millipede requires **Python versions \u003e=3.10,\u003c3.12** which can be installed from the [Python download page](https://www.python.org/downloads/) or via **Conda** (see installation of Conda [here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html)). Optionally, you can use [**mamba**](https://github.com/mamba-org/mamba/blob/main/README.md) for faster installation. 
For installing Python via Conda:\r\n\r\n```conda install python=3.10```\r\n\r\nAdditionally, CRISPR-Millipede requires **PyTorch**, which can be installed via **Conda**. If your computer does not have a GPU, install the CPU version of PyTorch:\r\n\r\n```conda install pytorch```\r\n\r\nIf you have a GPU, ensure that you have CUDA installed by checking the CUDA version (for example version 11.8):\r\n\r\n```nvcc --version```\r\n\r\nIf you don't have CUDA installed, follow the [NVIDIA CUDA installation guide](https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html).\r\n\r\nThen, install the appropriate GPU version of PyTorch with the version of **pytorch-cuda** that matches the CUDA version installed on your OS (for example version 11.8):\r\n\r\n```conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia```\r\n\r\nOnce you have all Python and PyTorch dependencies installed, CRISPR-Millipede can easily be installed from PyPI, which should only take a few minutes. pip will ensure that all Python package dependencies are installed:\r\n\r\n```pip install crispr-millipede==0.1.97```\r\n\r\n***Did you also directly sequence your guide RNAs?*** It is recommended you do so to compare against the CRISPR-Millipede results from target amplicon-sequencing. You could map your guide sequences using tools from the *Pinello Lab* such as [CRISPR-Correct](https://github.com/pinellolab/CRISPR-Correct) and analyze the resulting counts using [CRISPR-SURF](https://github.com/pinellolab/CRISPR-SURF/tree/master) as done in the original paper! 
\r\n\r\nPyDESeq2 can also be installed from PyPI, using the following command:\r\n\r\n```pip install pydeseq2```\r\n\r\n### System Requirements\r\nCRISPR-Millipede can run on [any operating system where Python versions \u003e=3.10,\u003c3.12 can be installed](https://www.python.org/downloads/operating-systems/) and where [PyTorch can be installed](https://pytorch.org/get-started/locally/). To speed up the pipeline, CRISPR-Millipede can utilize both multiple CPUs (for multi-threading) and GPUs (for model training). Using both is highly recommended so that the pipeline runs in the span of a couple of hours; the tool still works on single-core, non-GPU computers, but each run attempt may then take about a day depending on the FASTQ sizes. \r\n\r\n### Installation and Run Time\r\nOn a MacBook Pro (M2 chip with 32 GB RAM):\r\n- Installation takes about 1 min 20 secs via pip after installing PyTorch\r\n- Running Step 1 (CRISPResso2) takes about 5 mins on the sg218 example\r\n- Running Step 2 (Encoding) takes about 20 mins on the sg218 example\r\n- Running Step 3 (Millipede: `model_run = cmm.MillipedeModelExperimentalGroup(experiments_inputdata=model_input_data, device=cmm.MillipedeComputeDevice.CPU)`) takes about 2 minutes for the sg218 example in the notebook\r\n\r\n\r\n## Instructions\r\n\r\n### STEP 1: Run CRISPResso2 to generate allele tables\r\n*We need to take the raw amplicon-sequencing data and encode it into an input that CRISPR-Millipede accepts. It is suggested that your amplicon-sequencing data is quality-controlled using [FASTQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to ensure sequencing quality.*\r\n\r\nCRISPR-Millipede's encoding step takes as input the allele frequency tables produced by [CRISPResso2](https://github.com/pinellolab/CRISPResso2), a *Pinello Lab* tool for processing amplicon-sequencing data from CRISPR experiments. 
Refer to the [CRISPResso2 documentation](https://github.com/pinellolab/CRISPResso2) for instructions on how to run CRISPResso2, which may depend on the type of CRISPR editing performed in your experiment.\r\n\r\n\r\nExample command for a base-editing experiment:\r\n```\r\nCRISPRessoBatch\r\n  -bs {FASTQ_FILENAME} -a {AMPLICON_SEQUENCE} -an {AMPLICON_NAME}\r\n  -q {QUALITY}\r\n  --exclude_bp_from_left {EX_LEFT} --exclude_bp_from_right {EX_RIGHT}\r\n  --no_rerun -n {SCREEN_NAME}\r\n  --min_frequency_alleles_around_cut_to_plot 0.001\r\n  --max_rows_alleles_around_cut_to_plot 500\r\n  -p 20 --plot_window_size 4 --base_editor_output -w 0\r\n  -bo {OUTPUT_DIRECTORY}\r\n```\r\n\r\nRun CRISPResso2 for all samples and replicates. For each sample, CRISPResso2 will produce an allele frequency table named \"Alleles_frequency_table.zip\" which is used as input to the CRISPR-Millipede package. You will need these files in the next step. CRISPResso2 will also produce several other plots characterizing the editing patterns of your samples which will be useful for initial exploration of your data prior to modelling!\r\n\r\n### STEP 2: Encode the CRISPResso2 outputs into matrices\r\n*The CRISPResso2 output contains a table of alleles and their read counts for each sample. 
The alleles are represented as strings, which must be encoded into a numerical representation for CRISPR-Millipede modelling.*\r\n\r\nImport and prepare the parameters of the encoding step by passing in the amplicon sequence (required), the acceptable variant types (optional), predicted editing sites (optional), population column suffixes for indexing (required), and encoding edge trimming for reducing sequencing background (optional) to the `EncodingParameters` class.\r\n\r\nBelow is the class definition (and default values) of `EncodingParameters`, which you will need to instantiate:\r\n\r\n```\r\n@dataclass\r\nclass EncodingParameters:\r\n    complete_amplicon_sequence: str # Amplicon sequence string\r\n    population_baseline_suffix: Optional[str] = \"_baseline\" # Typically the population that unedited cells are primarily in. Suffix label\r\n    population_target_suffix: Optional[str] = \"_target\" # The population used to calculate variant enrichment relative to the baseline population. Suffix label\r\n    population_presort_suffix: Optional[str] = \"_presort\" # The un-sorted population used to calculate total editing efficiencies. Suffix label\r\n    wt_suffix: Optional[str] = \"_wt\" # An unedited population to calculate the sequencing error background. Suffix label\r\n    guide_edit_positions: List[int] = field(default_factory=list) # Position of expected editing sites. Positions are relative to the amplicon sequence. 0-based.\r\n    guide_window_halfsize: int = 3 # Expected editing window size. 
Only edits in range(guide_edit_position-guide_window_halfsize,guide_edit_position+guide_window_halfsize+1) for all positions will be considered for modelling\r\n    minimum_editing_frequency: float = 0 # Frequency of variants to consider for editing, may be useful for removing sequencing background.\r\n    minimum_editing_frequency_population: List[str] = field(default_factory=list) # Population to consider for removal of variants by frequency, i.e. [\"presort\"]\r\n    variant_types: List[Tuple[str, str]] = field(default_factory=list) #  List of variants to consider for modelling. Variants represented as two-value tuple where first index is REF and second index is ALT. i.e.  [(\"A\", \"G\"), (\"T\", \"C\")] for adenine base-editing variants.\r\n    trim_left: int = 0 # Filtering positions on left side of amplicon\r\n    trim_right: int = 0 # Filtering positions on right side of amplicon\r\n    remove_denoised: bool = False # Remove filtered features (from above criteria) from model input.  
\r\n```\r\n\r\n\r\nExample of setting encoding parameters:\r\n```\r\nfrom crispr_millipede import encoding as cme\r\n\r\nAMPLICON = \"ACTGACTGACTGACTGACTGACTG\" # Put your complete reference amplicon-sequence here\r\nABE_VARIANT_TYPES = [(\"A\", \"G\"), (\"T\", \"C\")] # Optional: If using an adenine base-editor\r\nCBE_VARIANT_TYPES = [(\"C\", \"T\"), (\"G\", \"A\")] # Optional: If using a cytosine base-editor\r\nencoding_parameters = cme.EncodingParameters(complete_amplicon_sequence=AMPLICON,\r\n                            population_baseline_suffix=\"_baseline\", \r\n                            population_target_suffix=\"_target\", \r\n                            population_presort_suffix=\"_presort\", \r\n                            wt_suffix=\"_wt\", \r\n                            trim_left=20, \r\n                            trim_right=20, \r\n                            variant_types=ABE_VARIANT_TYPES, \r\n                            remove_denoised=True)\r\n```\r\n\r\n\r\nTo load the CRISPResso2 allele frequency tables into CRISPR-Millipede from STEP 1, pass in the `EncodingParameters` object and the CRISPResso2 allele frequency table filenames from STEP 1 for each population. 
For each population, provide a list of filenames corresponding to each replicate:\r\n\r\n```\r\nencoding_dataframes = cme.EncodingDataFrames(encoding_parameters=encoding_parameters, #  From example above\r\n                                                 reference_sequence=encoding_parameters.complete_amplicon_sequence,\r\n                                                 population_baseline_filepaths=[\"CRISPResso_on_sample_baseline_1/Alleles_frequency_table.zip\", \r\n                                                                                \"CRISPResso_on_sample_baseline_2/Alleles_frequency_table.zip\", \r\n                                                                                \"CRISPResso_on_sample_baseline_3/Alleles_frequency_table.zip\"],\r\n                                                 population_target_filepaths=[\"CRISPResso_on_sample_target_1/Alleles_frequency_table.zip\", \r\n                                                                              \"CRISPResso_on_sample_target_2/Alleles_frequency_table.zip\", \r\n                                                                              \"CRISPResso_on_sample_target_3/Alleles_frequency_table.zip\"],\r\n                                                 population_presort_filepaths=[\"CRISPResso_on_sample_presort_1/Alleles_frequency_table.zip\", \r\n                                                                               \"CRISPResso_on_sample_presort_2/Alleles_frequency_table.zip\", \r\n                                                                               \"CRISPResso_on_sample_presort_3/Alleles_frequency_table.zip\"],\r\n                                                 wt_filepaths=[root_dir + \"CRISPResso_on_sample_wt_1/Alleles_frequency_table.zip\"])\r\n```\r\n\r\n\r\nPerform the encoding:\r\n```\r\nencoding_dataframes.read_crispresso_allele_tables() # This reads in the CRISPResso2 table\r\nencoding_dataframes.encode_crispresso_allele_table(progress_bar=True, 
cores={CPUS}) # Performs the initial encoding. Replace {CPUS} with the number of CPUs for parallelization on your system. \r\nencoding_dataframes.postprocess_encoding() # Postprocesses the encoding with the filtering criteria from above.\r\n```\r\n\r\nIt is highly suggested that you save the results of the encodings to your drive, and we encourage including a prefix to version the results. These files will be used as input to the modelling in STEP 3.\r\n```\r\nprefix_label = \"20240916_v1_example_\"\r\n\r\ncme.save_encodings(encoding_dataframes.encodings_collapsed_merged, sort_column=\"#Reads_presort\", filename=prefix_label + \"encoding_dataframes_editor_encodings_rep{}.tsv\")\r\ncme.save_encodings(encoding_dataframes.population_wt_encoding_processed, sort_column=\"#Reads_wt\", filename=prefix_label + \"encoding_dataframes_wt_encodings_rep{}.tsv\")\r\ncme.save_encodings_df(encoding_dataframes.population_baseline_encoding_processed, filename=prefix_label + \"encoding_dataframes_baseline_editor_encodings_rep{}.pkl\")\r\ncme.save_encodings_df(encoding_dataframes.population_target_encoding_processed, filename=prefix_label + \"encoding_dataframes_target_editor_encodings_rep{}.pkl\")\r\ncme.save_encodings_df(encoding_dataframes.population_presort_encoding_processed, filename=prefix_label + \"encoding_dataframes_presort_editor_encodings_rep{}.pkl\")\r\ncme.save_encodings_df(encoding_dataframes.population_wt_encoding_processed, filename=prefix_label + \"encoding_dataframes_wt_encodings_rep{}.pkl\")\r\n```\r\n\r\n### STEP 3: Perform modelling of the encoded dataset\r\n*Now that we have the encoded representation of the alleles, we will perform Millipede modelling on this representation. 
For documentation on the Millipede model sub-package, see [here](https://millipede.readthedocs.io/en/latest/getting_started.html).*\r\n\r\n**Set the model parameters:** Below is the class definition (and default values) of `MillipedeDesignMatrixProcessingSpecification`, which you will need to instantiate:\r\n\r\n```\r\n@dataclass\r\nclass MillipedeDesignMatrixProcessingSpecification:\r\n    wt_normalization: bool = True # Normalize the read count based on the unedited allele counts\r\n    total_normalization: bool = False # Normalize the read count based on the total sum of all allele counts\r\n    sigma_scale_normalized: bool = False # If using the NormalLikelihoodVariableSelector, determine if the sigma_scale factor will be based on the normalized read count\r\n    decay_sigma_scale: bool = True # Set the sigma_scale factor based on the decay function\r\n    K_enriched: Union[float, List[float], List[List[float]]] = 5 # Set the K_enriched value of the decay function\r\n    K_baseline: Union[float, List[float], List[List[float]]] = 5 # Set the K_baseline value of the decay function\r\n    a_parameter: Union[float, List[float], List[List[float]]] = 300 # Set the a_parameter of the decay function\r\n```\r\n\r\nAdditionally, you will need to specify the type of model. 
Below contains the class definition (and default values) of the `MillipedeModelSpecification` that you will need to instantiate:\r\n\r\n```\r\n@dataclass\r\nclass MillipedeModelSpecification:\r\n    \"\"\"\r\n        Defines all specifications to produce Millipede model(s)\r\n    \"\"\"\r\n    model_types: List[MillipedeModelType] \r\n    replicate_merge_strategy: MillipedeReplicateMergeStrategy\r\n    experiment_merge_strategy: MillipedeExperimentMergeStrategy\r\n    cutoff_specification: MillipedeCutoffSpecification\r\n    design_matrix_processing_specification: MillipedeDesignMatrixProcessingSpecification\r\n    shrinkage_input: Union[MillipedeShrinkageInput, None] = None\r\n    S: float = 1.0 #S parameter\r\n    tau: float = 0.01 #tau parameter\r\n    tau_intercept: float = 1.0e-4\r\n```\r\n\r\nThere are sub-classes you will need to instantiate. For instance, the `MillipedeReplicateMergeStrategy` specifies how multiple replicates are handled during modelling:\r\n```\r\nclass MillipedeReplicateMergeStrategy(Enum):\r\n    \"\"\"\r\n        Defines how separate replicates will be treated during modelling\r\n    \"\"\"\r\n    SEPARATE = \"SEPARATE\" # Replicates are modelled separately; one model per replicate\r\n    SUM = \"SUM\" # (Normalized) counts for all replicates are summed together; one model for all replicates\r\n    COVARIATE = \"COVARIATE\" # Replicates are jointly modelled, though replicate ID is included in the model design matrix \r\n```\r\n\r\n*Recommended to run one version in `MillipedeReplicateMergeStrategy.SEPARATE` to assess individual replicate consistency, then if successful, run a final model in `MillipedeReplicateMergeStrategy.COVARIATE`*\r\n    \r\nLikewise, the `MillipedeExperimentMergeStrategy` specifies how multiple experiments (i.e. 
screens with different editors) are handled during modelling.\r\n```\r\nclass MillipedeExperimentMergeStrategy(Enum):\r\n    \"\"\"\r\n        Defines how separate experiments will be treated during modelling\r\n    \"\"\"\r\n    SEPARATE = \"SEPARATE\"\r\n    SUM = \"SUM\"\r\n    COVARIATE = \"COVARIATE\"\r\n```\r\n\r\nThe `MillipedeModelType` specifies what likelihood function to use for model fitting. See the [Millipede documentation](https://millipede.readthedocs.io/en/latest/selection.html) for more information. \r\n```\r\nclass MillipedeModelType(Enum):\r\n    \"\"\"\r\n        Defines the Millipede model likelihood function used\r\n    \"\"\"\r\n    NORMAL = \"NORMAL\"\r\n    NORMAL_SIGMA_SCALED = \"NORMAL_SIGMA_SCALED\"\r\n    BINOMIAL = \"BINOMIAL\"\r\n    NEGATIVE_BINOMIAL = \"NEGATIVE_BINOMIAL\"\r\n```\r\n*We recommend using the NORMAL_SIGMA_SCALED model. With this model, you will need to define the K_enriched, K_baseline, a_parameter, and decay_sigma_scale parameters to specify how the sigma_scale_factor is calculated.*\r\n\r\nHere is an example of specifying the complete input parameters for modelling:\r\n```\r\nfrom crispr_millipede import encoding as cme\r\nfrom crispr_millipede import modelling as cmm\r\n\r\ndesign_matrix_spec = cmm.MillipedeDesignMatrixProcessingSpecification(\r\n    wt_normalization=False,\r\n    total_normalization=True,\r\n    sigma_scale_normalized=True,\r\n    decay_sigma_scale=True,\r\n    K_enriched=5,\r\n    K_baseline=5,\r\n    a_parameter=0.0005\r\n)\r\n\r\nmillipede_model_specification_set = {\r\n    \"model_specification_1\" : cmm.MillipedeModelSpecification(\r\n        model_types=[cmm.MillipedeModelType.NORMAL_SIGMA_SCALED],\r\n        replicate_merge_strategy=cmm.MillipedeReplicateMergeStrategy.COVARIATE,\r\n        experiment_merge_strategy=cmm.MillipedeExperimentMergeStrategy.SEPARATE,\r\n        S = 5,\r\n        tau = 0.01,\r\n        tau_intercept = 0.0001,\r\n        cutoff_specification=cmm.MillipedeCutoffSpecification(\r\n            
per_replicate_each_condition_num_cutoff = 0, \r\n            per_replicate_all_condition_num_cutoff = 1, \r\n            all_replicate_num_cutoff = 0, \r\n            all_experiment_num_cutoff = 0,\r\n            baseline_pop_all_condition_each_replicate_num_cutoff = 3,\r\n            baseline_pop_all_condition_acceptable_rep_count = 2,\r\n            enriched_pop_all_condition_each_replicate_num_cutoff = 3,\r\n            enriched_pop_all_condition_acceptable_rep_count = 2,\r\n            presort_pop_all_condition_each_replicate_num_cutoff = 3,\r\n            presort_pop_all_condition_acceptable_rep_count = 2\r\n        ),\r\n        design_matrix_processing_specification=design_matrix_spec\r\n    )\r\n}\r\n```\r\n\r\n**Load in the encoding data:** Now that you have specified the model inputs, let's load the encoding data in, which should be straightforward:\r\n\r\n```\r\nprefix_label =\"20240916_v1_example_\"\r\nencoding_filename = prefix_label + \"encoding_dataframes_editor_encodings_rep{}.tsv\"\r\n\r\n# This will load in the data\r\nmodel_input_data = cmm.MillipedeInputDataExperimentalGroup(\r\n    data_directory=\"./\", \r\n    enriched_pop_fn_experiment_list = [encoding_filename],\r\n    enriched_pop_df_reads_colname = \"#Reads_target\",\r\n    baseline_pop_fn_experiment_list = [encoding_filename],\r\n    baseline_pop_df_reads_colname = \"#Reads_baseline\", \r\n    presort_pop_fn_experiment_list = [encoding_filename],\r\n    presort_pop_df_reads_colname = '#Reads_presort',\r\n    experiment_labels = [\"editor\"],\r\n    reps = [0,1,2],\r\n    millipede_model_specification_set = millipede_model_specification_set\r\n   )\r\n```\r\n\r\n**Run the model:** Now that you have specified the inputs, we will now run the model. 
You have the option to use the CPU or GPU for modelling.\r\n\r\n```\r\nmodel_run = cmm.MillipedeModelExperimentalGroup(experiments_inputdata=model_input_data, device=cmm.MillipedeComputeDevice.GPU)\r\n```\r\n\r\n**Explore the results:** The model will provide posterior inclusion probabilities (PIP) and beta coefficient scores for each feature/variant that was included in the model and not filtered out during the encoding step:\r\n\r\n```\r\n# model_run is the object created above; the key matches the one defined in millipede_model_specification_set\r\nbeta_df = model_run.millipede_model_specification_set_with_results['model_specification_1'].millipede_model_specification_result_input[0].millipede_model_specification_single_matrix_result[cmm.MillipedeModelType.NORMAL_SIGMA_SCALED].beta\r\npip_df = model_run.millipede_model_specification_set_with_results['model_specification_1'].millipede_model_specification_result_input[0].millipede_model_specification_single_matrix_result[cmm.MillipedeModelType.NORMAL_SIGMA_SCALED].pip\r\nsigma_hit_table = model_run.millipede_model_specification_set_with_results['model_specification_1'].millipede_model_specification_result_input[0].millipede_model_specification_single_matrix_result[cmm.MillipedeModelType.NORMAL_SIGMA_SCALED].summary\r\n\r\nsigma_hit_table.to_csv('MillipedeOutput.csv', index=True)\r\n\r\n```\r\n**Model Output Table:** The output table (`sigma_hit_table`) will look like the following, where each covariate is given a PIP, Beta, Conditional PIP, and Conditional Beta:\r\n\r\n\u003cimg width=\"721\" alt=\"Screenshot 2024-10-08 at 5 17 55 PM\" src=\"https://github.com/user-attachments/assets/9b946fc2-c7dd-43c6-98e7-4bf90864de01\"\u003e\r\n\r\n### STEP 4: Generate Board Plots\r\n\r\n**Board Plots:** Board Plots can be generated by using the board plot function provided in CRISPR-Millipede. Board Plots require the Millipede output table and the presort and wt editing frequencies, which can be generated using the functions below. 
\r\n\r\n```\r\n# Read in the raw encodings for each sorted population and the WT control\r\npaired_merged_raw_encodings = cmm.RawEncodingDataframesExperimentalGroup().read_in_files_constructor(\r\n    enriched_pop_fn_encodings_experiment_list = [\"./encoding_dataframes_target_editor_encodings_rep{}.pkl\"],\r\n    baseline_pop_fn_encodings_experiment_list = [\"./encoding_dataframes_baseline_editor_encodings_rep{}.pkl\"],\r\n    presort_pop_fn_encodings_experiment_list = [\"./encoding_dataframes_presort_editor_encodings_rep{}.pkl\"],\r\n    experiment_labels = [\"ABE8e\"],\r\n    ctrl_pop_fn_encodings=\"./encoding_dataframes_wt_editor_encodings_rep{}.pkl\",\r\n    ctrl_pop_labels=\"WT\",\r\n    reps = [0,1,2],\r\n   )\r\n# Export the average per-variant editing frequencies of each population\r\npaired_merged_raw_encodings_editing_freqs.presort_pop_encoding_editing_per_variant_freq_avg[0].to_csv('presort_editing_freqs_avg_editor.csv')\r\npaired_merged_raw_encodings_editing_freqs.baseline_pop_encoding_editing_per_variant_freq_avg[0].to_csv('baseline_editing_freqs_avg_editor.csv')\r\npaired_merged_raw_encodings_editing_freqs.enriched_pop_encoding_editing_per_variant_freq_avg[0].to_csv('target_editing_freqs_avg_editor.csv')\r\npaired_merged_raw_encodings_editing_freqs.ctrl_pop_encoding_editing_per_variant_freq_avg[0].to_csv('wt_editing_freqs_avg_editor.csv')\r\n\r\n# editor_name (ABE8e or evoCDA), start, end, and AMPLICON are placeholders for your own values\r\ncmm.plot_millipede_boardplot(editor_name, 'MillipedeOutput.csv', 'presort_editing_freqs_avg_editor.csv', 'wt_editing_freqs_avg_editor.csv', start, end, AMPLICON, outputPath=\"Boardplot.svg\")\r\n\r\n```\r\n\u003cimg width=\"668\" alt=\"Screenshot 2024-10-09 at 2 37 18 PM\" src=\"https://github.com/user-attachments/assets/a698298c-3d54-49b6-b94b-cdf3c6d329e4\"\u003e\r\n\r\n### STEP 5: PyDESeq2 based analysis\r\nThe encoded representation of the alleles can also be fed into PyDESeq2 to calculate the differential distribution of each allele across the sorted populations. 
For details, see the [PyDESeq2 documentation](https://pydeseq2.readthedocs.io/en/latest/index.html#).\r\n\r\nPyDESeq2 takes in a count matrix and a design matrix, along with several parameters:\r\n\r\n```\r\nfrom pydeseq2.dds import DeseqDataSet\r\nfrom pydeseq2.default_inference import DefaultInference\r\n\r\ninference = DefaultInference(n_cpus=8)\r\ndds = DeseqDataSet(\r\n    counts=count_df,\r\n    metadata=metadata_df,\r\n    design_factors=\"condition\",\r\n    refit_cooks=True,\r\n    inference=inference,\r\n    # n_cpus=8, # n_cpus can be specified here or in the inference object\r\n)\r\n```\r\n**See [notebooks/STEP5_ABE8e_DESeq2_Demo.ipynb](https://github.com/pinellolab/CRISPR-millipede-target/blob/master/notebooks/STEP5_ABE8e_DESeq2_Demo.ipynb) for instructions on how to format the input matrices and run PyDESeq2.**\r\n\r\nAfter running PyDESeq2, we can visualize a volcano plot of the per-allele scores derived through the model:\r\n\r\n```\r\nimport matplotlib.pyplot as plt\r\n\r\n# Expects a global results_df whose index lists each allele's edits, separated by commas\r\ndef contains_edit_special(edit, edit2):\r\n    colors = []\r\n    sizes = []\r\n    \r\n    subset_df = results_df.copy()\r\n    \r\n    # Color each allele by the edit set it contains; alleles matching neither set are gray and dropped from subset_df\r\n    for index, row in subset_df.iterrows():\r\n        if len(set(edit).intersection(set(index.split(\",\")))) \u003e 0:\r\n            colors.append(\"#00AEEF\")\r\n            sizes.append(40)\r\n        elif len(set(edit2).intersection(set(index.split(\",\")))) \u003e 0:\r\n            colors.append(\"#EC008C\")\r\n            sizes.append(40)\r\n        else:\r\n            colors.append(\"gray\")\r\n            sizes.append(40)\r\n            subset_df.drop(index, inplace=True)\r\n    \r\n    # Create the plot\r\n    plt.figure(figsize=(8, 5))\r\n    \r\n    # Scatter plot\r\n    plt.scatter(results_df['log2FoldChange'] * -1, \r\n                results_df['-10 * log(pvalue)'],\r\n                c=colors, s=sizes, alpha=0.3)\r\n    \r\n    # Use a symmetric log scale (base 2) on the x-axis\r\n    plt.xscale('symlog', base=2)\r\n    \r\n    # Set axis labels and title\r\n    plt.xlabel(\"Log2 Fold Change [CD19+ vs CD19-]\", fontsize=14)\r\n    plt.ylabel(\"-10 * log10(pvalue)\", fontsize=14)\r\n    plt.title(\"Volcano 
Plot\", fontsize=16)\r\n    \r\n    # Set x-axis limits\r\n    plt.xlim(-10, 10)\r\n    \r\n    # Set y-axis limits\r\n    plt.ylim(0, 30)\r\n    ax = plt.gca()  # Get current axis\r\n    ax.spines['top'].set_visible(False)\r\n    ax.spines['right'].set_visible(False)\r\n\r\n    # Adjust the layout before saving so the saved figure matches the displayed one\r\n    plt.tight_layout()\r\n\r\n    # Save the figure\r\n    plt.savefig(\"ABE8e_allelic_analysis_w_MillipedeHits.svg\")\r\n\r\n    # Display the plot\r\n    plt.show()\r\n    \r\n    # Display the subset dataframe\r\n    display(subset_df)\r\n```\r\n\r\nThe parameters `edit` and `edit2` each take a list of edits and selectively color the alleles that contain them: alleles with any edit in `edit` are drawn in cyan (`#00AEEF`), alleles with any edit in `edit2` in magenta (`#EC008C`), and all others in gray:\r\n\r\n```\r\ncontains_edit_special([\"223A\u003eG\", \"230A\u003eG\"], [\"151A\u003eG\"])\r\n```\r\n\r\n![image](https://github.com/user-attachments/assets/32c9451a-bf65-45f4-bf2c-87317ef920fa)\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpinellolab%2Fcrispr-millipede-target","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpinellolab%2Fcrispr-millipede-target","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpinellolab%2Fcrispr-millipede-target/lists"}