{"id":28753247,"url":"https://github.com/google-deepmind/nuclease_design","last_synced_at":"2025-06-17T00:39:20.175Z","repository":{"id":227943807,"uuid":"760944063","full_name":"google-deepmind/nuclease_design","owner":"google-deepmind","description":"ML-guided enzyme engineering","archived":false,"fork":false,"pushed_at":"2025-05-28T19:45:17.000Z","size":39893,"stargazers_count":63,"open_issues_count":2,"forks_count":17,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-05-28T20:43:55.022Z","etag":null,"topics":["engineering","learning","machine","protein"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google-deepmind.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-02-21T00:14:12.000Z","updated_at":"2025-05-26T06:17:45.000Z","dependencies_parsed_at":"2024-04-23T21:11:11.686Z","dependency_job_id":"d2c1b833-e19b-4dca-ac45-284c34695fc8","html_url":"https://github.com/google-deepmind/nuclease_design","commit_stats":null,"previous_names":["google-deepmind/nuclease_design"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/google-deepmind/nuclease_design","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-deepmind%2Fnuclease_design","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-deepmind%2Fnuclease_design/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-deepmind%2Fnuclease_design/r
eleases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-deepmind%2Fnuclease_design/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google-deepmind","download_url":"https://codeload.github.com/google-deepmind/nuclease_design/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-deepmind%2Fnuclease_design/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260268635,"owners_count":22983601,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["engineering","learning","machine","protein"],"created_at":"2025-06-17T00:39:18.706Z","updated_at":"2025-06-17T00:39:20.143Z","avatar_url":"https://github.com/google-deepmind.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# ML-Guided Directed Evolution for Engineering a Better Nuclease Enzyme\n\nThis repository accompanies the paper: [Engineering highly active and diverse nuclease enzymes by combining machine learning and ultra-high-throughput screening](https://doi.org/10.1101/2024.03.21.585615)\n\n\u003cdiv style=\"width:70%; margin: auto;\"\u003e\n\u003cimg src=\"images/overview_wide.png\"\u003e\n\u003c/div\u003e\n\n## Analyzing our enzyme activity dataset\nYou can use our dataset of estimated enzyme activity for 55,760 NucB variants to develop new machine learning models or to generate new insights about NucB. 
\n\nA simple notebook has been provided to load and analyze the data: [\u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\"\u003e](https://colab.research.google.com/github/google-deepmind/nuclease_design/blob/main/notebooks/plot_landscape_analysis.ipynb\n)\n\n\n\u003cdiv style=\"width:70%; margin: auto;\"\u003e\n\u003cimg src=\"images/landscape.svg\" width=\"900\"\u003e\n\u003c/div\u003e\n\n## Reproducing the paper's analysis\n\nAll figures and tables in the paper can be reproduced by notebooks in [notebooks/](https://github.com/google-deepmind/nuclease_design/tree/main/notebooks). \n\nEach notebook can be run as-is, since it loads pre-computed enrichment factor data from GCS (see below). To regenerate the analysis from the raw NGS count data, run\n[get_enrichment_factor_data.ipynb](https://github.com/google-deepmind/nuclease_design/tree/main/notebooks/get_enrichment_factor_data.ipynb)\nwith a local value of `LOCAL_OUTPUT_DATA_DIR`.\n\nThese notebooks, and the library code they call, can be used to dig deeper into our results or to provide a jumping-off point for creating your own genotype-phenotype dataset based on count data from high-throughput sorting.\n\n## Analyzing our libraries and models\nSome useful starting points:\n\n* Analyze the hit rates of various library design methods [\u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\"\u003e](https://colab.research.google.com/github/google-deepmind/nuclease_design/blob/main/notebooks/plot_hit_rates.ipynb\n)\n* Analyze the diversity of hits from these libraries [\u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\"\u003e](https://colab.research.google.com/github/google-deepmind/nuclease_design/blob/main/notebooks/plot_diversity.ipynb\n)\n* Play with the CNN model used for the final round of sequence design [\u003cimg 
src=\"https://colab.research.google.com/assets/colab-badge.svg\"\u003e](https://colab.research.google.com/github/google-deepmind/nuclease_design/blob/main/notebooks/analyze_cnn.ipynb\n)\n\n## Data\nAll data is available in a Google Cloud Storage (GCS) [bucket](https://storage.googleapis.com/nuclease_design). We don't recommend directly downloading it; the notebooks above use helper functions for loading from the bucket.\n\nThe bucket contains the following sub-directories:\n\n*   `raw_count_data`: raw NGS count data for pre-sort and post-sort populations.\n\n*   `processed_fiducial_data`: enrichment factors for synonyms of various\n    'fiducial' sequences. Each row represents a distinct DNA sequence that\n    translates to the same amino acid sequence.\n\n*   `processed_data`: enrichment factors computed from the raw count data and\n    the processed fiducial data. Each row represents a unique amino acid\n    sequence. For each row and each fiducial, the row is assigned a p-value for\n    observing its enrichment factor under the null distribution of enrichment\n    factors from the fiducial.\n\n*   `processed_data/landscape.csv`: A single file that merges data from all 4\n    rounds of experiments and provides multi-class catalytic activity\n    labels for 56K distinct amino acid sequences.\n\n*   `plate_data`: Data from the low-throughput purified protein experiments used\n    to confirm hits.\n\n*   `library_designs`: A mapping from each amino acid sequence to the list of the\n    names of the sub-libraries (corresponding to different sequence design\n    methods) that proposed it. 
Note that some sequences were proposed by\n    multiple methods.\n\n*   `analysis`: Data used for creating certain tables and results in the paper\n    that require expensive computations, such as clustering hits in order to\n    quantify diversity.\n\n*   `alignments`: A multiple sequence alignment used to fit our VAE model.\n\n## Running unit tests\n\nThe notebooks directly install this package from GitHub, so no installation is\nnecessary. However, you can locally install this package in order to run tests using the following commands:\n\nNote that our package requires **python \u003e= 3.10.**\n\n```\nvenv=/tmp/nuclease_design_venv\npython3 -m venv $venv\nsource $venv/bin/activate\npip install -e .\npython -m pytest nuclease_design/*test.py\n```\n\n\n## Citing this work\n\nPlease cite the accompanying [paper](https://doi.org/10.1101/2024.03.21.585615):\n```\n@article {thomasbelanger2024,\n\tauthor = {Neil Thomas and David Belanger and Chenling Xu and Hanson Lee and Kat Hirano and Kosuke Iwai and Vanja Polic and Kendra D Nyberg and Kevin Hoff and Lucas Frenz and Charlie A Emrich and Jun W Kim and Mariya Chavarha and Abi Ramanan and Jeremy J Agresti and Lucy J Colwell},\n\ttitle = {Engineering of highly active and diverse nuclease enzymes by combining machine learning and ultra-high-throughput screening},\n\tyear = {2024},\n\tdoi = {10.1101/2024.03.21.585615},\n\tjournal = {bioRxiv}\n}\n```\n\n\n\n## License and disclaimer\n\nCopyright 2023 DeepMind Technologies Limited\n\nAll software is licensed under the Apache License, Version 2.0 (Apache 2.0);\nyou may not use this file except in compliance with the Apache 2.0 license.\nYou may obtain a copy of the Apache 2.0 license at:\nhttps://www.apache.org/licenses/LICENSE-2.0\n\nAll other materials are licensed under the Creative Commons Attribution 4.0\nInternational License (CC-BY). 
You may obtain a copy of the CC-BY license at:\nhttps://creativecommons.org/licenses/by/4.0/legalcode\n\nUnless required by applicable law or agreed to in writing, all software and\nmaterials distributed here under the Apache 2.0 or CC-BY licenses are\ndistributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,\neither express or implied. See the licenses for the specific language governing\npermissions and limitations under those licenses.\n\nThis is not an official Google product.\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-deepmind%2Fnuclease_design","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle-deepmind%2Fnuclease_design","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-deepmind%2Fnuclease_design/lists"}