{"id":13751514,"url":"https://github.com/evo-design/evo","last_synced_at":"2025-05-09T18:31:33.136Z","repository":{"id":224778093,"uuid":"759144653","full_name":"evo-design/evo","owner":"evo-design","description":"Biological foundation modeling from molecular to genome scale","archived":false,"fork":false,"pushed_at":"2025-02-26T01:43:15.000Z","size":9029,"stargazers_count":1363,"open_issues_count":38,"forks_count":165,"subscribers_count":26,"default_branch":"main","last_synced_at":"2025-04-22T04:50:47.399Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/evo-design.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-17T19:11:33.000Z","updated_at":"2025-04-21T01:44:57.000Z","dependencies_parsed_at":"2024-04-28T04:37:12.719Z","dependency_job_id":"038b69b1-bd1f-4cab-9d4e-7ecdf799e6c9","html_url":"https://github.com/evo-design/evo","commit_stats":null,"previous_names":["evo-design/evo"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evo-design%2Fevo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evo-design%2Fevo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evo-design%2Fevo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evo-design%2Fevo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/evo-design","download_url":"https://c
odeload.github.com/evo-design/evo/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253303024,"owners_count":21886873,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:00:47.204Z","updated_at":"2025-05-09T18:31:33.078Z","avatar_url":"https://github.com/evo-design.png","language":"Jupyter Notebook","funding_links":[],"categories":["Machine Learning Tasks and Models","Ranked by starred repositories"],"sub_categories":["Foundation Models"],"readme":"# Evo: DNA foundation modeling from molecular to genome scale\n\n**We have developed a new model called Evo 2 that extends the Evo 1 model and its ideas to all domains of life. 
Please see [https://github.com/arcinstitute/evo2](https://github.com/arcinstitute/evo2) for more details.**\n\n![Evo](evo.jpg)\n\nEvo is a biological foundation model capable of long-context modeling and design.\nEvo uses the [StripedHyena architecture](https://github.com/togethercomputer/stripedhyena) to enable modeling of sequences at a single-nucleotide, byte-level resolution with near-linear scaling of compute and memory relative to context length.\nEvo has 7 billion parameters and is trained on [OpenGenome](https://huggingface.co/datasets/LongSafari/open-genome), a prokaryotic whole-genome dataset containing ~300 billion tokens.\n\nWe describe Evo in the paper [“Sequence modeling and design from molecular to genome scale with Evo”](https://www.science.org/doi/10.1126/science.ado9336).\n\nWe describe Evo 1.5 in the paper [“Semantic mining of functional _de novo_ genes from a genomic language model”](https://www.biorxiv.org/content/10.1101/2024.12.17.628962). We used the Evo 1.5 model to generate [SynGenome](https://evodesign.org/syngenome/), the first AI-generated genomics database containing over 100 billion base pairs of synthetic DNA sequences.\n\nWe provide the following model checkpoints:\n| Checkpoint Name                        | Description |\n|----------------------------------------|-------------|\n| `evo-1.5-8k-base`   | A model pretrained with 8,192 context obtained by extending the pretraining of `evo-1-8k-base` to process 50% more training data. |\n| `evo-1-8k-base`     | A model pretrained with 8,192 context. We use this model as the base model for molecular-scale finetuning tasks. |\n| `evo-1-131k-base`   | A model pretrained with 131,072 context using `evo-1-8k-base` as the base model. We use this model to reason about and generate sequences at the genome scale. |\n| `evo-1-8k-crispr`   | A model finetuned using `evo-1-8k-base` as the base model to generate CRISPR-Cas systems. 
|\n| `evo-1-8k-transposon`   | A model finetuned using `evo-1-8k-base` as the base model to generate IS200/IS605 transposons. |\n\n## News\n\n**December 17, 2024:** We have found and fixed a bug in the code for Evo model inference affecting package versions from Nov 15-Dec 16, 2024, which has been corrected in release versions 0.3 and above. If you installed the package during this timeframe, please upgrade to correct the issue.\n\n## Contents\n\n- [Setup](#setup)\n  - [Requirements](#requirements)\n  - [Installation](#installation)\n- [Usage](#usage)\n- [HuggingFace](#huggingface)\n- [Together API](#together-api)\n- [colab](https://colab.research.google.com/github/evo-design/evo/blob/main/scripts/hello_evo.ipynb)\n- [Playground wrapper](https://evo.nitro.bio/)\n- [Dataset](#dataset)\n- [Citation](#citation)\n\n## Setup\n\n### Requirements\n\nEvo is based on [StripedHyena](https://github.com/togethercomputer/stripedhyena/tree/main).\n\nEvo uses [FlashAttention-2](https://github.com/Dao-AILab/flash-attention), which may not work on all GPU architectures.\nPlease consult the [FlashAttention GitHub repository](https://github.com/Dao-AILab/flash-attention#installation-and-features) for the current list of supported GPUs.\n\nMake sure to install the correct [PyTorch version](https://pytorch.org/) on your system.\n\n### Installation\n\nYou can install Evo using `pip`\n```bash\npip install evo-model\n```\nor directly from the GitHub source\n```bash\ngit clone https://github.com/evo-design/evo.git\ncd evo/\npip install .\n```\n\nWe recommend that you install the PyTorch library first, before installing all other dependencies (due to dependency issues of the `flash-attn` library; see, e.g., this [issue](https://github.com/Dao-AILab/flash-attention/issues/246)).\n\nOne of our [example scripts](scripts/), demonstrating how to go from generating sequences with Evo to folding proteins ([scripts/generation_to_folding.py](scripts/generation_to_folding.py)), further requires the 
installation of `prodigal`. We have created an [environment.yml](environment.yml) file for this:\n\n```bash\nconda env create -f environment.yml\nconda activate evo-design\n```\n\n## Usage\n\nBelow is an example of how to download Evo and use it locally through the Python API.\n```python\nfrom evo import Evo\nimport torch\n\ndevice = 'cuda:0'\n\nevo_model = Evo('evo-1-131k-base')\nmodel, tokenizer = evo_model.model, evo_model.tokenizer\nmodel.to(device)\nmodel.eval()\n\nsequence = 'ACGT'\ninput_ids = torch.tensor(\n    tokenizer.tokenize(sequence),\n    dtype=torch.int,\n).to(device).unsqueeze(0)\n\nwith torch.no_grad():\n    logits, _ = model(input_ids) # (batch, length, vocab)\n\nprint('Logits: ', logits)\nprint('Shape (batch, length, vocab): ', logits.shape)\n```\nAn example of batched inference can be found in [`scripts/example_inference.py`](scripts/example_inference.py).\n\nWe provide an [example script](scripts/generate.py) for how to prompt the model and sample a set of sequences given the prompt.\n```bash\npython -m scripts.generate \\\n    --model-name 'evo-1-131k-base' \\\n    --prompt ACGT \\\n    --n-samples 10 \\\n    --n-tokens 100 \\\n    --temperature 1. 
\\\n    --top-k 4 \\\n    --device cuda:0\n```\n\nWe also provide an [example script](scripts/score.py) for using the model to score the log-likelihoods of a set of sequences.\n```bash\npython -m scripts.score \\\n    --input-fasta examples/example_seqs.fasta \\\n    --output-tsv scores.tsv \\\n    --model-name 'evo-1-131k-base' \\\n    --device cuda:0\n```\n\n## HuggingFace\n\nEvo is integrated with [HuggingFace](https://huggingface.co/togethercomputer/evo-1-131k-base).\n```python\nfrom transformers import AutoConfig, AutoModelForCausalLM\n\nmodel_name = 'togethercomputer/evo-1-8k-base'\n\nmodel_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, revision=\"1.1_fix\")\nmodel_config.use_cache = True\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    config=model_config,\n    trust_remote_code=True,\n    revision=\"1.1_fix\"\n)\n```\n\n\n## Together API\n\nEvo is available through Together AI with a [web UI](https://api.together.xyz/playground/language/togethercomputer/evo-1-131k-base), where you can generate DNA sequences with a chat-like interface.\n\nFor more detailed or batch workflows, you can call the Together API, as in the simple example below.\n\n\n```python\nimport openai\nimport os\n\n# Read your API key from the TOGETHER_API_KEY environment variable.\nclient = openai.OpenAI(\n  api_key=os.environ['TOGETHER_API_KEY'],\n  base_url='https://api.together.xyz',\n)\n\nchat_completion = client.chat.completions.create(\n  messages=[\n    {\n      \"role\": \"system\",\n      \"content\": \"\"\n    },\n    {\n      \"role\": \"user\",\n      \"content\": \"ACGT\", # Prompt the model with a sequence.\n    }\n  ],\n  model=\"togethercomputer/evo-1-131k-base\",\n  max_tokens=128, # Sample some number of new tokens.\n  logprobs=True\n)\nprint(\n    chat_completion.choices[0].logprobs.token_logprobs,\n    chat_completion.choices[0].message.content\n)\n```\n\n## Dataset\n\nThe OpenGenome dataset for pretraining Evo is available at [Hugging Face 
datasets](https://huggingface.co/datasets/LongSafari/open-genome).\n\n## Citation\n\nPlease cite the following publication when referencing Evo.\n\n```\n@article{nguyen2024sequence,\n   author = {Eric Nguyen and Michael Poli and Matthew G. Durrant and Brian Kang and Dhruva Katrekar and David B. Li and Liam J. Bartie and Armin W. Thomas and Samuel H. King and Garyk Brixi and Jeremy Sullivan and Madelena Y. Ng and Ashley Lewis and Aaron Lou and Stefano Ermon and Stephen A. Baccus and Tina Hernandez-Boussard and Christopher Ré and Patrick D. Hsu and Brian L. Hie },\n   title = {Sequence modeling and design from molecular to genome scale with Evo},\n   journal = {Science},\n   volume = {386},\n   number = {6723},\n   pages = {eado9336},\n   year = {2024},\n   doi = {10.1126/science.ado9336},\n   URL = {https://www.science.org/doi/abs/10.1126/science.ado9336},\n}\n```\n\nPlease cite the following publication when referencing Evo 1.5.\n\n```\n@article {merchant2024semantic,\n   author = {Merchant, Aditi T and King, Samuel H and Nguyen, Eric and Hie, Brian L},\n   title = {Semantic mining of functional de novo genes from a genomic language model},\n   year = {2024},\n   doi = {10.1101/2024.12.17.628962},\n   publisher = {Cold Spring Harbor Laboratory},\n   URL = {https://www.biorxiv.org/content/early/2024/12/18/2024.12.17.628962},\n   journal = {bioRxiv}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevo-design%2Fevo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fevo-design%2Fevo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevo-design%2Fevo/lists"}