{"id":20801142,"url":"https://github.com/gcorso/neuroseed","last_synced_at":"2025-05-07T00:12:12.779Z","repository":{"id":37365036,"uuid":"362136079","full_name":"gcorso/NeuroSEED","owner":"gcorso","description":"Implementation of Neural Distance Embeddings for Biological Sequences (NeuroSEED) in PyTorch (NeurIPS 2021) ","archived":false,"fork":false,"pushed_at":"2023-10-14T01:15:09.000Z","size":1443,"stargazers_count":72,"open_issues_count":4,"forks_count":18,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-31T04:23:58.370Z","etag":null,"topics":["bioinformatics","biological-sequences","hierarchical-clustering","machine-learning","multiple-sequence-alignment","neurips-2021","pytorch"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2109.09740","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gcorso.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-04-27T14:11:39.000Z","updated_at":"2024-11-30T11:25:19.000Z","dependencies_parsed_at":"2022-07-09T07:01:12.174Z","dependency_job_id":null,"html_url":"https://github.com/gcorso/NeuroSEED","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gcorso%2FNeuroSEED","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gcorso%2FNeuroSEED/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gcorso%2FNeuroSEED/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gcorso%2FNeuroSEED/manifests","owner_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/owners/gcorso","download_url":"https://codeload.github.com/gcorso/NeuroSEED/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252788532,"owners_count":21804284,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","biological-sequences","hierarchical-clustering","machine-learning","multiple-sequence-alignment","neurips-2021","pytorch"],"created_at":"2024-11-17T18:16:51.725Z","updated_at":"2025-05-07T00:12:12.750Z","avatar_url":"https://github.com/gcorso.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Neural Distance Embeddings for Biological Sequences\n\nOfficial implementation of Neural Distance Embeddings for Biological Sequences (NeuroSEED) in PyTorch, published at NeurIPS 2021 ([preprint](https://arxiv.org/abs/2109.09740)). NeuroSEED is a novel framework to embed biological sequences in geometric vector spaces.\n\n![diagram](./tutorial/cover.png)\n\nNote: unfortunately, due to my move between institutions, the download scripts are broken and the files are no longer available on the original Drive. I have reuploaded them [here](https://drive.google.com/drive/folders/1tmXtsUV3MwxIDr-NB8Uk78IoCkBZtiu_?usp=sharing), but reach out if you believe some files are missing.\n\n\n## Overview\n\nThe repository is organised into four main folders, one for each of the tasks analysed. Each of these contains the scripts and models used for the task, as well as instructions on how to run them and the tuned hyperparameters found. 
\n\n- `edit_distance` for the *edit distance approximation* task\n- `closest_string` for the *closest string retrieval* task\n- `hierarchical_clustering` for the *hierarchical clustering* task, further divided into `relaxed` and `unsupervised` for the two approaches explored\n- `multiple_alignment` for the *multiple sequence alignment* task, further divided into `guide_tree` and `steiner_string`\n- `util` contains a series of utility routines shared across all the tasks\n- `tests` contains a wide range of tests for the various components of the repository\n\n## Installation\n\nCreate a virtual (or conda) environment and install the dependencies:\n\n```\npython3 -m venv neuroseed\nsource neuroseed/bin/activate\npip install -r requirements.txt\n```\n\nThen build the `mst` and `unionfind` packages used for the hierarchical clustering:\n\n```\ncd hierarchical_clustering/relaxed/mst; python setup.py build_ext --inplace; cd ../../..\ncd hierarchical_clustering/relaxed/unionfind; python setup.py build_ext --inplace; cd ../../..\n```\n\n## Reference\n\n```\n@article{corso2021neuroseed,\n  title={Neural Distance Embeddings for Biological Sequences},\n  author={Corso, Gabriele and Ying, Rex and P{\\'a}ndy, Michal and Veli{\\v{c}}kovi{\\'c}, Petar and Leskovec, Jure and Li{\\`o}, Pietro},\n  journal={Advances in Neural Information Processing Systems},\n  year={2021}\n}\n```\n\n\n## License\n\nMIT\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgcorso%2Fneuroseed","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgcorso%2Fneuroseed","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgcorso%2Fneuroseed/lists"}