{"id":18598224,"url":"https://github.com/cmdoret/clop","last_synced_at":"2025-05-05T21:13:00.088Z","repository":{"id":210464524,"uuid":"726617122","full_name":"cmdoret/clop","owner":"cmdoret","description":null,"archived":false,"fork":false,"pushed_at":"2023-12-04T12:32:30.000Z","size":203,"stargazers_count":4,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-05T21:12:34.514Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cmdoret.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-12-02T22:07:30.000Z","updated_at":"2024-04-06T00:00:26.000Z","dependencies_parsed_at":"2023-12-02T23:22:39.735Z","dependency_job_id":"1a6fdf49-e9d4-40a0-a319-1fcd54d81610","html_url":"https://github.com/cmdoret/clop","commit_stats":null,"previous_names":["cmdoret/clop"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmdoret%2Fclop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmdoret%2Fclop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmdoret%2Fclop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmdoret%2Fclop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cmdoret","download_url":"https://codeload.github.com/cmdoret/clop/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252577022,"owners_count":217
70721,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T01:31:41.554Z","updated_at":"2025-05-05T21:13:00.054Z","avatar_url":"https://github.com/cmdoret.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CLOP: Contrastive Language-Omics Pre-training\n\n## Project description\n\nCLOP aims to provide a shared embedding for omics (DNA, RNA, protein) sequences and their functions, which can be used to perform downstream analyses at high speed.\n\nIt is based on the CLIP architecture, which jointly trains an image transformer and a text transformer to project pictures and captions, respectively, into the same embedding space.\n\nIn CLOP, we use [Frequency Chaos Game Representation](https://www.sciencedirect.com/science/article/pii/S2001037021004736) to represent DNA sequences as a \"fingerprint\" image of fixed dimension.\n\nThis transformation allows us to work with sequences of very different lengths without limitations related to the context window.\n\nWe directly fine-tune the CLIP transformers using these DNA images and function texts.\n\n## Status\n\nThe fine-tuning of the model could not be completed in time. Two work-in-progress demos are available:\n* A Telegram bot that returns the image representation of input DNA sequences: https://t.me/clip_clop_bot\n* A mock interface on GitHub Pages that proposes functions related to an input sequence: https://baudrly.github.io/clop/\n\n\n## Use cases\n\nThe shared embedding can be used directly for various downstream genomic analyses, such as predicting the function of an input 
sequence, finding closely related sequences with similar functions, or performing zero-shot classification of DNA sequences (e.g. to detect contaminating sequences).\n\n```mermaid\n\ngraph LR\n\n    subgraph func[Function prediction]\n        CLOPFUN[CLOP]\n    end\n    subgraph fuzz[Fuzzy matching]\n        CLOPFUZ[CLOP]\n        MATCH[\"🧬🧬🧬\"]\n    end\n    subgraph zero[Zero-shot classification]\n        CLOPZERO[CLOP]\n    end\n  AFUN[\"🧬\"] --\u003e|embed| CLOPFUN\n  CLOPFUN --\u003e|closest texts| FUN[\"Antibiotic resistance\\nAntibiotic degradation\"]\n  AFUZ[\"🧬\"] --\u003e|embed| CLOPFUZ\n  CLOPFUZ --\u003e|closest dna| MATCH\n  AZER[\"🧬\"] --\u003e|embed| CLOPZERO\n  DOL[\"🐬\"] --\u003e|embed| CLOPZERO\n  BAC[\"🦠\"] --\u003e|embed| CLOPZERO\n  CLOPZERO --\u003e |similarity| DOLSIM[\"🐬, 🧬\"]\n  CLOPZERO --\u003e |similarity| BACSIM[\"🦠, 🧬\"]\n  BACSIM --\u003e MAX\n  DOLSIM --\u003e MAX\n  MAX --\u003e SELECT[\"🦠\"]\n\n```\n\n## Training data\n\nFor this demo, we restricted the training set to human transcript sequences (version GRCh38) and their functional annotations, available to download from https://www.ncbi.nlm.nih.gov/genome/guide/human/\n\nWe further subsampled 50,000 sequence-annotation pairs for the fine-tuning experiment.\n\n## Acknowledgement\n\nThis project originated at the 2023 SDSC-hackathon on Generative AI. It was initiated by the team Swiss-Androsace (see members in the [LICENSE](./LICENSE) copyright notice).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcmdoret%2Fclop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcmdoret%2Fclop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcmdoret%2Fclop/lists"}