https://github.com/cmdoret/clop


Host: GitHub
URL: https://github.com/cmdoret/clop
Owner: cmdoret
License: mit
Created: 2023-12-02T22:07:30.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2023-12-04T12:32:30.000Z (over 2 years ago)
Last Synced: 2025-05-05T21:12:34.514Z (about 1 year ago)
Language: Jupyter Notebook
Size: 198 KB
Stars: 4
Watchers: 1
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# CLOP: Contrastive Language-Omics Pre-training

## Project description

CLOP aims to provide a shared embedding for omics (DNA, RNA, protein) sequences and their functions which can be used to perform downstream analysis at high speed.

It is based on the CLIP architecture, which jointly trains an image transformer and a text transformer to project pictures and captions, respectively, into the same embedding space.
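
For intuition, the joint training objective can be sketched with NumPy as a symmetric contrastive loss over a batch of matched image/text embedding pairs. This is a minimal illustration and not code from this repository: the function names are hypothetical, the embeddings are assumed to come from some pair of encoders, and the temperature is fixed here for simplicity.

```python
import numpy as np


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_style_loss(image_emb: np.ndarray, text_emb: np.ndarray,
                    temperature: float = 0.07) -> float:
    """Symmetric contrastive loss; row i of each matrix is assumed to be a matched pair.

    `temperature` is an illustrative fixed value (an assumption of this sketch).
    """
    # L2-normalise so that the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarities: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.T / temperature
    diag = np.arange(logits.shape[0])

    # Each image should pick its own text (rows) and vice versa (columns).
    loss_image_to_text = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[diag, diag].mean()
    return float((loss_image_to_text + loss_text_to_image) / 2)
```

The loss is low when each image embedding is closest to its own caption embedding and high when the pairs are mismatched, which is what pulls both modalities into one space.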

In CLOP, we use [Frequency Chaos Game Representation](https://www.sciencedirect.com/science/article/pii/S2001037021004736) to represent DNA sequences as a "fingerprint" image of fixed dimension.

This transformation allows us to work with sequences of very different lengths without the limitations of a fixed context window.
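
A minimal sketch of the idea, assuming a k-mer counting formulation of FCGR: every k-mer is mapped to one cell of a `2**k` by `2**k` grid, so the output size depends only on `k` and not on the sequence length. The function name `fcgr`, the corner assignment, and the default `k=4` are assumptions made for illustration and are not taken from this repository.

```python
import numpy as np

# Corner assignment as (x_bit, y_bit). Conventions vary between tools;
# this particular layout is an assumption of the sketch.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence: str, k: int = 4) -> np.ndarray:
    """Return a (2**k, 2**k) matrix of normalised k-mer frequencies."""
    size = 2 ** k
    counts = np.zeros((size, size), dtype=np.float64)
    seq = sequence.upper()

    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers containing ambiguous bases such as N

        # As in the chaos game, the last base selects the coarsest quadrant,
        # so bases are consumed from the end of the k-mer.
        x = y = 0
        for base in reversed(kmer):
            x_bit, y_bit = CORNERS[base]
            x = (x << 1) | x_bit
            y = (y << 1) | y_bit
        counts[y, x] += 1

    total = counts.sum()
    return counts / total if total > 0 else counts


# Sequences of very different lengths yield images of the same shape.
short_image = fcgr("ACGTTGCA" * 5, k=4)
long_image = fcgr("ACGTTGCA" * 5000, k=4)
```

The resulting matrix can be rendered as a grayscale image and passed to an image encoder.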

We directly fine-tune the CLIP transformers using these DNA images and function texts.

## Status

Fine-tuning of the model could not be completed in time. There are two work-in-progress demos:

* A Telegram bot that returns the image representation of input DNA sequences: https://t.me/clip_clop_bot

* A mock interface on GitHub Pages that suggests functions related to an input sequence: https://baudrly.github.io/clop/

## Use cases

The shared embedding can be used directly for various downstream genomic analyses, such as predicting the function of an input sequence, finding closely related sequences with similar functions, or zero-shot classification of DNA sequences (e.g. to detect contaminating sequences).

```mermaid

graph LR

    subgraph func[Function prediction]

        CLOPFUN[CLOP]

    end

    subgraph fuzz[Fuzzy matching]

        CLOPFUZ[CLOP]

        MATCH["🧬🧬🧬"]

    end

    subgraph zero[Zero shot classification]

        CLOPZERO[CLOP]

    end

  AFUN["🧬"] -->|embed| CLOPFUN

  CLOPFUN -->|closest texts| FUN["Antibiotic resistance\nAntibiotic degradation"]

  AFUZ["🧬"] -->|embed| CLOPFUZ

  CLOPFUZ -->|closest dna| MATCH

  AZER["🧬"] -->|embed| CLOPZERO

  DOL["🐬"] -->|embed| CLOPZERO

  BAC["🦠"] -->|embed| CLOPZERO

  CLOPZERO --> |similarity| DOLSIM["🐬, 🧬"]

  CLOPZERO --> |similarity| BACSIM["🦠, 🧬"]

  BACSIM --> MAX

  DOLSIM --> MAX

  MAX --> SELECT["🦠"]

```
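
The zero-shot branch of the diagram can be sketched as a nearest-neighbour lookup by cosine similarity in the shared space. This is an illustration under stated assumptions: the embeddings below are hand-written placeholder vectors rather than outputs of a trained CLOP model, and the function names are hypothetical.

```python
import numpy as np


def cosine_similarity(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of `candidates`."""
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query


def zero_shot_classify(seq_embedding: np.ndarray,
                       label_embeddings: np.ndarray,
                       labels: list[str]) -> tuple[str, np.ndarray]:
    """Return the label whose embedding is most similar to the sequence embedding."""
    similarities = cosine_similarity(seq_embedding, label_embeddings)
    return labels[int(np.argmax(similarities))], similarities


# Placeholder embeddings (assumed values, for illustration only).
labels = ["dolphin", "bacteria"]
label_embeddings = np.array([[0.0, 1.0, 0.0],
                             [1.0, 0.0, 0.0]])
seq_embedding = np.array([0.9, 0.1, 0.0])

best_label, similarities = zero_shot_classify(seq_embedding, label_embeddings, labels)
```

Function prediction and fuzzy matching follow the same pattern: only the candidate set changes, from class labels to function texts or to other embedded sequences.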

## Training data

For this demo, we restricted the training set to human transcript sequences (GRCh38 assembly) and their functional annotations, which can be downloaded from https://www.ncbi.nlm.nih.gov/genome/guide/human/

We further subsampled 50,000 sequence-annotation pairs for the fine-tuning experiment.

## Acknowledgement

This project originated at the 2023 SDSC-hackathon on Generative AI. It was initiated by the team Swiss-Androsace (see members in the [LICENSE](./LICENSE) copyright notice).
