# Find central dogma again
Multilingual transfer ability, i.e., how well a model fine-tuned on one source language performs
on other languages, has been studied extensively for multilingual pre-trained models. Whether
such capability transfer also occurs between natural language and gene sequences/languages,
however, remains underexplored. This study addresses that gap by drawing inspiration from the
sentence-pair classification task used to evaluate sentence similarity in natural language. We
constructed two analogous tasks: DNA-pair classification (DNA sequence similarity) and
DNA-protein-pair classification (gene coding determination). These tasks were designed to test
whether capabilities transfer from natural language to gene sequences. Even a small pre-trained
model such as GPT-2-small, pre-trained on English, reached 78% accuracy on the DNA-pair
classification task after fine-tuning on English sentence-pair classification data (XTREME
PAWS-X). With a BERT model pre-trained on multilingual text, accuracy reached 82%. On the more
complex DNA-protein-pair classification task, however, the models' output was barely
distinguishable from random. These experiments suggest that capabilities may transfer from
natural language to genetic language, but further task testing is needed to confirm this.
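
For orientation, here is a minimal sketch of the transfer setup using Hugging Face Transformers: fine-tune GPT-2-small on English PAWS-X sentence pairs, then reuse the same classification head on DNA pairs. The hyperparameters and training configuration are illustrative assumptions, not a reproduction of the notebooks in this repository.

```python
# Sketch: fine-tune GPT-2-small on English sentence-pair classification
# (PAWS-X), so the resulting classifier can later be scored on DNA pairs.
# Hyperparameters here are illustrative assumptions.
from transformers import (GPT2Tokenizer, GPT2ForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

pawsx = load_dataset("paws-x", "en")  # English sentence pairs, label in {0, 1}

def encode(batch):
    # Sentence pairs go through the tokenizer's standard pair encoding;
    # at test time the same call is reused with DNA sequences as strings.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)

train = pawsx["train"].map(encode, batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train).train()
```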

# Experiment results

Accuracy on the English/French/German/Chinese sentence-pair test sets and on the DNA-pair classification task:

| Base model   | Pre-training | Fine-tuning | Test EN | Test FR | Test DE | Test ZH | Test DNA |
|--------------|--------------|-------------|---------|---------|---------|---------|----------|
| gpt2-small   | en           | en          | 0.92    | 0.74    | 0.73    | 0.61    | **0.78** |
| gpt2-medium  | en           | en          | 0.92    | 0.80    | 0.76    | 0.62    | 0.55     |
| gpt2-large   | en           | en          | 0.94    | 0.81    | 0.79    | 0.66    | 0.63     |
| bert         | en           | en          | 0.91    | 0.77    | 0.73    | 0.52    | 0.54     |
| bert         | multilingual | en          | 0.94    | 0.86    | 0.83    | 0.77    | **0.82** |
| gpt2-small-1 | en+DNA       | en          | 0.90    | 0.74    | 0.72    | 0.59    | 0.48     |
| gpt2-small-2 | en+DNA       | en          | 0.76    | 0.59    | 0.60    | 0.56    | 0.60     |

* `dna_150.json`: DNA-pair classification data (see the evaluation sketch after this list)
* `dna_protein_150.json`: DNA-protein-pair classification data
* `gpt2_small_pretrain_en_finetune_en.ipynb`: code for GPT-2-small
* `gpt2_medium_pretrain_en_finetune_en.ipynb`: code for GPT-2-medium
* `gpt2_large_pretrain_en_finetune_en.ipynb`: code for GPT-2-large
* `bert_pretrain_en_finetune_en.ipynb`: code for BERT-base
* `bert_multi_pretrain_en_finetune_en.ipynb`: code for multilingual BERT
* `gpt2_small_pretrain_en_dna_finetune_en.ipynb`: code for gpt2-small-2
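
Continuing the fine-tuning sketch above (reusing its `model` and `tokenizer`), evaluation on the DNA pairs might look like the following. The JSON layout of `dna_150.json`, a list of records with `seq1`, `seq2`, and `label` fields, is an assumption; adjust the keys to match the actual file.

```python
# Score the English-fine-tuned classifier on DNA sequence pairs, reusing
# `model` and `tokenizer` from the fine-tuning sketch above. The JSON
# field names ("seq1", "seq2", "label") are assumed, not confirmed.
import json

import torch

with open("dna_150.json") as f:
    pairs = json.load(f)

model.eval()
correct = 0
for rec in pairs:
    # DNA sequences are fed through the same pair encoding used for
    # natural-language sentences; no DNA-specific tokenization is added.
    enc = tokenizer(rec["seq1"], rec["seq2"], truncation=True,
                    max_length=128, return_tensors="pt").to(model.device)
    with torch.no_grad():
        pred = model(**enc).logits.argmax(dim=-1).item()
    correct += int(pred == rec["label"])

print(f"DNA-pair accuracy: {correct / len(pairs):.2f}")
```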

# Paper
```bibtex
@misc{liang2024linguistsbetterunderstanddna,
      title={Can linguists better understand DNA?},
      author={Wang Liang},
      year={2024},
      eprint={2412.07678},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.07678},
}
```