https://github.com/xrsrke/progen

Generating new proteins using language models

Host: GitHub
URL: https://github.com/xrsrke/progen
Owner: xrsrke
License: mit
Created: 2023-02-03T06:02:30.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2023-02-25T03:49:38.000Z (over 2 years ago)
Last Synced: 2025-03-27T11:13:55.832Z (2 months ago)
Language: Jupyter Notebook
Homepage:
Size: 3.93 MB
Stars: 4
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


ProGen - 🚧 WORK IN PROGRESS 🚧
================

![image.png](index_files/figure-commonmark/824a0f6d-1-image.png)

Paper: [Large language models generate functional protein sequences across diverse families](https://www.nature.com/articles/s41587-022-01618-2)

I am currently working toward implementing this paper. Follow my learning progress here: https://twitter.com/xariusrke/status/1621403313651728386

### TODO

- \[DONE\] fast2dict

- Extract multiple properties of a protein sequence

- Tokenize the control tag

- Generate training data from dataset
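
As a rough illustration of the first TODO item, here is a minimal sketch of turning FASTA text into a dictionary. It assumes `fast2dict` refers to a FASTA-to-dict conversion, which the README does not confirm; the function name `fasta_to_dict` and the example records are hypothetical and are not taken from this repository.

```python
from typing import Dict, Iterable, Optional


def fasta_to_dict(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its full sequence.

    Hypothetical helper: the repository's own implementation may differ.
    """
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; the header is everything after '>'.
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[header] += line
    return records


# Made-up example input, for illustration only.
example = """>seq1 hypothetical protein
MKTAYIAK
QRQISFVK
>seq2
GAVLIPFW
""".splitlines()

records = fasta_to_dict(example)
```

Keeping the full header as the key leaves room for the second TODO item, since properties of a protein could later be parsed out of that string.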

### Questions

**Tokenizing**

- Does the control tag use the same tokenizer as the protein sequence?

- Is a control tag represented by a single token or by multiple tokens? If multiple tokens, is the model conditioned on all of them?

- How is each control tag separated from the next?

- Can the model predict the next control tag?
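
To make the tokenizing questions concrete, the sketch below shows one possible scheme, not necessarily the one used in the paper: each control tag gets exactly one ID in a vocabulary shared with the amino acids, and tag tokens are prepended to the sequence. The tag names, the special tokens `<pad>` and `<eos>`, and the functions `build_vocab` and `encode` are all assumptions made for illustration.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_vocab(control_tags: List[str]) -> Dict[str, int]:
    """Shared vocabulary: special tokens, then control tags, then residues."""
    tokens = ["<pad>", "<eos>"]
    tokens += [f"<{tag}>" for tag in control_tags]  # one token per tag (assumption)
    tokens += list(AMINO_ACIDS)
    return {tok: i for i, tok in enumerate(tokens)}


def encode(tags: List[str], sequence: str, vocab: Dict[str, int]) -> List[int]:
    """Prepend one ID per control tag, then one ID per residue, then <eos>."""
    ids = [vocab[f"<{tag}>"] for tag in tags]
    ids += [vocab[aa] for aa in sequence]
    ids.append(vocab["<eos>"])
    return ids


# Hypothetical tags, for illustration only.
vocab = build_vocab(["family:lysozyme", "organism:phage"])
ids = encode(["family:lysozyme"], "MKT", vocab)
print(ids)  # [2, 14, 12, 20, 1]
```

Under this scheme the tags need no separator, because each one is a single token, and a model trained on such sequences could in principle assign probability to tag tokens just as it does to residues. Whether that matches the paper remains an open question in this list.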

**Training**

### Resources

I implemented `ProGen` using these resources
