https://github.com/xrsrke/progen
Generating new proteins using language models
- Host: GitHub
- URL: https://github.com/xrsrke/progen
- Owner: xrsrke
- License: mit
- Created: 2023-02-03T06:02:30.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-02-25T03:49:38.000Z (over 2 years ago)
- Last Synced: 2025-03-27T11:13:55.832Z (2 months ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 3.93 MB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
ProGen - 🚧 WORK IN PROGRESS 🚧
================
Paper: [Large language models generate functional protein sequences
across diverse
families](https://www.nature.com/articles/s41587-022-01618-2)

I am currently working towards implementing this paper. Check out my learning
progress here: https://twitter.com/xariusrke/status/1621403313651728386

### TODO
- \[DONE\] fast2dict
- Extract multiple properties of a protein sequence
- Tokenize the control tag
- Generate training data from dataset

### Questions
**Tokenizing**

- Does the control tag use the same tokenizer as the protein sequence?
- Is a control tag represented by a single token or by multiple tokens? If multiple, is the model conditioned on all of them?
- How are control tags separated from one another?
- Can the model predict the next control tag?

**Training**
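A minimal sketch of how one training example might be assembled, assuming the simplest answers to the tokenizing questions above: each control tag is a single token in a vocabulary shared with the amino acids, and the tags are prepended to the sequence. The tag names (`lysozyme`, `hydrolase`), the special tokens, and the helper functions are hypothetical and are not taken from this repo.

```python
# Hypothetical sketch: the vocabulary layout and all names are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_vocab(control_tags):
    """Map special tokens, control tags, and amino acids to integer ids."""
    tokens = ["<pad>", "<eos>"]
    tokens += [f"<{tag}>" for tag in control_tags]  # one token per tag (assumption)
    tokens += list(AMINO_ACIDS)
    return {token: idx for idx, token in enumerate(tokens)}


def encode_example(control_tags, sequence, vocab):
    """Prepend the control-tag tokens, then the residues, then <eos>."""
    tag_ids = [vocab[f"<{tag}>"] for tag in control_tags]
    residue_ids = [vocab[residue] for residue in sequence]
    return tag_ids + residue_ids + [vocab["<eos>"]]


def make_training_pair(token_ids):
    """Shift by one position so each token is predicted from its prefix."""
    return token_ids[:-1], token_ids[1:]


vocab = build_vocab(["lysozyme", "hydrolase"])
ids = encode_example(["lysozyme"], "MKV", vocab)
inputs, targets = make_training_pair(ids)
```

Because the tags share the vocabulary with the residues in this sketch, the tag positions are ordinary next-token targets, so whether the model should be trained to predict them becomes a question of which positions the loss is masked on.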
### Resources
I implemented `ProGen` using these resources