{"id":15600939,"url":"https://github.com/lucidrains/progen","last_synced_at":"2025-08-19T14:32:36.796Z","repository":{"id":38790456,"uuid":"375389377","full_name":"lucidrains/progen","owner":"lucidrains","description":"Implementation and replication of ProGen, Language Modeling for Protein Generation, in Jax","archived":false,"fork":false,"pushed_at":"2021-09-08T20:28:10.000Z","size":209,"stargazers_count":111,"open_issues_count":3,"forks_count":17,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-12-09T18:11:46.041Z","etag":null,"topics":["artificial-intelligence","deep-learning","proteins"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lucidrains.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-06-09T14:44:17.000Z","updated_at":"2024-12-02T20:40:15.000Z","dependencies_parsed_at":"2022-07-09T13:30:37.445Z","dependency_job_id":null,"html_url":"https://github.com/lucidrains/progen","commit_stats":null,"previous_names":[],"tags_count":37,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fprogen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fprogen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fprogen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fprogen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lucidrains","download_url":"https://codeload.github.com/lucidrains/progen/tar.gz/refs/heads/main","h
ost":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230359935,"owners_count":18214157,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","deep-learning","proteins"],"created_at":"2024-10-03T02:09:48.470Z","updated_at":"2024-12-19T01:06:32.275Z","avatar_url":"https://github.com/lucidrains.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## ProGen - (wip)\n\nImplementation and replication of \u003ca href=\"https://arxiv.org/abs/2004.03497\"\u003eProGen\u003c/a\u003e, Language Modeling for Protein Generation, in PyTorch and Jax (the weights will be made easily transferable between the two). You can think of this as GPT for protein sequences.\n\n## Requirements\n\nWe are going to use \u003ca href=\"https://github.com/python-poetry/poetry\"\u003ePoetry\u003c/a\u003e for managing the dependencies for this project. So first install it using the \u003ca href=\"https://github.com/python-poetry/poetry#osx--linux--bashonwindows-install-instructions\"\u003eone-liner bash command\u003c/a\u003e.\n\nNext, git clone the project and install the dependencies\n\n```bash\n$ git clone git@github.com:lucidrains/progen\n$ cd progen\n$ poetry install\n```\n\nFor training on GPUs, you may need to rerun pip install with the correct CUDA version. You can follow the instructions \u003ca href=\"https://github.com/google/jax#pip-installation-gpu-cuda\"\u003ehere\u003c/a\u003e\n\n\n```bash\n# ex. 
CUDA 11.1\n$ pip install --upgrade \"jax[cuda111]\" -f https://storage.googleapis.com/jax-releases/jax_releases.html\n```\n\nFor running any scripts, you'll notice that it will always be prepended with `poetry run`\n\n## Usage\n\n```python\nfrom jax import random\nfrom haiku import PRNGSequence\nfrom progen_transformer import ProGen\n\nmodel = ProGen(\n    num_tokens = 256,\n    dim = 512,\n    seq_len = 1024,\n    window_size = 256,       # local attention window size\n    depth = 12,              # depth\n    heads = 8,               # attention heads\n    dim_head = 64,           # dimension per head\n    ff_glu = True,           # use GLU in feedforward, from Noam's paper\n    global_mlp_depth = 2     # last N global gmlp layers\n)\n\nrng = PRNGSequence(42)\nseq = random.randint(next(rng), (1024,), 0, 256)\n\nparams = model.init(next(rng), seq)\nlogits = model.apply(params, next(rng), seq) # (1024, 256)\n```\n\n## Training\n\nDownload Uniref50 from \u003ca href=\"https://www.uniprot.org/downloads\"\u003eUniProt\u003c/a\u003e and place `uniref50.fasta` in the root directory\n\n```bash\n$ poetry run python generate_data.py\n```\n\nYou should see a lot of green if everything succeeds. Then\n\n\n```bash\n$ poetry run python train.py\n```\n\nBy default, the script will checkpoint and resume automatically, but if you wish to clear your progress and restart, just add a `--new` flag\n\n```bash\n$ poetry run python train.py --new\n```\n\nModel checkpoints will be saved periodically to `./ckpts`\n\nFinally, to sample from your checkpoint, just do\n\n```bash\n$ poetry run python sample.py\n```\n\nYou can pass a prime with `--prime`. 
You can either pass the annotations, followed by `#`, to get the generated sequence, or pass the sequence (also followed by `#`) and get the generated annotations\n\n```bash\n$ poetry run python sample.py --prime \"[Tax=Mammalia] #\"\n```\n\n## Mixed Precision\n\nTo use mixed precision training, you'll need to install the latest Haiku with the following command\n\n```bash\n$ pip install git+https://github.com/deepmind/dm-haiku\n```\n\nThen make sure to set the `--mixed_precision` flag when invoking the training script\n\n```bash\n$ poetry run python train.py --mixed_precision\n```\n\n## Todo\n\n- [ ] model parallelism with pjit\n- [ ] join in GO annotations with pandas dataframe\n- [ ] set up annotation -\u003e template string system, all configuration driven, find easy way to test. offer two types of annotations, one parsed from uniref descriptions, the other from GO annotation presence\n- [ ] add multiple data sources (check out trembl)\n- [ ] when sampling, prime with entire sequence prior to the pound sign (intersection of sequence and annotation)\n- [ ] utilize all cores when processing data\n- [ ] save all training settings in the checkpoints too\n- [x] bfloat16 on xla\n- [x] resume from correct place in tfrecord even if batch size is changed in between runs, display number of sequences processed\n- [x] train compressed gzip tfrecords from google cloud storage path\n- [x] remove tfrecord package and just use tfrecordwriter with gzip\n- [x] generate validation tfrecords\n- [x] checkpoint and resume from a google cloud storage path\n- [x] use jinja2 for wandb html sample logging\n- [x] manage experimental tracker state, and also allow ability to turn it off by piping to noop\n- [x] add a confirmation before clearing a folder for --new run\n- [x] engineer mask in cross entropy loss so that padding can be reused as end-of-string token\n- [x] flip seq # annotation order with prob set in config\n- [x] keep N last checkpoints\n\n## Acknowledgements\n\nMany thanks go 
out to \u003ca href=\"https://github.com/kingoflolz\"\u003eBen Wang\u003c/a\u003e, who showed this type of large-scale training can be achieved with \u003ca href=\"https://github.com/kingoflolz/mesh-transformer-jax\"\u003eGPT-J\u003c/a\u003e\n\n## Citations\n\n```bibtex\n@misc{madani2020progen,\n    title   = {ProGen: Language Modeling for Protein Generation}, \n    author  = {Ali Madani and Bryan McCann and Nikhil Naik and Nitish Shirish Keskar and Namrata Anand and Raphael R. Eguchi and Po-Ssu Huang and Richard Socher},\n    year    = {2020},\n    eprint  = {2004.03497},\n    archivePrefix = {arXiv},\n    primaryClass = {q-bio.BM}\n}\n```\n\n```bibtex\n@misc{su2021roformer,\n    title   = {RoFormer: Enhanced Transformer with Rotary Position Embedding},\n    author  = {Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu},\n    year    = {2021},\n    eprint  = {2104.09864},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n\n```bibtex\n@misc{shazeer2020glu,\n    title   = {GLU Variants Improve Transformer},\n    author  = {Noam Shazeer},\n    year    = {2020},\n    url     = {https://arxiv.org/abs/2002.05202}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fprogen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucidrains%2Fprogen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fprogen/lists"}