{"id":19682741,"url":"https://github.com/ml-jku/clamp","last_synced_at":"2025-06-20T01:05:06.236Z","repository":{"id":109347682,"uuid":"591987874","full_name":"ml-jku/clamp","owner":"ml-jku","description":"Code for the paper Enhancing Activity Prediction Models in Drug Discovery with the Ability to Understand Human Language","archived":false,"fork":false,"pushed_at":"2024-09-13T18:03:41.000Z","size":1182,"stargazers_count":98,"open_issues_count":0,"forks_count":6,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-06-20T01:04:48.707Z","etag":null,"topics":["assay-modeling","cheminformatics","contrastive-learning","drug-discovery","machine-learning","qsar","zero-shot"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2303.03363","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ml-jku.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-01-22T15:27:29.000Z","updated_at":"2025-05-14T08:15:59.000Z","dependencies_parsed_at":"2023-05-25T04:30:41.825Z","dependency_job_id":"3a544610-e1ec-4c4c-839d-5db5a4082c9b","html_url":"https://github.com/ml-jku/clamp","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ml-jku/clamp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-jku%2Fclamp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-jku%2Fclamp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-jku%2Fclamp/releases","manifests_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-jku%2Fclamp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ml-jku","download_url":"https://codeload.github.com/ml-jku/clamp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-jku%2Fclamp/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260857364,"owners_count":23073435,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["assay-modeling","cheminformatics","contrastive-learning","drug-discovery","machine-learning","qsar","zero-shot"],"created_at":"2024-11-11T18:11:58.337Z","updated_at":"2025-06-20T01:05:01.220Z","avatar_url":"https://github.com/ml-jku.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# :clamp: CLAMP\n\n[![arXiv](https://img.shields.io/badge/arXiv-2303.03363-b31b1b.svg)](https://arxiv.org/abs/2303.03363)\n[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ml-jku/clamp/blob/main/notebooks/CLAMP_colab_demo.ipynb)\n\nCLAMP (Contrastive Language-Assay Molecule Pre-Training) is trained on molecule-bioassay pairs. It can be instructed in natural language to predict the most relevant molecule, given a textual description of a bioassay, without training samples. 
In extensive experiments, our method yields improved predictive performance on few-shot learning benchmarks and zero-shot problems in drug discovery. \n\n## Approach\n\n![CLAMP](./data/figs/clamp.png)\n\n## :rocket: Updates\n\n- 04/24: create augmentations for assay-descriptions using [assay_augment.py](./clamp/dataset/assay_augment.py) \n- 11/23: Pretrained Model weights for Frequent Hitter (FH), a strong baseline for few- and zero-shot drug discovery. Use it running `fh_model = clamp.FH(device='cpu')`.\n- 10/23: PubChem23, a new version of the PubChem-dataset with \u003e~500k assays, in a preprocessed form is available (see [./data/pubchem.md](./data/pubchem.md))\n\n## :gear: Setup Environment\n\nWhen using `conda`, an environment can be set up using\n```bash\nconda env create -f env.yml\nconda activate clamp_env\n```\nTo activate the environment call ```conda activate clamp_env```.\nYou may need to adjust the CUDA version.\n\nAnother option is:\n```bash\npip install -e git+https://github.com/ml-jku/clamp.git\n```\n\n## :fire: Use a pretrained CLAMP model\n\nWarning: Currently only one version is available. 
We will update this repo with new pretrained models.\n\n```python\nimport torch\nimport clamp\n\nmodel = clamp.CLAMP(device='cpu')\nmodel.eval()\n\nmolecules = [\n    'CCOP(=O)(Nc1cccc(Cl)c1)OCC', #inactive\n    'O=C(O)c1ccccc1O', #inactive\n    'NNP(=S)(NN)c1ccccc1', #active\n    'CC(=O)OC1=CC=CC=C1C(=O)O', # Aspirin\n    ]\nassay_descriptions = [\n    'HIV: Experimentally measured abilities to inhibit HIV replication.',\n    ]\n\nwith torch.no_grad():\n    logits = model.forward_dense(molecules, assay_descriptions)\n    probs = logits.softmax(dim=0).cpu().numpy() # probs for molecules\n\nprint(\"Mol probs for assay:\", probs[:,0]) # res: [0.258 0.235 0.269  0.236]\n```\n\n\n## :lab_coat: Reproduce\n\n### Setup FS-Mol\nFor the [preprocessed FS-Mol dataset](https://cloud.ml.jku.at/s/dCjrt9c4arbz6rF/download) used in the paper, run the following commands, which download the archive, unzip it, and delete the zip file from your clamp directory:\n```bash\nwget -N -r https://cloud.ml.jku.at/s/dCjrt9c4arbz6rF/download -O fsmol.zip\nunzip fsmol.zip; rm fsmol.zip\n```\n\nTo download and preprocess from the original source:\n```python clamp/dataset/prep_fsmol.py --data_dir=./data/fsmol/```\n\nTo compute the compound encodings as input for your model, run\n```bash\npython clamp/dataset/encode_compound.py \\\n--compounds=./data/fsmol/compound_names.parquet \\\n--compound2smiles=./data/fsmol/compound_smiles.parquet \\\n--fp_type=morganc+rdkc --fp_size=8096\n```\n\nTo compute the assay encodings as input for your model, run\n```bash\npython clamp/dataset/encode_assay.py --assay_path=./data/fsmol/assay_names.parquet --encoding=clip --gpu=0 --columns \\\nassay_type_description description assay_category assay_cell_type assay_chembl_id assay_classifications assay_organism assay_parameters assay_strain assay_subcellular_fraction assay_tax_id assay_test_type assay_tissue assay_type bao_format bao_label cell_chembl_id confidence_description confidence_score document_chembl_id relationship_description 
relationship_type src_assay_id src_id target_chembl_id tissue_chembl_id variant_sequence \\\n--suffix=all\n```\nor use ```--encoding=lsa```.\n\n### Setup PubChem\n\nFor the [version used in the paper](https://cloud.ml.jku.at/s/2ybfLRXWSYb4DZN/download), as well as instructions for generating an up-to-date version, see ```./data/pubchem.md```.\n\n## :fire: Train your own model\n\nRun the following command (adjust hyperparameters by passing them as command-line arguments or by editing the file ```./hparams/default.json```):\n```bash\npython clamp/train.py --dataset=./data/fsmol --assay_mode=clip --split=FSMOL_split\n```\n\nThis should result in a model with a zero-shot $\\text{AUROC}$ of $0.70$ and $\\Delta \\text{AP}$ of $0.19$ on the test-set.\n\n## Evaluate a pretrained CLAMP model\n\nNote that there are alterations in the exact split as well as in the pretraining (dropped MoleculeNet molecules).\n\nTo compute the CLIP assay features, run:\n```\npython clamp/dataset/encode_assay.py --assay_path=./data/pubchem18/assay_names.parquet --encoding=clip --gpu=0 --columns title\n```\nand for the compound-features:\n```\npython clamp/dataset/encode_compound.py --compound2smiles=./data/pubchem18/compound_smiles.parquet --compounds=./data/pubchem18/compound_names.parquet --fp_type=morganc+rdkc --fp_size=8192\n```\n\nNow you can use the pretrained CLAMP model:\n```bash\npython clamp/train.py --model=PretrainedCLAMP --dataset=./data/pubchem18 --assay_mode=clip --split=time_a --epoch_max=0\n```\n(The warning about the checkpoint can be ignored.) \nThis should return $\\Delta \\text{AP}$ of $0.13$ on the test-set.\n\n## Downstream Evaluation:\n### Setup MoleculeNet\n\nTo download the preprocessed downstream datasets, call\n```bash\nwget -N -r https://cloud.ml.jku.at/s/pyJMm4yQeWFM2gG/download -O downstream.zip\nunzip downstream.zip; rm downstream.zip\n```\n\nTo download and preprocess the downstream datasets from the source, call:\n```python clamp/dataset/prep_moleculenet.py```\n(This does not include Tox21-10k.)\n\n## :test_tube: Linear Probing\nGet a clamp-encoding\n```\npython 
clamp/dataset/encode_compound.py --compound2smiles=./data/moleculenet/tox21/compound_smiles.parquet --fp_type=clamp\n```\nRun linear probing on this encoding\n```\npython clamp/linear_probe.py ./data/moleculenet/tox21/ --split=scaffold_split --compound_mode=clamp\n```\n\nYou can also use the clamp-encoding of a pretrained model by providing an mlflow run-directory:\nYou have to specify the correct compound_features_size as well as the assay_features_size of the model.\n```\npython clamp/linear_probe.py ./data/moleculenet/hiv/ --split=scaffold_split --run_dir=./mlruns/711448512597702417/c00af103806c4243b816ecf2aed7387a/ --compound_features_size=8192 --assay_features_size=867\n```\n\nA further example can be found in the [colab-demo](https://colab.research.google.com/github/ml-jku/clamp/blob/main/notebooks/CLAMP_colab_demo.ipynb).\n\n\n## :books: Cite\nIf you find this work helpful, please cite\n```bibtex\n@article{seidl2023clamp,\n   author = {Seidl, Philipp and Vall, Andreu and Hochreiter, Sepp and Klambauer, G{\\\"u}nter},\n   title = {Enhancing Activity Prediction Models in Drug Discovery with the Ability to Understand Human Language},\n   journal = {Proceedings of the 40th International Conference on Machine Learning (ICML)},\n   institution = {Institute for Machine Learning, Johannes Kepler University, Linz},\n   year = {2023},\n   month = {July},\n   eprint={2303.03363},\n   doi = {}\n}\n```\n\n## Keywords\nDrug Discovery, Machine Learning, Zero-shot, NLP, LLM, Scientific Language Model\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fml-jku%2Fclamp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fml-jku%2Fclamp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fml-jku%2Fclamp/lists"}