{"id":13958640,"url":"https://github.com/microsoft/FS-Mol","last_synced_at":"2025-07-21T00:31:35.221Z","repository":{"id":45562953,"uuid":"395745872","full_name":"microsoft/FS-Mol","owner":"microsoft","description":"FS-Mol  is A Few-Shot Learning Dataset of Molecules, containing molecular compounds with measurements of activity against a variety of protein targets. The dataset is presented with a model evaluation benchmark which aims to drive few-shot learning research in the domain of molecules and graph-structured data.","archived":false,"fork":false,"pushed_at":"2023-02-03T20:59:37.000Z","size":463477,"stargazers_count":168,"open_issues_count":10,"forks_count":23,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-07-19T05:48:37.063Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null}},"created_at":"2021-08-13T17:56:16.000Z","updated_at":"2025-05-25T04:12:53.000Z","dependencies_parsed_at":"2023-02-19T19:35:16.290Z","dependency_job_id":null,"html_url":"https://github.com/microsoft/FS-Mol","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/microsoft/FS-Mol","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FFS-Mol","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FFS-Mol/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FFS-Mol/releases","manifests_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/microsoft%2FFS-Mol/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/FS-Mol/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FFS-Mol/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266221272,"owners_count":23894966,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-08T13:01:47.182Z","updated_at":"2025-07-21T00:31:30.212Z","avatar_url":"https://github.com/microsoft.png","language":"Python","funding_links":[],"categories":["分子"],"sub_categories":["网络服务_其他"],"readme":"\u003c!--\n\u003cscript type=\"application/ld+json\"\u003e\n  {\n    \"@context\": \"https://schema.org\",\n    \"@type\": \"Dataset\",\n    \"name\": \"FS-Mol\",\n    \"description\": \"A Few-Shot Learning Dataset of Molecules\",\n    \"url\": \"https://github.com/microsoft/FS-Mol/tree/main/datasets\",\n    \"license\": \"https://creativecommons.org/licenses/by-sa/3.0/\",\n    \"isAccessibleForFree\" : true,\n  }\n\u003c/script\u003e\n--\u003e\n\n# FS-Mol: A Few-Shot Learning Dataset of Molecules\n\nThis repository contains data and code for FS-Mol: A Few-Shot Learning Dataset of Molecules.\n\n## Installation\n\n1. Clone or download this repository\n2. 
Install dependencies\n\n   ```\n   cd FS-Mol\n\n   conda env create -f environment.yml\n   conda activate fsmol\n   ```\n\nThe code for the Molecule Attention Transformer baseline is added as a submodule of this repository. Hence, in order to be able to run MAT, one has to clone our repository via `git clone --recurse-submodules`. Alternatively, one can first clone our repository normally, and then set up submodules via `git submodule update --init`. If the MAT submodule is not set up, all the other parts of our repository should continue to work.\n\n## Data\n\nThe dataset is available as a download, [FS-Mol Data](https://figshare.com/ndownloader/files/31345321), split into `train`, `valid` and `test` folders. Additionally, we specify which tasks are to be used with the file `datasets/fsmol-0.1.json`, a default list of tasks for each data fold. We note that the complete dataset contains many more tasks. To train on all available training tasks, pass the training script argument `--task_list_file datasets/entire_train_set.json`. The task lists will be used to version FS-Mol in future iterations as more data becomes available via ChEMBL.\n\nTasks are stored as individual compressed [JSONLines](https://jsonlines.org/) files, with each line holding the information for a single datapoint of the task.\nEach datapoint is stored as a JSON dictionary, following a fixed structure:\n```json\n{\n    \"SMILES\": \"SMILES_STRING\",\n    \"Property\": \"ACTIVITY BOOL LABEL\",\n    \"Assay_ID\": \"CHEMBL ID\",\n    \"RegressionProperty\": \"ACTIVITY VALUE\",\n    \"LogRegressionProperty\": \"LOG ACTIVITY VALUE\",\n    \"Relation\": \"ASSUMED RELATION OF MEASURED VALUE TO TRUE VALUE\",\n    \"AssayType\": \"TYPE OF ASSAY\",\n    \"fingerprints\": [...],\n    \"descriptors\": [...],\n    \"graph\": {\n        \"adjacency_lists\": [\n           [... SINGLE BONDS AS PAIRS ...],\n           [... DOUBLE BONDS AS PAIRS ...],\n           [... 
TRIPLE BONDS AS PAIRS ...]\n        ],\n        \"node_types\": [...ATOM TYPES...],\n        \"node_features\": [...NODE FEATURES...],\n    }\n}\n```\n\n### FSMolDataset\nThe `fs_mol.data.FSMolDataset` class provides programmatic access in Python to the train/valid/test tasks of the few-shot dataset.\nAn instance is created from the data directory by `FSMolDataset.from_directory(/path/to/dataset)`.\nMore details and examples of how to use `FSMolDataset` are available in `fs_mol/notebooks/dataset.ipynb`.\n\n## Evaluating a new Model\n\nWe have provided an implementation of the FS-Mol evaluation methodology in `fs_mol.utils.eval_utils.eval_model()`.\nThis is a framework-agnostic Python method, and we demonstrate in detail how to use it to evaluate a new model in `notebooks/evaluation.ipynb`.\n\nNote that our baseline test scripts (`fs_mol/baseline_test.py`, `fs_mol/maml_test.py`, `fs_mol/mat_test.py`, `fs_mol/multitask_test.py` and `fs_mol/protonet_test.py`) use this method as well and can serve as examples of how to integrate per-task fine-tuning in TensorFlow (`maml_test.py`), fine-tuning in PyTorch (`mat_test.py`) and single-task training for scikit-learn models (`baseline_test.py`).\nThese scripts also support the `--task_list_file` parameter to choose different sets of test tasks, as required.\n\n## Baseline Model Implementations\n\nWe provide implementations for three key few-shot learning methods: Multitask learning, Model-Agnostic Meta-Learning, and Prototypical Networks, as well as evaluation on the Single-Task baselines and the Molecule Attention Transformer (MAT) [paper](https://arxiv.org/abs/2002.08264v1), [code](https://github.com/lucidrains/molecule-attention-transformer). \n\nAll results and associated plots are found in the `baselines/` directory. 
\n\nThese baseline methods can be run on the FS-Mol dataset as follows:\n\n### kNNs and Random Forests -- Single Task Baselines\n\nOur kNN and RF baselines are obtained by grid search over an industry-standard parameter set, detailed in the script `baseline_test.py`.\n\nThe baseline single-task evaluation can be run as follows, with a choice of kNN or randomForest model:\n\n```bash\npython fs_mol/baseline_test.py /path/to/data --model {kNN, randomForest}\n```\n\n### Molecule Attention Transformer\n\nSee the Molecule Attention Transformer (MAT) [paper](https://arxiv.org/abs/2002.08264v1) and [code](https://github.com/lucidrains/molecule-attention-transformer). \n\nThe Molecule Attention Transformer can be evaluated as:\n\n```bash\npython fs_mol/mat_test.py /path/to/pretrained-mat /path/to/data\n```\n\n### GNN-MAML pre-training and evaluation\n\nThe GNN-MAML model consists of a GNN operating on the molecular graph representations of the dataset. The model consists of an $8$-layer GNN with node-embedding dimension $128$. The GNN uses \"Edge-MLP\" message passing. The model was trained with a support set size of $16$ according to the MAML procedure [Finn 2017](http://proceedings.mlr.press/v70/finn17a/finn17a.pdf). The hyperparameters used in the model checkpoint are the default settings of `maml_train.py`.\n\nThe current defaults were used to train the final versions of GNN-MAML available here. \n\n```bash\npython fs_mol/maml_train.py /path/to/data \n```\n\nEvaluation is run as: \n\n```bash\npython fs_mol/maml_test.py /path/to/data --trained_model /path/to/gnn-maml-checkpoint\n```\n\n### GNN-MT pre-training and evaluation\n\nThe GNN-MT model consists of a GNN operating on the molecular graph representations of the dataset. The model consists of a $10$-layer GNN with node-embedding dimension $128$. The model uses principal neighbourhood aggregation (PNA) message passing. The hyperparameters used in the model checkpoint are the default settings of `multitask_train.py`. 
This method is similar to the task-only training approach of [Hu 2019](https://arxiv.org/abs/1905.12265v1).\n\n```bash\npython fs_mol/multitask_train.py /path/to/data \n```\n\nEvaluation is run as: \n\n```bash\npython fs_mol/multitask_test.py /path/to/gnn-mt-checkpoint /path/to/data\n```\n\n### Prototypical Networks (PN) pre-training and evaluation\n\nThe prototypical networks method [Snell 2017](https://proceedings.neurips.cc/paper/2017/file/cb8da6767461f2812ae4290eac7cbc42-Paper.pdf) extracts representations of support set datapoints and uses these to classify positive and negative examples. Here we use the Mahalanobis distance to measure the distance from query points to class prototypes. \n\n```bash\npython fs_mol/protonet_train.py /path/to/data \n```\n\nEvaluation is run as: \n\n```bash\npython fs_mol/protonet_test.py /path/to/pn-checkpoint /path/to/data\n```\n\n## Available Model Checkpoints\n\nWe provide pre-trained models for `GNN-MAML`, `GNN-MT` and `PN`, which can be downloaded from [figshare](https://figshare.com/projects/FS-Mol_Dataset_and_Models/125797) via the links below.\n\n| Model Name | Description                                                              | Checkpoint File                                                                       |\n| ---------- | ------------------------------------------------------------------------ | ------------------------------------------------------------------------------------- |\n| GNN-MAML   | Support set size 16. 8-layer GNN. Edge MLP message passing.              | [MAML-Support16_best_validation.pkl](https://figshare.com/ndownloader/files/31346701) |\n| GNN-MT     | 10-layer GNN. PNA message passing                                        | [multitask_best_model.pt](https://figshare.com/ndownloader/files/31338334)              |\n| PN         | 10-layer GNN, PNA message passing. 
ECFP+GNN, Mahalanobis distance metric | [PN-Support64_best_validation.pt](https://figshare.com/ndownloader/files/31307479)    |\n\n\n## Specifying, Training and Evaluating New Model Implementations\n\nFew-shot and single-task models can be defined flexibly, as demonstrated by the train and test scripts in `fs_mol`. \n\nWe give a detailed example of how to use the abstract class `AbstractTorchFSMolModel` in `notebooks/integrating_torch_models.ipynb` to integrate a new general PyTorch model, and note that the evaluation procedure described above is demonstrated on `sklearn` models in `fs_mol/baseline_test.py` and on a TensorFlow-based GNN model in `fs_mol/maml_test.py`.\n\n## Contributing\n\nThis project welcomes contributions and suggestions.  Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\nFor more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n## Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. 
Authorized use of Microsoft \ntrademarks or logos is subject to and must follow \n[Microsoft's Trademark \u0026 Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos is subject to those third parties' policies.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2FFS-Mol","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2FFS-Mol","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2FFS-Mol/lists"}