{"id":15118910,"url":"https://github.com/rdkit/PREFER","last_synced_at":"2025-09-28T01:31:26.582Z","repository":{"id":152195002,"uuid":"605667538","full_name":"rdkit/PREFER","owner":"rdkit","description":null,"archived":false,"fork":false,"pushed_at":"2023-07-28T07:25:31.000Z","size":290,"stargazers_count":28,"open_issues_count":2,"forks_count":3,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-01-06T04:22:50.976Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rdkit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-23T16:39:38.000Z","updated_at":"2024-10-31T13:43:28.000Z","dependencies_parsed_at":null,"dependency_job_id":"403d0139-163e-40bd-a5d0-b1063703cb64","html_url":"https://github.com/rdkit/PREFER","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdkit%2FPREFER","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdkit%2FPREFER/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdkit%2FPREFER/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdkit%2FPREFER/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rdkit","download_url":"https://codeload.github.com/rdkit/PREFER/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_co
unt":234475315,"owners_count":18839358,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-26T01:53:40.155Z","updated_at":"2025-09-28T01:31:16.565Z","avatar_url":"https://github.com/rdkit.png","language":"Python","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"readme":"# Benchmarking and Property Prediction Framework (PREFER)\n\nThe PREFER framework automates the evaluation of different combinations of molecular representations and machine learning models for predicting molecular properties. \nIt covers a range of molecular representations, from classical ones (e.g. fingerprints and 2D descriptors) to data-driven ones (e.g. Continuous and Data-Driven Descriptors (CDDD) [1] or MoLeR [2]).\nPREFER uses AutoSklearn [3] for ML model selection and hyperparameter tuning.\n\n![caption](prefer/docs/PREFER_scheme.png)\n\n*General overview of the PREFER framework where the Model Selection part is based on [3].*\n\n## Getting Started\n\n### Installation\n\n#### Python Environment\nThe main conda environment for using PREFER can be installed from `prefer-environment.yml`, as follows:\n\n```\nconda env create -f prefer-environment.yml\n```\n\nDepending on the models employed to generate model-based molecular representations, other environments need to be installed (one for each model). The supported models in the current PREFER code are CDDD [1] and MoLeR [2]. 
The corresponding environments can be found in `moler-environment-light.yml` and `cddd-environment-light.yml` and can be installed as follows:\n\n```\n# for MoLeR\nconda env create -f moler-environment-light.yml\n# or, for CDDD\nconda env create -f cddd-environment-light.yml\n```\n\nBefore running any experiments, the relevant paths need to be set (including the cddd and moler folders, which are integrated into PREFER as git submodules), as follows:\n\n```\nPYTHONPATH=\"path_to/PREFER/prefer/model_based_representations/models/cddd/:path_to/PREFER/prefer/model_based_representations/models/molecule-generation/:path_to/PREFER/:$PYTHONPATH\"\nexport PYTHONPATH\n```\n\nNew models should be included as git submodules and added to the PYTHONPATH.\n\n#### Conda Environments in Jupyter \nTo use the PREFER conda environment in a Jupyter notebook, the environment needs to be added to Jupyter's kernelspec:\n\n```\nconda activate prefer-env\npython -m ipykernel install --user --name prefer-env --display-name \"Python (prefer-env)\"\n```\n\nCheck that Jupyter has access to this environment by running:\n\n```\njupyter kernelspec list\n```\n\nThe newly added environment `Python (prefer-env)` should now be available in Jupyter. 
\n\n\n\n## Prerequisites\n\nTo run PREFER, we provide one notebook (Run-PREFER.ipynb) and one Python script (run_prefer_automation.py).\n\nThe main steps are as follows:\n\n### STEP 0: clone the repository and unpack the git submodules\nOnce you have cloned this repository, go into the cloned folder and run the following command:\n\n```\ngit submodule update --init --recursive\n```\n\nThis is needed to unpack the git submodules that connect PREFER to the models used to compute the model-based representations.\n\n### STEP 1: download public test datasets\nTwo public datasets can be used to test the code:\n- [logD](https://www.ebi.ac.uk/chembl/g/#browse/activities/filter/document_chembl_id%3ACHEMBL3301361%20AND%20standard_type%3A(%22LogD7.4%22)) from ChEMBL\n- [solubility](https://pubchem.ncbi.nlm.nih.gov/bioassay/1996) from PubChem\n\n### STEP 2: download models for calculating data-based molecular representations\nTwo models are currently supported as submodules in PREFER: CDDD and MoLeR. \nPre-trained models can be downloaded from:\n\n- CDDD: [here](https://drive.google.com/open?id=1oyknOulq_j0w9kzOKKIHdTLo5HphT99h)\n- MoLeR: [here](https://figshare.com/ndownloader/files/34642724)\n\nSave these trained models locally, since they will be used afterwards. \n\n\n### STEP 3: set the configuration files\nFor each PREFER job, a YAML config file needs to be prepared as follows:\n\n1. Main settings:\n```\npath_to_df: 'path_to_df'\nexperiment_name: 'experiment_name'\nid_column_name:  'id_column_name'\nsmiles_column_name:  'smiles_column_name'\nproperties_column_name_list: \n      - 'property_1_col_name'\n      - 'property_2_col_name'\nproblem_type: 'regression' # or 'classification'\nsplitting_strategy: 'random' # or 'cluster' or 'temporal'\ntemporal_info_column_name: 'temporal_info_column_name'\n```\n\nExamples are provided in ./config_files.\n\n2. 
Settings for model-based representations:\n```\nmodel_based_representations:\n    'model_name': \n        'path_to_model': 'path to model folder' # see STEP 2\n        'conda_env': 'name of the conda env installed for this model'\n        'submodule_path': 'path to the submodule folder included in PREFER for running the model' # e.g. path_to/prefer/model_based_representations/models/cddd/\n    \nprefer_path: 'path_to_/PREFER/'\n```\n\nAn example configuration file for the representations is provided in ./config_files/config_model_based_representations.yaml.\n\n\n\n### STEP 4: run Run-PREFER.ipynb notebook\nTo run the notebook `Run_PREFER.ipynb`, first select the correct kernel (Python (prefer-env)) and then update the required paths, in particular:\n\n- sys.path.append('path_to/PREFER/')\n- sys.path.append('path_to/models/cddd/') # to connect CDDD model\n- sys.path.append('path_to/models/molecule-generation/') # to connect MoLeR model\n\nRunning the notebook creates a folder (PREFER_results) containing the main results (benchmarking object and models). \nIn addition, folders named {model_name}_representations_{experiment_name} are created, containing the model-based representations.\n\nThe notebook also includes an example of how to use the stored PREFER-model-wrapper to predict new samples. This way, the best model found for each molecular representation can be used later to predict the property under analysis. \n\nAn automated version of the notebook can be found in `run_prefer_automation.py`. 
You can run it from the terminal with the following commands:\n\n```\nconda activate prefer-env\n\nPYTHONPATH=\"path_to/PREFER/prefer/model_based_representations/models/cddd/:path_to/PREFER/prefer/model_based_representations/models/molecule-generation/:path_to/PREFER/:$PYTHONPATH\"\nexport PYTHONPATH\n\n# --prefer_args: main YAML configuration file (see STEP 3, item 1)\n# --model_based_representations_args: YAML configuration file for the models used to compute the representations (see STEP 3, item 2)\npython run_prefer_automation.py --prefer_args path_to_yaml_configuration_file --model_based_representations_args path_to_yaml_configuration_file_for_models_used_to_compute_the_representations\n```\n\n\n## WARNING\nPlease make sure that you select the right model type for the dataset used (e.g. for a classification model, binary labels must be provided in the dataset). \n\n## Authors\n\n* **Jessica Lanini** \n\nWith the contribution of\n- Nadine Schneider\n- Gianluca Santarossa\n- Sarah Lewis\n- Krzysztof Maziarz\n- Marwin Segler\n- Hubert Misztela\n\n\n## References\n[1] Winter, Robin, et al. \"Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations.\" Chemical Science 10.6 (2019): 1692-1701.\n\n[2] Maziarz, Krzysztof, et al. \"Learning to extend molecular scaffolds with structural motifs.\" arXiv preprint arXiv:2103.03864 (2021).\n\n[3] Feurer, Matthias, et al. \"Efficient and robust automated machine learning.\" Advances in Neural Information Processing Systems 28 (2015).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frdkit%2FPREFER","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frdkit%2FPREFER","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frdkit%2FPREFER/lists"}