{"id":43894399,"url":"https://github.com/recursionpharma/mole_public","last_synced_at":"2026-02-06T17:14:50.156Z","repository":{"id":297568174,"uuid":"867719701","full_name":"recursionpharma/mole_public","owner":"recursionpharma","description":"Recursion's molecular foundation model","archived":false,"fork":false,"pushed_at":"2025-06-06T05:51:41.000Z","size":300,"stargazers_count":50,"open_issues_count":4,"forks_count":2,"subscribers_count":5,"default_branch":"trunk","last_synced_at":"2025-06-06T06:29:14.410Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/recursionpharma.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-10-04T15:32:02.000Z","updated_at":"2025-06-06T05:51:42.000Z","dependencies_parsed_at":"2025-06-06T06:39:44.806Z","dependency_job_id":null,"html_url":"https://github.com/recursionpharma/mole_public","commit_stats":null,"previous_names":["recursionpharma/mole_public"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/recursionpharma/mole_public","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recursionpharma%2Fmole_public","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recursionpharma%2Fmole_public/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recursionpharma%2Fmole_public/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recursionpharma%2Fmole_public/manifests","ow
ner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/recursionpharma","download_url":"https://codeload.github.com/recursionpharma/mole_public/tar.gz/refs/heads/trunk","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recursionpharma%2Fmole_public/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29169403,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-06T16:33:35.550Z","status":"ssl_error","status_checked_at":"2026-02-06T16:33:30.716Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-06T17:14:49.451Z","updated_at":"2026-02-06T17:14:50.149Z","avatar_url":"https://github.com/recursionpharma.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![scorecard-score](https://github.com/recursionpharma/octo-guard-badges/blob/trunk/badges/repo/mole_public/maturity_score.svg?raw=true)](https://infosec-docs.prod.rxrx.io/octoguard/scorecards/mole_public)\n[![scorecard-status](https://github.com/recursionpharma/octo-guard-badges/blob/trunk/badges/repo/mole_public/scorecard_status.svg?raw=true)](https://infosec-docs.prod.rxrx.io/octoguard/scorecards/mole_public)\n# MolE - Molecular Embeddings\n\n\u003c!-- TABLE OF CONTENTS --\u003e\n\u003cdetails open=\"open\"\u003e\n  
\u003csummary\u003eTable of Contents\u003c/summary\u003e\n  \u003col\u003e\n    \u003cli\u003e\u003ca href=\"#about-the-project\"\u003eAbout The Project\u003c/a\u003e\u003c/li\u003e\n    \u003cli\u003e\n      \u003ca href=\"#getting-started\"\u003eGetting Started\u003c/a\u003e\n      \u003cul\u003e\n        \u003cli\u003e\u003ca href=\"#installation\"\u003eInstallation\u003c/a\u003e\u003c/li\u003e\n        \u003cli\u003e\u003ca href=\"#download-pretrained-models\"\u003eDownload pretrained models\u003c/a\u003e\u003c/li\u003e\n      \u003c/ul\u003e\n    \u003c/li\u003e\n    \u003cli\u003e\n      \u003ca href=\"#usage\"\u003eUsage\u003c/a\u003e\n      \u003cul\u003e\n        \u003cli\u003e\u003ca href=\"#how-to-train-a-model\"\u003eHow to train a model?\u003c/a\u003e\u003c/li\u003e\n        \u003cli\u003e\u003ca href=\"#how-to-predict-properties-of-new-molecules\"\u003eHow to predict properties of new molecules?\u003c/a\u003e\u003c/li\u003e\n        \u003cli\u003e\u003ca href=\"#how-to-compute-embeddings-for-molecules\"\u003eHow to compute embeddings for molecules?\u003c/a\u003e\u003c/li\u003e\n      \u003c/ul\u003e\n    \u003c/li\u003e\n    \u003cli\u003e\u003ca href=\"#license\"\u003eLicense\u003c/a\u003e\u003c/li\u003e\n  \u003c/ol\u003e\n\u003c/details\u003e\n\n\u003c!-- ABOUT THE PROJECT --\u003e\n## About The Project\n\nMolE is Recursion's foundation model for chemistry, which combines geometric deep learning with\ntransformers, an architecture commonly used to train Large Language Models (LLMs), to learn a meaningful\nrepresentation of molecules. MolE was designed to mitigate the challenge of accurately predicting\nchemical properties from small public or private datasets. MolE leverages extensive labeled and unlabeled\ndatasets in two pretraining steps. 
First, it follows a novel self-supervised strategy that uses the graph\nrepresentation of ~842 million molecules and is designed to learn to represent chemical structures properly.\nThis is followed by massive multi-task training to assimilate biological information.\n\n\n![plot](./docs/MolE_fig.png)\n\n\n\u003c!-- GETTING STARTED --\u003e\n## Getting started\n\n### Installation\n\nTo install MolE in your local environment, follow these steps:\n\nFirst, create and activate your virtual environment:\n\n```bash\n# create a new virtual environment using pyenv\npyenv virtualenv mole\n\n# activate the environment\npyenv activate mole\n```\n\nNext, clone this repo and move into it:\n\n```bash\ngit clone https://github.com/recursionpharma/mole_public.git\ncd mole_public\n```\n\nThen install the project dependencies. This should take less than 30 minutes on a typical CPU:\n\n```bash\n# For Mac or CPU only:\npip install -r requirements/main_\u003cPYTHON_VERSION\u003e.txt\n\n# For CUDA:\npip install -r requirements/main_\u003cPYTHON_VERSION\u003e_gpu.txt\n\n```\nwhere `\u003cPYTHON_VERSION\u003e` is either `3.9` or `3.10`.\n\nFinally, install `mole`, which should take a few minutes:\n\n```bash\npip install -e .\n```\n\n**NOTE**: If you are a Mac user, consider setting `PYTORCH_ENABLE_MPS_FALLBACK=1` as an environment variable to avoid issues between `torch` and M1 processors. You can do so by running:\n\n```bash\necho \"export PYTORCH_ENABLE_MPS_FALLBACK=1\" \u003e\u003e ~/.bashrc\n```\n\n### Download pretrained models\nTODO\n\n\n\u003c!-- Usage --\u003e\n## Usage\n\n### How to train a model?\n\nModels can be easily trained using `mole_train` from the command line. `mole_train` is powered by [hydra](https://hydra.cc/docs/intro/), which lets you compose a configuration file and start the job.\n\nYou can always see the complete configuration file before launching the job by adding `--cfg job --resolve` at the end of the command. 
This will print the configuration file.\n\n### Fine tune MolE\nIf you need to fine-tune MolE on a smaller dataset (e.g. to predict activity in an internal project), you can do so with `mole_train model=finetune`, where you need to specify:\n\n- *data_file* [string] - the path to the file containing the training set on your local computer.\n- *checkpoint_path* [string] - the path to the file containing the pretrained model. If `null`, training starts from a randomly initialized model.\n- *dropout* [float] - the dropout used in the prediction head. This should be a value between 0 and 1.\n- *lr* [float] - the learning rate used during training. Note that a linear warmup is applied at the beginning of training.\n\nOptionally, you can also specify:\n\n- *task* [string] - whether the task is a `regression` or `classification` problem. *Default: regression*\n- *num_tasks* [int] - the number of tasks. This should match the number of property columns in the file containing your training set. 
*Default: 1*\n\n**Example:**\n\n```bash\n# Regression Example\nmole_train model=finetune  data_file='data/TDC_Half_Life_Obach_train_seed0.parquet' checkpoint_path=null dropout=0.1 lr=1.0e-06 task=regression num_tasks=1 model.name='MolE_Finetune_Regression' model.hyperparameters.datamodule.validation_data='data/TDC_Half_Life_Obach_valid_seed0.parquet'\n\n# Classification Example\nmole_train model=finetune  data_file='data/TDC_HIA_Hou_train_seed0.csv' checkpoint_path=null dropout=0.1 lr=1.0e-06 task=classification num_tasks=1 model.name='MolE_Finetune_Classification' model.hyperparameters.datamodule.validation_data='data/TDC_HIA_Hou_valid_seed0.csv'\n```\n\nTraining data should be a CSV file containing a column named **smiles** and at least one property column used for training:\n\n| smiles    | Property1 | Property2 |\n| :---:     | :---:     | :---:     |\n| CCC       | 301       | 283       |\n| c1ccccc1  | 192       | 327       |\n\n\n### How to predict properties of new molecules?\n\nOnce you have trained a model, you can use it to predict properties of new molecules in the following ways.\n\n#### From a Python shell or Jupyter notebook\n```python\nfrom mole import mole_predict\nimport pandas as pd\n\nsmiles = ['CCC', 'CCCCCC', 'CC', 'CCCCC']  # list of SMILES strings\n\npredictions = mole_predict.predict(smiles=smiles, task='regression', num_tasks=1, pretrained_model=\u003cPATH_TO_CHECKPOINT\u003e, batch_size=32, num_workers=4)\n\n# put the predictions in a DataFrame, with the input SMILES as the first column\ndf = pd.DataFrame(predictions)\ndf.insert(0, 'smiles', smiles)\ndf.head()\n```\n#### From command line\n\n```bash\nmole_predict --smiles \"CCC c1ccccc1\" --task regression --num_tasks 2 --pretrained_model \u003cpath to checkpoint\u003e\n```\n\nTO DO: predict from SMILES read directly from a file\n\n### How to compute embeddings for molecules?\n\nYou can also compute embeddings of molecules using a pre-trained MolE model. 
These embeddings can be used as molecular fingerprints for model training or similarity search.\n\n#### From a Python shell or Jupyter notebook\n```python\nfrom mole import mole_predict\n\n\nsmiles = ['CCC', 'CCCCCC', 'CC', 'CCCCC']  # list of SMILES strings\n\nembeddings = mole_predict.encode(smiles=smiles, pretrained_model=\u003cPATH_TO_CHECKPOINT\u003e, batch_size=32, num_workers=4)\nembeddings.shape\n```\n\n\u003c!-- LICENSE --\u003e\n## License\n\nDistributed under the Attribution-NonCommercial 4.0 International License (CC-BY-NC 4.0). See `LICENSE` for more information.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frecursionpharma%2Fmole_public","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frecursionpharma%2Fmole_public","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frecursionpharma%2Fmole_public/lists"}