{"id":22371316,"url":"https://github.com/gt4sd/gt4sd-core","last_synced_at":"2025-05-15T00:11:06.478Z","repository":{"id":36954644,"uuid":"458309249","full_name":"GT4SD/gt4sd-core","owner":"GT4SD","description":"GT4SD, an open-source library to accelerate hypothesis generation in the scientific discovery process.","archived":false,"fork":false,"pushed_at":"2025-02-19T13:51:09.000Z","size":29201,"stargazers_count":351,"open_issues_count":0,"forks_count":76,"subscribers_count":16,"default_branch":"main","last_synced_at":"2025-05-01T13:43:43.365Z","etag":null,"topics":["deep-learning","generative-models","machine-learning","python"],"latest_commit_sha":null,"homepage":"https://gt4sd.github.io/gt4sd-core/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GT4SD.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-02-11T19:06:58.000Z","updated_at":"2025-04-30T15:15:24.000Z","dependencies_parsed_at":"2025-02-28T14:12:15.136Z","dependency_job_id":"460d39ae-2b1c-4bde-98d4-6dd7effc7be8","html_url":"https://github.com/GT4SD/gt4sd-core","commit_stats":null,"previous_names":[],"tags_count":92,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GT4SD%2Fgt4sd-core","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GT4SD%2Fgt4sd-core/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GT4SD%2Fgt4sd-core/releases","manifests_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/GT4SD%2Fgt4sd-core/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GT4SD","download_url":"https://codeload.github.com/GT4SD/gt4sd-core/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254249206,"owners_count":22039029,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","generative-models","machine-learning","python"],"created_at":"2024-12-04T20:18:52.835Z","updated_at":"2025-05-15T00:11:01.470Z","avatar_url":"https://github.com/GT4SD.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GT4SD (Generative Toolkit for Scientific Discovery)\n\n[![PyPI version](https://badge.fury.io/py/gt4sd.svg)](https://badge.fury.io/py/gt4sd)\n[![Actions tests](https://github.com/gt4sd/gt4sd-core/actions/workflows/tests.yaml/badge.svg)](https://github.com/gt4sd/gt4sd-core/actions)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![Contributions](https://img.shields.io/badge/contributions-welcome-blue)](https://github.com/GT4SD/gt4sd-core/blob/main/CONTRIBUTING.md)\n[![Docs](https://img.shields.io/badge/website-live-brightgreen)](https://gt4sd.github.io/gt4sd-core/)\n[![Total downloads](https://static.pepy.tech/badge/gt4sd)](https://pepy.tech/project/gt4sd)\n[![Monthly 
downloads](https://static.pepy.tech/badge/gt4sd/month)](https://pepy.tech/project/gt4sd)\n[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/GT4SD/gt4sd-core/main)\n[![DOI](https://zenodo.org/badge/458309249.svg)](https://zenodo.org/badge/latestdoi/458309249)\n[![2022 IEEE Open Software Services Award](https://img.shields.io/badge/Award-2022%20IEEE%20Open%20Software%20Services%20Award-yellow)](https://conferences.computer.org/services/2022/awards/oss_award.html)\n[![Paper DOI: 10.1038/s41524-023-01028-1](https://zenodo.org/badge/DOI/10.1038/s41524-023-01028-1.svg)](https://www.nature.com/articles/s41524-023-01028-1)\n\n\u003cimg src=\"./docs/_static/gt4sd_graphical_abstract.png\" alt=\"logo\" width=\"800\"\u003e\n\n\nThe **GT4SD** (Generative Toolkit for Scientific Discovery) is an open-source platform to accelerate hypothesis generation in the scientific discovery process. It provides a library for making state-of-the-art generative AI models easier to use.\n\nFor full details on the library API and examples see the [docs](https://gt4sd.github.io/gt4sd-core/).\nAlmost all pretrained models are also available via `gradio`-powered [web apps](https://huggingface.co/GT4SD) on Hugging Face Spaces.\n\n## Installation\n\n### Requirements\n\nCurrently `gt4sd` relies on:\n\n- python\u003e=3.7,\u003c=3.10\n- pip==24.0\n\nIf you need others, help us by [contributing](./CONTRIBUTING.md) to the project.\n\n### Conda\n\nThe recommended way to install `gt4sd` is to create a dedicated conda environment, which ensures all requirements are satisfied. For CPU:\n\n```sh\ngit clone https://github.com/GT4SD/gt4sd-core.git\ncd gt4sd-core/\nconda env create -f conda_cpu_mac.yml # for linux use conda_cpu_linux.yml\nconda activate gt4sd\npip install gt4sd\n```\n\n**NOTE 1:** By default `gt4sd` is installed with CPU requirements. 
For GPU usage replace with:\n\n```sh\nconda env create -f conda_gpu.yml\n```\n\n**NOTE 2:** In case you want to reuse an existing compatible environment (see [requirements](#requirements)), you can use `pip`, but as of now (:eyes: on [issue](https://github.com/GT4SD/gt4sd-core/issues/31) for changes), some dependencies require installation from GitHub, so for a complete setup install them with:\n\n```sh\npip install -r vcs_requirements.txt\n```\n\nA few VCS dependencies require Git LFS (make sure it's available on your system).\n\n### Development setup \u0026 installation\n\nIf you would like to contribute to the package, we recommend installing `gt4sd` in\neditable mode inside your `conda` environment:\n\n```sh\npip install --no-deps -e .\n```\n\nLearn more in [CONTRIBUTING.md](./CONTRIBUTING.md).\n\n## Getting started\n\nAfter installation you can use `gt4sd` right away in your discovery workflows.\n\n\u003cimg src=\"./docs/_static/gt4sd_case_study.jpg\" alt=\"logo\" width=\"800\"/\u003e\n\n\n### Running inference pipelines in your python code\n\nRunning an algorithm is as easy as typing:\n\n```python\nfrom gt4sd.algorithms.conditional_generation.paccmann_rl.core import (\n    PaccMannRLProteinBasedGenerator, PaccMannRL\n)\ntarget = 'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT'\n# algorithm configuration with default parameters\nconfiguration = PaccMannRLProteinBasedGenerator()\n# instantiate the algorithm for sampling\nalgorithm = PaccMannRL(configuration=configuration, target=target)\nitems = list(algorithm.sample(10))\nprint(items)\n```\n\nOr you can use the `ApplicationsRegistry` to run an algorithm instance using a serialized representation of the algorithm:\n\n```python\nfrom gt4sd.algorithms.registry import ApplicationsRegistry\ntarget = 'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT'\nalgorithm = ApplicationsRegistry.get_application_instance(\n    target=target,\n    algorithm_type='conditional_generation',\n    domain='materials',\n    algorithm_name='PaccMannRL',\n    
algorithm_application='PaccMannRLProteinBasedGenerator',\n    generated_length=32,\n    # include additional configuration parameters as **kwargs\n)\nitems = list(algorithm.sample(10))\nprint(items)\n```\n\n### Running inference pipelines via the CLI command\n\nGT4SD can run inference pipelines via the `gt4sd-inference` CLI command.\nIt lets you run all inference algorithms directly from the command line.\nTo see which algorithms are available and how to use the CLI for your favorite model,\ncheck out [examples/cli/README.md](./examples/cli/README.md).\n\nYou can run inference pipelines by simply typing:\n\n```console\ngt4sd-inference --algorithm_name PaccMannRL --algorithm_application PaccMannRLProteinBasedGenerator --target MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT --number_of_samples 10\n```\n\nThe command supports multiple parameters to select an algorithm and configure it for inference:\n\n```console\n$ gt4sd-inference --help\nusage: gt4sd-inference [-h] [--algorithm_type ALGORITHM_TYPE]\n                       [--domain DOMAIN] [--algorithm_name ALGORITHM_NAME]\n                       [--algorithm_application ALGORITHM_APPLICATION]\n                       [--algorithm_version ALGORITHM_VERSION]\n                       [--target TARGET]\n                       [--number_of_samples NUMBER_OF_SAMPLES]\n                       [--configuration_file CONFIGURATION_FILE]\n                       [--print_info [PRINT_INFO]]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --algorithm_type ALGORITHM_TYPE\n                        Inference algorithm type, supported types:\n                        conditional_generation, controlled_sampling,\n                        generation, prediction. (default: None)\n  --domain DOMAIN       Domain of the inference algorithm, supported types:\n                        materials, nlp. (default: None)\n  --algorithm_name ALGORITHM_NAME\n                        Inference algorithm name. 
(default: None)\n  --algorithm_application ALGORITHM_APPLICATION\n                        Inference algorithm application. (default: None)\n  --algorithm_version ALGORITHM_VERSION\n                        Inference algorithm version. (default: None)\n  --target TARGET       Optional target for generation represented as a\n                        string. Defaults to None, it can be also provided in\n                        the configuration_file as an object, but the\n                        commandline takes precendence. (default: None)\n  --number_of_samples NUMBER_OF_SAMPLES\n                        Number of generated samples, defaults to 5. (default:\n                        5)\n  --configuration_file CONFIGURATION_FILE\n                        Configuration file for the inference pipeline in JSON\n                        format. (default: None)\n  --print_info [PRINT_INFO]\n                        Print info for the selected algorithm, preventing\n                        inference run. Defaults to False. 
(default: False)\n```\n\nYou can use `gt4sd-inference` to directly get information on the configuration parameters for the selected algorithm:\n\n```console\ngt4sd-inference --algorithm_name PaccMannRL --algorithm_application PaccMannRLProteinBasedGenerator --print_info\nINFO:gt4sd.cli.inference:Selected algorithm: {'algorithm_type': 'conditional_generation', 'domain': 'materials', 'algorithm_name': 'PaccMannRL', 'algorithm_application': 'PaccMannRLProteinBasedGenerator', 'algorithm_version': 'v0'}\nINFO:gt4sd.cli.inference:Selected algorithm support the following configuration parameters:\n{\n \"batch_size\": {\n  \"description\": \"Batch size used for the generative model sampling.\",\n  \"title\": \"Batch Size\",\n  \"default\": 32,\n  \"type\": \"integer\",\n  \"optional\": true\n },\n \"temperature\": {\n  \"description\": \"Temperature parameter for the softmax sampling in decoding.\",\n  \"title\": \"Temperature\",\n  \"default\": 1.4,\n  \"type\": \"number\",\n  \"optional\": true\n },\n \"generated_length\": {\n  \"description\": \"Maximum length in tokens of the generated molcules (relates to the SMILES length).\",\n  \"title\": \"Generated Length\",\n  \"default\": 100,\n  \"type\": \"integer\",\n  \"optional\": true\n }\n}\nTarget information:\n{\n \"target\": {\n  \"title\": \"Target protein sequence\",\n  \"description\": \"AA sequence of the protein target to generate non-toxic ligands against.\",\n  \"type\": \"string\"\n }\n}\n```\n\n### Running training pipelines via the CLI command\n\nGT4SD provides a trainer client based on the `gt4sd-trainer` CLI command.\n\nThe trainer currently supports the following training pipelines:\n\n- `language-modeling-trainer`: language modelling via HuggingFace transformers and PyTorch Lightning.\n- `paccmann-vae-trainer`: PaccMann VAE models.\n- `granular-trainer`: multimodal compositional autoencoders supporting MLP, RNN and Transformer layers.\n- `guacamol-lstm-trainer`: GuacaMol LSTM models.\n- 
`moses-organ-trainer`: Moses Organ implementation.\n- `moses-vae-trainer`: Moses VAE models.\n- `torchdrug-gcpn-trainer`: TorchDrug Graph Convolutional Policy Network model.\n- `torchdrug-graphaf-trainer`: TorchDrug autoregressive GraphAF model.\n- `diffusion-trainer`: Diffusers model.\n- `gflownet-trainer`: GFlowNet model.\n\n```console\n$ gt4sd-trainer --help\nusage: gt4sd-trainer [-h] --training_pipeline_name TRAINING_PIPELINE_NAME\n                     [--configuration_file CONFIGURATION_FILE]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --training_pipeline_name TRAINING_PIPELINE_NAME\n                        Training type of the converted model, supported types:\n                        granular-trainer, language-modeling-trainer, paccmann-\n                        vae-trainer. (default: None)\n  --configuration_file CONFIGURATION_FILE\n                        Configuration file for the trainining. It can be used\n                        to completely by-pass pipeline specific arguments.\n                        (default: None)\n```\n\nTo launch a training you have two options.\n\nYou can either specify the training pipeline and the path of a configuration file that contains the needed training parameters:\n\n```sh\ngt4sd-trainer  --training_pipeline_name ${TRAINING_PIPELINE_NAME} --configuration_file ${CONFIGURATION_FILE}\n```\n\nOr you can provide the needed parameters directly as arguments:\n\n```sh\ngt4sd-trainer  --training_pipeline_name language-modeling-trainer --type mlm --model_name_or_path mlm --training_file /path/to/train_file.jsonl --validation_file /path/to/valid_file.jsonl\n```\n\nTo get more info on the arguments of a specific training pipeline, simply type:\n\n```sh\ngt4sd-trainer --training_pipeline_name ${TRAINING_PIPELINE_NAME} --help\n```\n\n### Saving a trained algorithm for inference via the CLI command\n\nOnce a training pipeline has been run via the `gt4sd-trainer`, it's possible to save the trained 
algorithm via `gt4sd-saving` for usage in compatible inference pipelines.\n\nHere is a small example for the `PaccMannGP` algorithm ([paper](https://doi.org/10.1021/acs.jcim.1c00889)).\n\nYou can train a model with `gt4sd-trainer` (a quick training run on very little data, not recommended for a realistic model :warning:):\n\n```sh\ngt4sd-trainer  --training_pipeline_name paccmann-vae-trainer --epochs 250 --batch_size 4 --n_layers 1 --rnn_cell_size 16 --latent_dim 16 --train_smiles_filepath src/gt4sd/training_pipelines/tests/molecules.smi --test_smiles_filepath src/gt4sd/training_pipelines/tests/molecules.smi --model_path /tmp/gt4sd-paccmann-gp/ --training_name fast-example --eval_interval 15 --save_interval 15 --selfies\n```\n\nSave the model with the compatible inference pipeline using `gt4sd-saving`:\n\n```sh\ngt4sd-saving --training_pipeline_name paccmann-vae-trainer --model_path /tmp/gt4sd-paccmann-gp/ --training_name fast-example --target_version fast-example-v0 --algorithm_application PaccMannGPGenerator\n```\n\nRun the algorithm via `gt4sd-inference` (again, the model produced in this example is trained on dummy data and will give dummy outputs; do not use it as is :no_good:):\n\n```sh\ngt4sd-inference --algorithm_name PaccMannGP --algorithm_application PaccMannGPGenerator --algorithm_version fast-example-v0 --number_of_samples 5  --target '{\"molwt\": {\"target\": 60.0}}'\n```\n\n### Uploading a trained algorithm on a public hub via the CLI command\n\nYou can easily upload trained and finetuned models to the public hub using `gt4sd-upload`. 
The syntax follows the saving pipeline:\n\n```sh\ngt4sd-upload --training_pipeline_name paccmann-vae-trainer --model_path /tmp/gt4sd-paccmann-gp --training_name fast-example --target_version fast-example-v0 --algorithm_application PaccMannGPGenerator\n```\n\n**NOTE:** GT4SD can be configured to upload models to a custom or self-hosted COS.\nAn example of self-hosting a COS (minio) locally as an upload target for your models can be found [here](https://gt4sd.github.io/gt4sd-core/source/gt4sd_server_upload_md.html).\n\n\n### Computing properties\n\nYou can compute properties of your generated samples using the `gt4sd.properties` submodule:\n\n```python\n\u003e\u003e\u003efrom gt4sd.properties import PropertyPredictorRegistry\n\u003e\u003e\u003esimilarity_predictor = PropertyPredictorRegistry.get_property_predictor(\"similarity_seed\", {\"smiles\" : \"C1=CC(=CC(=C1)Br)CN\"})\n\u003e\u003e\u003esimilarity_predictor(\"CCO\")\n0.0333\n\u003e\u003e\u003e# let's inspect what other parameters we can set for similarity measuring\n\u003e\u003e\u003esimilarity_predictor = PropertyPredictorRegistry.get_property_predictor(\"similarity_seed\", {\"smiles\" : \"C1=CC(=CC(=C1)Br)CN\", \"fp_key\": \"ECFP6\"})\n\u003e\u003e\u003esimilarity_predictor(\"CCO\")\n\u003e\u003e\u003e# inspect parameters\n\u003e\u003e\u003ePropertyPredictorRegistry.get_property_predictor_parameters_schema(\"similarity_seed\")\n'{\"title\": \"SimilaritySeedParameters\", \"description\": \"Abstract class for property computation.\", \"type\": \"object\", \"properties\": {\"smiles\": {\"title\": \"Smiles\", \"example\": \"c1ccccc1\", \"type\": \"string\"}, \"fp_key\": {\"title\": \"Fp Key\", \"default\": \"ECFP4\", \"type\": \"string\"}}, \"required\": [\"smiles\"]}'\n\u003e\u003e\u003e# predict other properties\n\u003e\u003e\u003eqed = PropertyPredictorRegistry.get_property_predictor(\"qed\")\n\u003e\u003e\u003eqed('CCO')\n0.4068\n\u003e\u003e\u003e# list 
properties\n\u003e\u003e\u003ePropertyPredictorRegistry.list_available()\n['activity_against_target',\n 'aliphaticity',\n ...\n 'scscore',\n 'similarity_seed',\n 'tpsa',\n 'weight']\n```\n\n### Additional examples\n\nFind more examples in [notebooks](./notebooks).\n\nYou can play with them right away using the provided Dockerfile: simply build the image and run it to explore the examples using Jupyter:\n\n```sh\ndocker build -f Dockerfile -t gt4sd-demo .\ndocker run -p 8888:8888 gt4sd-demo\n```\n\n## Supported packages\n\nBeyond implementing various generative modeling inference and training pipelines, GT4SD is designed to provide a high-level API that implements a harmonized interface for several existing packages:\n\n- [GuacaMol](https://github.com/BenevolentAI/guacamol): inference pipelines for the baseline models and training pipelines for LSTM models.\n- [Moses](https://github.com/molecularsets/moses): inference pipelines for the baseline models and training pipelines for VAEs and Organ.\n- [TorchDrug](https://github.com/DeepGraphLearning/torchdrug): inference and training pipelines for GCPN and GraphAF models. 
Training pipelines support custom datasets as well as datasets native in TorchDrug.\n- [MoLeR](https://github.com/microsoft/molecule-generation): inference pipelines for MoLeR (**MO**lecule-**LE**vel **R**epresentation) generative models for de-novo and scaffold-based generation.\n- [TAPE](https://github.com/songlab-cal/tape): encoder modules compatible with the protein language models.\n- [PaccMann](https://github.com/PaccMann/): inference pipelines for all algorithms of the PaccMann family as well as training pipelines for the generative VAEs.\n- [transformers](https://huggingface.co/transformers): training and inference pipelines for generative models from [HuggingFace Models](https://huggingface.co/models)\n- [diffusers](https://github.com/huggingface/diffusers): training and inference pipelines for generative models from [Diffusers Models](https://github.com/huggingface/diffusers)\n- [GFlowNets](https://github.com/recursionpharma/gflownet): training and inference pipeline for [Generative Flow Networks](https://yoshuabengio.org/2022/03/05/generative-flow-networks/)\n- [MolGX](https://github.com/GT4SD/molgx-core/): training and inference pipelines to generate small molecules satisfying target properties. The full implementation of MolGX, including additional functionalities, is available [here](https://github.com/GT4SD/molgx-core/).\n- [Regression Transformers](https://github.com/IBM/regression-transformer/): training and inference pipelines to generate small molecules, polymers or peptides based on numerical property constraints. 
For details [read the paper](https://www.nature.com/articles/s42256-023-00639-z).\n\n\n## References\n\nIf you use `gt4sd` in your projects, please consider citing the following:\n\n```bib\n@software{GT4SD,\n  author = {GT4SD Team},\n  month = {2},\n  title = {{GT4SD (Generative Toolkit for Scientific Discovery)}},\n  url = {https://github.com/GT4SD/gt4sd-core},\n  version = {main},\n  year = {2022}\n}\n\n@article{manica2022gt4sd,\n  title={Accelerating material design with the generative toolkit for scientific discovery},\n  author={Manica, Matteo and Born, Jannis and Cadow, Joris and Christofidellis, Dimitrios and Dave, Ashish and Clarke, Dean and Teukam, Yves Gaetan Nana and Giannone, Giorgio and Hoffman, Samuel C and Buchan, Matthew and others},\n  journal={npj Computational Materials},\n  volume={9},\n  number={1},\n  pages={69},\n  year={2023},\n  publisher={Nature Publishing Group UK London}\n}\n```\n\n## License\n\nThe `gt4sd` codebase is under MIT license.\nFor individual model usage, please refer to the model licenses found in the original packages.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgt4sd%2Fgt4sd-core","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgt4sd%2Fgt4sd-core","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgt4sd%2Fgt4sd-core/lists"}