https://github.com/MolecularAI/pysmilesutils
Utilities for working with SMILES based encodings of molecules for deep learning (PyTorch oriented)
- Host: GitHub
- URL: https://github.com/MolecularAI/pysmilesutils
- Owner: MolecularAI
- License: apache-2.0
- Created: 2021-06-29T11:45:47.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2022-12-02T12:14:16.000Z (about 2 years ago)
- Last Synced: 2024-05-06T00:03:26.791Z (9 months ago)
- Language: Python
- Homepage:
- Size: 11.2 MB
- Stars: 63
- Watchers: 8
- Forks: 17
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
- top-pharma50 - **MolecularAI/pysmilesutils** - 05-23 09:54:27 | (Ranked by starred repositories)
README
# PySMILESutils
PySMILES utilities is a package of tools for handling encoding and decoding of SMILES for deep learning applications in PyTorch. The package contains a flexible tokenizer that can be used to analyze a given SMILES dataset using regular expressions and build a vocabulary of tokens, which can subsequently be used to encode the molecules, via their SMILES, into PyTorch tensors.
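
To illustrate the idea of regex-based tokenization and vocabulary building, here is a minimal, dependency-free sketch. It does not use the package's own tokenizer class; the regular expression, the special token names and the function names are all assumptions made for this example, and the encoding stops at plain integer lists rather than PyTorch tensors.

```python
import re
from collections import Counter

# Hypothetical pattern for this sketch: bracket atoms, two-letter halogens,
# two-digit ring closures, then any single character as a fallback.
SMILES_TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

# Illustrative special tokens, not necessarily the ones the package uses.
SPECIAL_TOKENS = ["<pad>", "<start>", "<end>", "<unk>"]


def tokenize(smiles):
    """Split one SMILES string into a list of string tokens."""
    return SMILES_TOKEN_RE.findall(smiles)


def build_vocabulary(smiles_list):
    """Map every token seen in the dataset to an integer index."""
    counts = Counter(tok for smi in smiles_list for tok in tokenize(smi))
    tokens = SPECIAL_TOKENS + sorted(counts)
    return {tok: idx for idx, tok in enumerate(tokens)}


def encode(smiles, vocabulary):
    """Encode a SMILES string as token indices with start/end markers."""
    unk = vocabulary["<unk>"]
    ids = [vocabulary.get(tok, unk) for tok in tokenize(smiles)]
    return [vocabulary["<start>"]] + ids + [vocabulary["<end>"]]


dataset = ["CCO", "c1ccccc1Cl", "C[NH3+]", "BrCCBr"]
vocab = build_vocabulary(dataset)
encoded = [encode(smi, vocab) for smi in dataset]
```

Each list in `encoded` could then be converted with `torch.tensor` and padded to a common length before being fed to a model.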
The augment class can be used for data augmentation via SMILES enumeration or atom order randomization.
Moreover, the package contains a variety of dataset, sampler and dataloader classes for PyTorch that cover common data-handling tasks. The BucketBatchSampler divides the dataset into buckets and randomly creates mini-batches from within each bucket. This way, the mini-batches consist of SMILES of approximately similar length, and sequence padding can be kept to a minimum, which speeds up training.
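
The bucketing idea can be sketched in plain Python as follows. This is not the package's BucketBatchSampler; the function name, its parameters, and the choice to form buckets by sorting on length are assumptions made for illustration.

```python
import random


def bucket_batches(lengths, bucket_size, batch_size, seed=None):
    """Return lists of dataset indices whose items have similar lengths.

    lengths[i] is the length of item i. Indices are sorted by length, cut
    into buckets of `bucket_size` items, shuffled inside each bucket, and
    split into mini-batches. The order of the batches is shuffled as well.
    """
    rng = random.Random(seed)
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = []
    for start in range(0, len(order), bucket_size):
        bucket = order[start:start + bucket_size]
        rng.shuffle(bucket)  # randomness within a bucket
        for b in range(0, len(bucket), batch_size):
            batches.append(bucket[b:b + batch_size])
    rng.shuffle(batches)  # randomness across buckets
    return batches


smiles = ["C", "CC", "CCO", "CCCC", "c1ccccc1", "CC(=O)O", "CCN(CC)CC", "O"]
lengths = [len(s) for s in smiles]
batches = bucket_batches(lengths, bucket_size=4, batch_size=2, seed=0)
```

Because every batch is drawn from a single bucket, short and long SMILES never share a batch, so little padding is needed. A list of index lists like `batches` can be passed to a PyTorch `DataLoader` through its `batch_sampler` argument.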
For datasets that are too large to fit in memory, chunk-based loading can be applied, and for data that needs pre-augmentation (e.g. slow Levenshtein augmentation), the epochs can be pre-created on disk.
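
As a rough illustration of chunk-based loading, the sketch below reads an iterable of lines in fixed-size chunks so that only one chunk is held in memory at a time. The function name `iter_chunks` and the one-SMILES-per-line layout are assumptions for this example and do not reflect the package's dataset classes.

```python
import io
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield lists of at most `chunk_size` stripped, non-empty lines.

    `lines` can be any iterable of strings, such as an open text file.
    """
    iterator = (line.strip() for line in lines if line.strip())
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# An in-memory stand-in for a large file with one SMILES per line.
fake_file = io.StringIO("CCO\nCCN\nc1ccccc1\nCC(=O)O\nCCCl\n")
chunks = list(iter_chunks(fake_file, chunk_size=2))
```

With a real file, the same generator could be driven by `with open(path) as handle:` and each chunk tokenized and batched before the next one is read.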
## Prerequisites
Before you begin, ensure you have met the following requirements:
* Linux, Windows or macOS platforms are supported - as long as the dependencies are supported on these platforms.
* You have installed [anaconda](https://www.anaconda.com/) or [miniconda](https://docs.conda.io/en/latest/miniconda.html) with python 3.6 - 3.8
The tool has been developed on a Linux platform.
## Installation
### Dependencies
Dependencies are listed in the environment.yml file and can be installed in the conda environment, either during creation
```bash
conda env create -f environment.yml
```
or by updating an already activated environment
```bash
conda env update --file environment.yml
```
### Installation with pip
```sh
git clone https://github.com/MolecularAI/pysmilesutils.git
cd pysmilesutils
pip install .
```
pip can also install directly from GitHub
```
python -m pip install git+https://github.com/MolecularAI/pysmilesutils.git
```
Alternatively, the package can be installed in developer mode, which leaves the source directory editable and the package instantly usable without the need to reinstall after every change.
```bash
pip install -e .
```
### Testing
Post-installation, the package can be tested with pytest.
```bash
cd tests
pytest
```
It is also recommended to run through the scripts in the examples directory.
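
Before running the test suite, a quick way to confirm that the installation is visible to the current interpreter is to look the module up on the import path. This is only a sketch; it assumes that the import name is `pysmilesutils`, matching the repository name, and the helper `is_installed` is introduced here for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be found on the current import path."""
    return importlib.util.find_spec(module_name) is not None


# "pysmilesutils" is assumed to be the import name of the installed package.
print("pysmilesutils importable:", is_installed("pysmilesutils"))
```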
## Documentation
Sphinx documentation can be built with e.g. the make.sh in the "docs" directory
```bash
./docs/make.sh
```
Moreover, the examples directory contains some #%% delimited scripts that show how to use the various classes. These scripts can be paired with Jupyter notebooks using the jupytext extension and are also VS Code compatible. #%% delimited scripts are much more Git-friendly than Jupyter notebooks.
The training example contains a full example on how to train a transformer model using different approaches for handling the conversion of the SMILES in the mini-batches.
## Contributing
We welcome contributions, in the form of issues or pull requests.
If you have a question or want to report a bug, please submit an issue.
To contribute with code to the project, follow these steps:
1. Fork this repository.
2. Create a branch: `git checkout -b <branch_name>`.
3. Make your changes and commit them: `git commit -m '<commit_message>'`
4. Push to the remote branch: `git push`
5. Create the pull request.
Please use the ``black`` package for formatting, and follow the ``pep8`` style guide.
## Contributors
* Esben Jannik Bjerrum
* Samuel Genheden
* Christos Kannas
* Tobias Rastemo
## License
The software is licensed under the Apache 2.0 license (see LICENSE file), and is free and provided as-is.
## References
Framework:
* Bjerrum, E., Rastemo, T., Irwin, R., Kannas, C. & Genheden, S. PySMILESUtils – Enabling deep learning with the SMILES chemical language. ChemRxiv (2021). [doi:10.33774/chemrxiv-2021-kzhbs](https://doi.org/10.33774/chemrxiv-2021-kzhbs)
Augmentation:
* Bjerrum, E. SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules. arXiv (2017). [https://arxiv.org/abs/1703.07076](https://arxiv.org/abs/1703.07076)