https://github.com/akensert/molcraft
Generative deep learning for molecules using transformers.
computational-chemistry deep-learning machine-learning molecules smiles transformers
Last synced: about 1 year ago
- Host: GitHub
- URL: https://github.com/akensert/molcraft
- Owner: akensert
- License: mit
- Created: 2025-01-24T13:08:26.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-07T18:17:40.000Z (over 1 year ago)
- Last Synced: 2025-02-07T19:24:14.110Z (over 1 year ago)
- Topics: computational-chemistry, deep-learning, machine-learning, molecules, smiles, transformers
- Language: Python
- Homepage:
- Size: 24.4 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README

**Transformers** with **TensorFlow** and **Keras**. Focused on **Molecule Generation** and **Chemistry Predictions**.
> [!NOTE]
> In progress.
## Highlights
molcraft aims to implement efficient models, samplers, and (soon) reinforcement learning for [SMILES](https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Line_Entry_System) generation and optimization.
- [Models](https://github.com/akensert/molcraft/blob/main/molcraft/models.py) / [Layers](https://github.com/akensert/molcraft/blob/main/molcraft/layers.py)
  - Implements **key-value caching** for efficient autoregression
- [Samplers](https://github.com/akensert/molcraft/blob/main/molcraft/samplers.py)
  - Samples [Models](https://github.com/akensert/molcraft/blob/main/molcraft/models.py) for next tokens
  - Can **generate** a batch of **sequences** in parallel **non-eagerly**
  - Can generate a batch of sequences based on **initial sequences of varying lengths**
- [Tokenizers](https://github.com/akensert/molcraft/blob/main/molcraft/tokenizers.py)
  - Tokenizes data input for [Models](https://github.com/akensert/molcraft/blob/main/molcraft/models.py)
  - Can be **adapted** to data via `tokenizer.adapt(ds)` to build vocabulary
  - Can be added as a layer to `keras.Sequential`
  - Can both **tokenize** and **detokenize** data
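
To make the adapt / tokenize / detokenize cycle concrete, here is a minimal pure-Python sketch. It is not molcraft's `SMILESTokenizer`: the class name `ToySmilesTokenizer`, the simplified regular expression, and the special-token names are all assumptions made for illustration.

```python
import re

# Simplified SMILES pattern (an assumption, not molcraft's): bracket atoms,
# two-letter halogens, two-digit ring closures, then any single character.
SMILES_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


class ToySmilesTokenizer:
    """Illustrative tokenizer; hypothetical, not part of molcraft."""

    SPECIAL_TOKENS = ["[PAD]", "[BOS]", "[EOS]", "[UNK]"]

    def __init__(self):
        self.vocabulary = list(self.SPECIAL_TOKENS)
        self.token_to_id = {t: i for i, t in enumerate(self.vocabulary)}

    def split(self, smiles):
        # Break a SMILES string into chemically meaningful tokens.
        return SMILES_PATTERN.findall(smiles)

    def adapt(self, data):
        # Build the vocabulary from every token seen in the data
        # (sorted so that token ids are deterministic).
        for token in sorted({t for s in data for t in self.split(s)}):
            if token not in self.token_to_id:
                self.token_to_id[token] = len(self.vocabulary)
                self.vocabulary.append(token)

    def tokenize(self, smiles):
        # Map tokens to ids, wrapping the sequence in BOS/EOS markers.
        unk = self.token_to_id["[UNK]"]
        ids = [self.token_to_id.get(t, unk) for t in self.split(smiles)]
        return [self.token_to_id["[BOS]"]] + ids + [self.token_to_id["[EOS]"]]

    def detokenize(self, ids):
        # Drop special tokens and join the rest back into a string.
        n_special = len(self.SPECIAL_TOKENS)
        return "".join(self.vocabulary[i] for i in ids if i >= n_special)


tokenizer = ToySmilesTokenizer()
tokenizer.adapt(["CCO", "c1ccccc1Cl", "[NH4+]"])
ids = tokenizer.tokenize("CCCl")
print(ids, tokenizer.detokenize(ids))
```

Matching `Cl` and `Br` before the single-character fallback keeps halogens as one token instead of splitting them into `C` + `l`.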
## Code Examples
```python
import tensorflow as tf
import keras
import random

from molcraft import tokenizers
from molcraft import models
from molcraft import samplers

filename = './data/zinc250K.txt'  # replace this with actual path

# Read one SMILES string per line
with open(filename, 'r') as fh:
    smiles = fh.read().splitlines()

random.shuffle(smiles)

# Adapt tokenizer (create vocabulary)
tokenizer = tokenizers.SMILESTokenizer(add_bos=True, add_eos=True)
tokenizer.adapt(smiles)

# Build dataset (input pipeline)
ds = tf.data.Dataset.from_tensor_slices(smiles)
ds = ds.shuffle(8192)
ds = ds.batch(256)
ds = ds.map(tokenizer)
# Inputs are all tokens but the last; targets are the same tokens shifted by one
ds = ds.map(lambda x: (x[:, :-1], x[:, 1:]))
ds = ds.prefetch(tf.data.AUTOTUNE)

# Build, compile, and fit model
model = models.TransformerDecoder(
    num_layers=4,
    num_heads=8,
    embedding_dim=512,
    intermediate_dim=1024,
    vocabulary_size=tokenizer.vocabulary_size,
    sequence_length=tokenizer.sequence_length,
    dropout=0,
)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=3e-4),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(ds, epochs=1)

# Generate 32 novel SMILES with sampler, starting from empty initial sequences
sampler = samplers.TopKSampler(model, tokenizer)
generated_smiles = sampler.sample([''] * 32)
```
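
The example above ends with a `TopKSampler`. As a rough illustration of the general top-k technique (not molcraft's implementation, whose internals are not shown here), the sketch below samples a single next-token id from a list of logits using only the standard library. The function name `top_k_sample` and the parameters `k`, `temperature`, and `rng` are assumptions for this sketch.

```python
import math
import random


def top_k_sample(logits, k=5, temperature=1.0, rng=random):
    """Sample one token id from the k highest-scoring logits (generic sketch)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the k largest logits; all other tokens get zero probability.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits, shifted by the max for numerical stability.
    scaled = [logits[i] / temperature for i in top]
    max_scaled = max(scaled)
    weights = [math.exp(s - max_scaled) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.1, 5.0, -2.0, 4.0, 0.0]
rng = random.Random(0)
print([top_k_sample(logits, k=2, rng=rng) for _ in range(10)])
```

With `k=2`, only the ids 1 and 3 (the two largest logits) can ever be drawn; with `k=1` the procedure reduces to greedy decoding.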
## Installation
> [!NOTE]
> The project is under development, so it is incomplete and subject to breaking changes.
For GPU users:
```
git clone git@github.com:akensert/molcraft.git
cd molcraft
pip install -e ".[gpu]"
```
For CPU users:
```
git clone git@github.com:akensert/molcraft.git
cd molcraft
pip install -e .
```