https://github.com/thomas0809/textreact
Predictive Chemistry Augmented with Text Retrieval
- Host: GitHub
- URL: https://github.com/thomas0809/textreact
- Owner: thomas0809
- Created: 2023-02-23T19:08:01.000Z
- Default Branch: main
- Last Pushed: 2024-02-20T19:34:44.000Z
- Last Synced: 2025-03-25T22:36:11.991Z
- Language: Python
- Size: 11.2 MB
- Stars: 21
- Watchers: 1
- Forks: 3
- Open Issues: 1
# TextReact
This repository contains the code for [TextReact](https://aclanthology.org/2023.emnlp-main.784/), a novel method that directly augments
predictive chemistry with text retrieval.

```bibtex
@inproceedings{TextReact,
  author    = {Yujie Qian and
               Zhening Li and
               Zhengkai Tu and
               Connor W. Coley and
               Regina Barzilay},
  title     = {Predictive Chemistry Augmented with Text Retrieval},
  booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural
               Language Processing, {EMNLP} 2023, Singapore, December 6-10, 2023},
  pages     = {12731--12745},
  publisher = {Association for Computational Linguistics},
  year      = {2023},
  url       = {https://aclanthology.org/2023.emnlp-main.784}
}
```
## Requirements
We implement the code with `torch==1.11.0`, `pytorch-lightning==2.0.0`, and `transformers==4.27.3`.
To reproduce our experiments, we recommend creating a conda environment with the same dependencies:
```bash
conda env create -f environment.yml -n textreact
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
```
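
As a quick sanity check after setting up the environment, the pinned versions above can be compared against what is actually installed. The following is a minimal sketch and not part of the repository: the helper name `check_versions` is hypothetical, and it assumes the packages are installed under the distribution names `torch`, `pytorch-lightning`, and `transformers`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions quoted in this README; the dictionary itself is illustrative.
EXPECTED = {
    "torch": "1.11.0",
    "pytorch-lightning": "2.0.0",
    "transformers": "4.27.3",
}


def check_versions(expected):
    """Return {package: (expected, installed)} for every mismatch.

    `installed` is None when the package is not installed at all.
    Local build tags such as "+cu113" are ignored in the comparison.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Calling `check_versions(EXPECTED)` inside the activated environment should return an empty dictionary if the installation matches the versions listed above.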
## Data
Run the following commands to download and unzip the preprocessed datasets:
```bash
git clone https://huggingface.co/datasets/yujieq/TextReact data
cd data
unzip '*.zip'
```
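
Where the `unzip` utility is unavailable, the same extraction step can be done with Python's standard library. This is a minimal sketch, assuming the cloned `data` directory holds `.zip` archives at its top level; the helper name `extract_all` is hypothetical and not part of the repository.

```python
import zipfile
from pathlib import Path


def extract_all(data_dir):
    """Extract every top-level .zip archive in `data_dir` in place.

    Returns the sorted list of archive names that were extracted.
    Files without a .zip extension are left untouched.
    """
    data_dir = Path(data_dir)
    extracted = []
    for archive in sorted(data_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(data_dir)
        extracted.append(archive.name)
    return extracted
```

For example, `extract_all("data")` run from the repository root would unpack the archives next to where they were downloaded.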
## Training Scripts
TextReact consists of two modules: a SMILES-to-text retriever and
a text-augmented predictor. This repository contains only the code for
training the predictor; the code for the retriever is available in
a separate repository: https://github.com/thomas0809/tevatron.

The training scripts are located under [`scripts`](scripts):
* [`train_RCR.sh`](scripts/train_RCR.sh) trains a model for reaction condition recommendation (RCR)
on the random split of the USPTO dataset.
* [`train_RetroSyn_tf.sh`](scripts/train_RetroSyn_tf.sh) trains a template-free model for retrosynthesis
on the random split of the USPTO-50K dataset.
* [`train_RetroSyn_tb.sh`](scripts/train_RetroSyn_tb.sh) trains a template-based model for retrosynthesis
on the random split of the USPTO-50K dataset.

In addition, [`train_RCR_TS.sh`](scripts/train_RCR_TS.sh), [`train_RetroSyn_tf_TS.sh`](scripts/train_RetroSyn_tf_TS.sh)
and [`train_RetroSyn_tb_TS.sh`](scripts/train_RetroSyn_tb_TS.sh) train the corresponding models
on the time-based split of the respective dataset.

If you're working on a distributed file system, we recommend adding
a `--cache_path` option to the script that points to a local path, which reduces network time.

To run a script `scripts/train_MODEL.sh`, where `MODEL` stands for one of the names above (for example `RCR` or `RetroSyn_tf_TS`), run the following command from the root of the repository:
```
bash scripts/train_MODEL.sh
```
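
The naming scheme of the six scripts listed above can be captured in a small launcher. This is a sketch only: `script_path` and `run_training` are hypothetical helpers, not part of the repository, and they assume the command is issued from the repository root.

```python
import subprocess

# Task names taken from the script filenames listed in this README.
TASKS = ("RCR", "RetroSyn_tf", "RetroSyn_tb")


def script_path(task, time_split=False):
    """Build the path of a training script from its task name and split."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    suffix = "_TS" if time_split else ""
    return f"scripts/train_{task}{suffix}.sh"


def run_training(task, time_split=False):
    """Run the chosen script with bash; raises CalledProcessError on failure."""
    subprocess.run(["bash", script_path(task, time_split)], check=True)
```

For instance, `run_training("RetroSyn_tf", time_split=True)` would invoke `bash scripts/train_RetroSyn_tf_TS.sh`.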
At the end of training, two dictionaries are printed with the top-k test accuracies.
The first one corresponds to retrieving from the full corpus
and the second one corresponds to retrieving from the gold-removed corpus.
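
For readers unfamiliar with the metric, top-k accuracy counts a test example as correct when the ground-truth answer appears among the model's k highest-ranked predictions. The sketch below illustrates the definition only. It is not the repository's evaluation code, and the function name, the input layout (one ranked list of strings per example), and the exact-string matching are assumptions made for illustration.

```python
def top_k_accuracy(ranked_predictions, golds, ks=(1, 3, 5, 10)):
    """Return {k: fraction of examples whose gold answer is in the top k}.

    ranked_predictions: list of lists, best candidate first.
    golds: list of ground-truth answers, aligned with ranked_predictions.
    """
    if len(ranked_predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return {k: 0.0 for k in ks}
    hits = {k: 0 for k in ks}
    for candidates, gold in zip(ranked_predictions, golds):
        for k in ks:
            if gold in candidates[:k]:
                hits[k] += 1
    return {k: hits[k] / len(golds) for k in ks}


# Toy data: the first gold answer is ranked first, the second is ranked third.
preds = [["CCO", "CCN", "CCC"], ["c1ccccc1", "CO", "CC"]]
golds = ["CCO", "CC"]
print(top_k_accuracy(preds, golds, ks=(1, 3)))  # {1: 0.5, 3: 1.0}
```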

Models and test predictions are stored under the path specified by the `SAVE_PATH` variable in the script:
* `best.ckpt` is the checkpoint with the highest validation accuracy so far.
* `last.ckpt` is the last checkpoint.
* `prediction_test_0.json` contains the test predictions when retrieving from the full corpus.
* `prediction_test_1.json` contains the test predictions when retrieving from the gold-removed corpus.
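
The prediction files are JSON, so they can be inspected with the standard library. This README does not document their schema, so the sketch below deliberately assumes nothing about it and only reports the top-level structure. `describe_json` is a hypothetical helper, not part of the repository.

```python
import json


def describe_json(path):
    """Load a JSON file and return (top-level type name, number of entries).

    The entry count is None unless the top-level value is a list or dict.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    size = len(data) if isinstance(data, (list, dict)) else None
    return type(data).__name__, size
```

A call such as `describe_json("SAVE_PATH/prediction_test_0.json")`, with `SAVE_PATH` replaced by the value set in the training script, is a reasonable first step before writing any schema-specific analysis.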