{"id":13958636,"url":"https://github.com/ChenDdon/AGBTcode","last_synced_at":"2025-07-21T00:31:36.430Z","repository":{"id":187328490,"uuid":"310482226","full_name":"ChenDdon/AGBTcode","owner":"ChenDdon","description":null,"archived":false,"fork":false,"pushed_at":"2023-08-29T18:13:48.000Z","size":4514,"stargazers_count":27,"open_issues_count":3,"forks_count":6,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-11-28T02:34:57.290Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ChenDdon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-11-06T03:32:23.000Z","updated_at":"2024-05-17T07:42:23.000Z","dependencies_parsed_at":null,"dependency_job_id":"29b247ef-820a-4435-9067-734da0d0b6f4","html_url":"https://github.com/ChenDdon/AGBTcode","commit_stats":null,"previous_names":["chenddon/agbtcode"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/ChenDdon/AGBTcode","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenDdon%2FAGBTcode","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenDdon%2FAGBTcode/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenDdon%2FAGBTcode/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenDdon%2FAGBTcode/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ChenDdon","download_url":"https://codeload.github.com/ChenDdon/AGBTcode/tar.gz/refs/h
eads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenDdon%2FAGBTcode/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266221272,"owners_count":23894966,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-08T13:01:47.104Z","updated_at":"2025-07-21T00:31:31.415Z","avatar_url":"https://github.com/ChenDdon.png","language":"Python","readme":"# Algebraic Graph-assisted Bidirectional Transformers (AGBT)\n\nImplementation of the paper \"Algebraic Graph-assisted Bidirectional Transformers for Molecular Property Prediction\" by Dong Chen, Kaifu Gao, Duc Duy Nguyen, Xin Chen, Yi Jiang, Guo-Wei Wei\u003csup\u003e+\u003c/sup\u003e and Feng Pan\\*\n\n---\n\n![model_framework](./model_framework.png)\n\n---\n\n## \u003cspan style=\"color: red;\"\u003eNew interface (update: 08-2023) \u003c/span\u003e \n\n\u003e New feature: if you only want to extract molecular features from the deep network, the following process eliminates the need to compile the code base.\n\nPython Dependencies (higher versions should be fine):\n  - python                    3.9.12\n  - pytorch                   1.13.1\n  - fairseq                   0.12.2\n  - numpy                     1.21.5\n\n\u003e Once the Python dependencies are satisfied, there is no need to compile the entire code base.\n\n```shell\n# Download the pre-trained model\nwget -P ./examples/models/ https://weilab.math.msu.edu/AGBT_Source/checkpoint_pretrained.pt\n\n# Generate the features\n# '--feature_type': 'bos' for the beginning-of-sequence symbol's embedding; 'avg' for the 
average of all symbol embeddings.\npython \"./agbt_pro/generate_bt_fps_new.py\" --model_name_or_path \"./examples/models/\" --checkpoint_file \"checkpoint_best.pt\" --smi_file \"./examples/data/example_train_canonical.smi\" --save_feature_path \"./examples/BT_FPs/examples_bt_train_features.npy\" --feature_type bos\n```\n\nThe new interface is also available [here](https://github.com/WeilabMSU/PretrainModels).\n\n---\n\n---\n\n## Requirements\n\nOS Requirements\n- CentOS Linux 7 (Core)\n\nPython Dependencies\n- setuptools (\u003e=18.0)\n- python (\u003e=3.7)\n- pytorch (\u003e=1.2)\n- rdkit (2020.03)\n- biopandas (0.2.7)\n- numpy (1.17.4)\n- scikit-learn (0.23.2)\n- scipy (1.5.2)\n- pandas (0.25.3)\n\n\n## Installation Guide\n\nInstall from GitHub:\n\n```shell\ngit clone https://github.com/ChenDdon/AGBTcode.git\ncd AGBTcode/agbt_pro\nmkdir agbt_pro\nmkdir agbt_pro/fairseq\nmkdir agbt_pro/fairseq/data\npython setup.py build_ext --inplace\nmv ./agbt_pro/fairseq/data/* ./fairseq/data/\n```\n\nInstallation should take about 60 seconds.\n\n## Downloading Pre-trained Models\n\nThe pre-trained model is publicly available.\n\n```shell\n# Download the pre-trained model\nwget -P ./examples/models/ https://weilab.math.msu.edu/AGBT_Source/checkpoint_pretrained.pt\n\n```\n\n## Pre-training settings\n\nThe pre-training dataset used in this work is ChEMBL26, which is available at chembl.gitbook.io/chembl-interface-documentation/downloads.\n\nThe ChEMBL26 dataset contains 1936342 samples. In this work, we divided it into a training set (1926342 samples) and a validation set (10000 samples). 
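The training/validation split described above can be reproduced with a few lines of Python before pre-processing; the sketch below is an illustration only (the function name, file names, and seed are assumptions, not part of the released pipeline):\n\n```python\nimport random\n\ndef split_smi(in_path, train_path, valid_path, n_valid=10000, seed=0):\n    # Hold out n_valid randomly chosen SMILES lines for validation;\n    # the remaining lines become the training set.\n    with open(in_path) as f:\n        lines = [ln.strip() for ln in f if ln.strip()]\n    random.Random(seed).shuffle(lines)\n    with open(valid_path, 'w') as f:\n        f.write('\n'.join(lines[:n_valid]) + '\n')\n    with open(train_path, 'w') as f:\n        f.write('\n'.join(lines[n_valid:]) + '\n')\n```\n\nThe resulting chembl26_train.smi and chembl26_valid.smi files can then be passed to preprocess.py as in the commands below. 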
\n\n```shell\n# Suppose the pre-training data files are named chembl26_train.smi and chembl26_valid.smi\n# First, pre-processing\npython \"./agbt_pro/preprocess.py\" --only-source --trainpref \"chembl26_train.smi\" --validpref \"chembl26_valid.smi\" --destdir \"./examples/data/chembl26/\" --trainoutf \"train\" --validoutf \"valid\"  --workers 20 --file-format smiles\n\n# Pre-training command\npython \"./agbt_pro/train.py\" \"./examples/data/chembl26/\" --train-subset \"train\" --valid-subset \"valid\" --save-dir \"./examples/models/\" --task masked_lm --arch roberta_base --encoder-attention-heads 8 --encoder-embed-dim 512 --encoder-ffn-embed-dim 1024 --encoder-layers 8 --dropout 0.1 --attention-dropout 0.1 --criterion masked_lm --sample-break-mode complete --tokens-per-sample 256 --skip-invalid-size-inputs-valid-test --optimizer adam --adam-betas '(0.9,0.999)' --adam-eps 1e-6 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 0.0001 --weight-decay 0.1 --warmup-updates 5000 --total-num-update 1000000 --max-update 1000000 --save-interval 100 --save-interval-updates 100000 --log-format simple --log-interval 2000 --max-sentences 64 --update-freq 2 --ddp-backend no_c10d --fp16 --reset-optimizer --reset-dataloader --reset-meters\n\n# The pre-trained model will be saved as ./examples/models/checkpoint_best.pt\n```\n\n## Reproduction instructions\n\n- The generated AGBT-FPs are available at https://weilab.math.msu.edu/AGBT_Source/AGBT_FPs.zip.\n\n```shell\nwget https://weilab.math.msu.edu/AGBT_Source/AGBT_FPs.zip\n```\n\n- One of the trained task-specific neural network-based models can be downloaded from https://weilab.math.msu.edu/AGBT_Source/downstream_nn_models.tar.gz. The GBDT and RF models can be trained within 10 minutes, and the specific parameters are shown in the \"AGBT model parametrization\" section (Table S3) of the Supporting Information. 
To eliminate systematic errors in the machine learning models, for each machine learning algorithm the consensus of the predicted values from 20 different models (generated with different random seeds) was taken for each molecule. Note that the consensus value here refers to the average of the predicted results from the different models, for each molecule in each specific training-test split.\n\n```shell\nwget https://weilab.math.msu.edu/AGBT_Source/downstream_nn_models.tar.gz\n```\n\n- All parameter settings for the training process are given in the \"AGBT model parametrization\" section of the Supporting Information.\n\n\n## Customize task-specific AGBT-FPs\n\nFor users who want to build a new task-specific model from a set of molecules with corresponding properties, we provide scripts for generating AG-FPs, BT-FPs, and AGBT-FPs. By default, we use a supervised learning-based strategy to fine-tune the pre-trained model. Each example molecule includes a MOL2 file and the corresponding SMILES string. 
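The fine-tuning step below pairs each SMILES in the .smi file with the label on the same line of the corresponding .label file, so the two files are assumed to be line-aligned. A minimal sketch for writing such a pair from a list of (smiles, property) records (the function name is an assumption):\n\n```python\ndef write_smi_label(records, smi_path, label_path):\n    # records: iterable of (smiles, property_value) pairs.\n    # Writes a SMILES file and a line-aligned label file: the i-th\n    # label belongs to the i-th SMILES.\n    with open(smi_path, 'w') as fs, open(label_path, 'w') as fl:\n        for smiles, value in records:\n            fs.write(smiles + '\n')\n            fl.write(str(value) + '\n')\n```\n\nFor example, write_smi_label(pairs, 'example_train_canonical.smi', 'example_train.label') would produce files matching the example names used below. 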
The following steps need to be performed on a platform that supports GPU computing.\n\n```shell\n# Generate Bidirectional Transformer-based Fingerprints (BT-FPs)\n\n# step 1, download the pre-trained model\nwget -P ./examples/models/ https://weilab.math.msu.edu/AGBT_Source/checkpoint_pretrained.pt\n\n# step 2, pre-process input data (binarize the input data to speed up training)\nmkdir \"./examples/data/input0\"\npython \"./agbt_pro/preprocess.py\" --only-source --trainpref \"./examples/data/example_train_canonical.smi\" --validpref \"./examples/data/example_valid_canonical.smi\" --destdir \"./examples/data/input0/\" --trainoutf \"train\" --validoutf \"valid\"  --workers 20 --file-format smiles --srcdict \"./examples/data/input0/dict.txt\"\n\n# step 3, fine-tune the pre-trained model\nmkdir \"./examples/data/label\"\ncp \"./examples/data/example_train.label\" \"./examples/data/label/train.label\"\ncp \"./examples/data/example_valid.label\" \"./examples/data/label/valid.label\"\npython \"./agbt_pro/train.py\" \"./examples/data/\" --save-dir \"./examples/models/\" --train-subset train --valid-subset valid --restore-file \"./examples/models/checkpoint_pretrained.pt\" --task sentence_prediction --num-classes 1 --regression-target --init-token 0 --best-checkpoint-metric loss --arch roberta_base --bpe smi --encoder-attention-heads 8 --encoder-embed-dim 512 --encoder-ffn-embed-dim 1024 --encoder-layers 8 --dropout 0.1 --attention-dropout 0.1  --criterion sentence_prediction --max-positions 256 --truncate-sequence --skip-invalid-size-inputs-valid-test --optimizer adam --adam-betas '(0.9,0.999)' --adam-eps 1e-6 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 0.0001 --warmup-updates 500 --total-num-update 5000 --weight-decay 0.1 --max-update 5000 --log-format simple --reset-optimizer --reset-dataloader --reset-meters --no-epoch-checkpoints --no-last-checkpoints --no-save-optimizer-state --find-unused-parameters --log-interval 50 --max-sentences 64 --update-freq 2 
--required-batch-size-multiple 1 --ddp-backend no_c10d --fp16 --max-epoch 5000\n\n# step 4, generate BT-FPs\nmkdir \"./examples/BT_FPs/\"\npython \"./agbt_pro/generate_bt_fps.py\" --model_name_or_path \"./examples/models/\" --checkpoint_file \"checkpoint_best.pt\" --data_name_or_path  \"./examples/data/\" --dict_file \"./examples/data/dict.txt\" --target_file \"./examples/data/example_train_canonical.smi\" --save_feature_path \"./examples/BT_FPs/examples_bt_train_features.npy\"\npython \"./agbt_pro/generate_bt_fps.py\" --model_name_or_path \"./examples/models/\" --checkpoint_file \"checkpoint_best.pt\" --data_name_or_path  \"./examples/data/\" --dict_file \"./examples/data/dict.txt\" --target_file \"./examples/data/example_valid_canonical.smi\" --save_feature_path \"./examples/BT_FPs/examples_bt_valid_features.npy\"\n```\n\n```shell\n## Generate Algebraic Graph-based Fingerprints (AG-FPs)\nmkdir \"./examples/AG_FPs/\"\n\n# step 1. Laplacian, Lorentz\npython \"./ag_pro/AG_main.py\" --dataset_prefix 'example_train' --dataset_path './examples/data/example_train_x_mol2' --dataset_id_path './examples/data/example_train.id' --save_feature_path_prefix './examples/AG_FPs' --matrix_type 'Lap' --kernal_type 'Lorentz' --kernal_tau 0.5 --kernal_parameter 10.0\npython \"./ag_pro/AG_main.py\" --dataset_prefix 'example_valid' --dataset_path './examples/data/example_valid_x_mol2' --dataset_id_path './examples/data/example_valid.id' --save_feature_path_prefix './examples/AG_FPs' --matrix_type 'Lap' --kernal_type 'Lorentz' --kernal_tau 0.5 --kernal_parameter 10.0\n# step 2. 
Laplacian, Exponential\npython \"./ag_pro/AG_main.py\" --dataset_prefix 'example_train' --dataset_path './examples/data/example_train_x_mol2' --dataset_id_path './examples/data/example_train.id' --save_feature_path_prefix './examples/AG_FPs' --matrix_type 'Lap' --kernal_type 'Exponential' --kernal_tau 0.5 --kernal_parameter 20.0\npython \"./ag_pro/AG_main.py\" --dataset_prefix 'example_valid' --dataset_path './examples/data/example_valid_x_mol2' --dataset_id_path './examples/data/example_valid.id' --save_feature_path_prefix './examples/AG_FPs' --matrix_type 'Lap' --kernal_type 'Exponential' --kernal_tau 0.5 --kernal_parameter 20.0\n```\n\nNote: \"kernal_type\", \"kernal_tau\", and \"kernal_parameter\" can be tuned according to performance on the specific task.\n\n```shell\n## Generate algebraic graph-assisted bidirectional transformer-based Fingerprints (AGBT-FPs)\nmkdir \"./examples/AGBT-FPs/\"\npython \"./agbt_pro/feature_analysis.py\" --train_x_f1 \"./examples/AG_FPs/example_train_Lap_Lorentz_10.0_tau_0.5.npy\" --train_x_f2 \"./examples/AG_FPs/example_train_Lap_Exponential_20.0_tau_0.5.npy\" --train_x_f3 \"./examples/BT_FPs/examples_bt_train_features.npy\" --train_y \"./examples/data/example_train_y.npy\" --test_x_f1 \"./examples/AG_FPs/example_valid_Lap_Lorentz_10.0_tau_0.5.npy\" --test_x_f2 \"./examples/AG_FPs/example_valid_Lap_Exponential_20.0_tau_0.5.npy\" --test_x_f3 \"./examples/BT_FPs/examples_bt_valid_features.npy\" --test_y \"./examples/data/example_valid_y.npy\" --features_norm --save_folder_path \"./examples/AGBT-FPs/\" --n_estimators 10000 --n_workers -1 --max_depth 7 --min_samples_split 3 --random_seed 1234 --n_select_features 512\n```\n\nFor the data in the example, the entire process took less than 40 minutes.\n\n\u003ca name=\"Note\"\u003e\u003c/a\u003e\n## Note\n\n(Update: 2021-11) For those interested in pre-trained models **(BT-FPs)**, we provide three recently updated pre-trained models. 
These include models based on the ChEMBL27 (1.9 million), PubChem (over 0.1 billion), and ZINC (over 0.6 billion) datasets. The source code and models are publicly available at https://github.com/WeilabMSU/PretrainModels\n\n(Update: 2022-06) The data used in this work has been migrated. Users can download the datasets at https://weilab.math.msu.edu/DataLibrary/3D/.\n\n\n## License\n\nAll code released in this study is under the MIT License.\n","funding_links":[],"categories":["Molecule"],"sub_categories":["Web Services_Other"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FChenDdon%2FAGBTcode","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FChenDdon%2FAGBTcode","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FChenDdon%2FAGBTcode/lists"}