{"id":20690249,"url":"https://github.com/merck/deepneuralnet-qsar","last_synced_at":"2025-09-12T10:34:29.050Z","repository":{"id":90243140,"uuid":"81585711","full_name":"Merck/DeepNeuralNet-QSAR","owner":"Merck","description":null,"archived":false,"fork":false,"pushed_at":"2018-10-24T20:07:27.000Z","size":9491,"stargazers_count":64,"open_issues_count":1,"forks_count":27,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-03-29T16:51:14.282Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Merck.png","metadata":{"files":{"readme":"README.txt","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-02-10T16:48:40.000Z","updated_at":"2025-01-31T07:59:04.000Z","dependencies_parsed_at":"2023-05-11T13:35:39.180Z","dependency_job_id":null,"html_url":"https://github.com/Merck/DeepNeuralNet-QSAR","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2FDeepNeuralNet-QSAR","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2FDeepNeuralNet-QSAR/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2FDeepNeuralNet-QSAR/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2FDeepNeuralNet-QSAR/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Merck","download_url":"https://codeload.github.com/Merck/DeepNeuralNet-QSAR/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repos
itories_count":250284107,"owners_count":21405288,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T23:12:23.876Z","updated_at":"2025-04-22T16:55:39.695Z","avatar_url":"https://github.com/Merck.png","language":"Python","readme":"===================================================================\n============    DeepNeuralNet_QSAR Documentation     ==============\n===================================================================\n\nAuthors: Yuting Xu, Junshui Ma. \n\nContact: yuting.xu@merck.com, junshui_ma@merck.com.\n\nAffiliation: Merck Biometrics Research, Merck Sharp \u0026 Dohme Corp., a subsidiary of Merck \u0026 Co., Inc., Kenilworth, NJ, USA.\n\nDate: 02/07/2017\n\nAcknowledgement: \n\tThis set of code was developed based on George Dahl's Kaggle code in Dec. 2012.\n\nIf you use DeepNeuralNet_QSAR for scientific work that gets published, please include in that publication a citation of the paper below:\n\nXu, Yuting, Junshui Ma, Andy Liaw, Robert P. Sheridan, and Vladimir Svetnik. \"Demystifying Multitask Deep Neural Networks for Quantitative Structure–Activity Relationships.\" Journal of Chemical Information and Modeling 57, no. 
10 (2017): 2490-2504.\n\n\n===================================================================\nBasic info.\n===================================================================\n\nSystem requirements:\n* Python 2.7+\n* Required Python Modules: \n  - Python Modules installed by default: sys, os, argparse, itertools, gzip, time\n  - General Python Modules:\tnumpy, scipy.sparse \n  - Special Python Modules: gnumpy, cudamat (if using a GPU) or npmat (if using a multi-core CPU)\n* CUDA toolkit: a prerequisite of the cudamat Python module.\n\n\nInstallation of Special Python Modules:\n\t* gnumpy: http://www.cs.toronto.edu/~tijmen/gnumpy.html\n\t* npmat: http://www.cs.toronto.edu/~ilya/npmat.py\n\t* cudamat: https://github.com/cudamat/cudamat\n\nNote: \n  - Modules \"gnumpy\" and \"npmat\" are also provided in this distribution.\n  - If you do not have a GPU card or have problems installing the cudamat module, the npmat.py module will use a multi-core CPU to simulate the GPU computing. \n  - Create a directory for this DeepNeuralNet_QSAR module, and keep all the python scripts in that directory. \n\nUsage:\n* Start a command-line window (in Windows) or a terminal (in Linux), and run the python scripts. 
Please refer to the details below.\n\n\n===================================================================\nBrief explanation of all python files\n===================================================================\nAll the files are listed in alphabetical order, not ordered by importance.\nPlease find more detailed comments on all individual functions inside each python file.\n\n[activationFunctions.py]\n\tDefines several classes of common activation functions, such as ReLU/Linear/Sigmoid, along with their derivatives or error functions (if used for the output layer).\n\tUsed by [dnn.py]\n\n[counter.py]\n\tUtilizes sys.stderr to produce a progress bar for each training epoch.\n\tIncludes several different classes of progress bars, but only \"Progress\" and \"DummyProgBar\" are used.\n\tUsed by [dnn.py]\n\n[DeepNeuralNetPredict.py]\n\tFor making predictions for new compound structures with a single-task/multi-task DNN trained by DeepNeuralNetTrain.py or DeepNeuralNetTrain_dense.py. \n\n[DeepNeuralNetTrain.py]\n\tFor training a multi-task/single-task DNN with sparse QSAR dataset(s); accepts raw csv datasets or processed npz datasets.\n\n[DeepNeuralNetTrain_dense.py]\n\tFor training a multi-task DNN with dense QSAR dataset(s); accepts raw csv datasets or processed npz datasets.\n\t\n[dnn.py]\n\tKey components of a simple feed-forward neural network.\n\tUsed by [DeepNeuralNetTrain.py], [DeepNeuralNetPredict.py], [DeepNeuralNetTrain_multi.py] and [DeepNeuralNetPredict_multi.py]\n\n[DNNSharedFunc.py]\n\tA group of helper functions, such as calculating R-squared and writing predictions to file. \n\tUsed by many other files in the package.\n\n[gnumpy.py]\n\tA simple python module for GPU computing, the \"GPU version\" of the numpy module. \n\n[npmat.py]\n\tA simple python module required by gnumpy.py for the simulation mode. \n\tIf cudamat fails to import, npmat (CPU computing) is used instead. 
\n\n[processData_sparse.py], [processData_dense.py]\t\n\tPre-process a group of raw csv QSAR data sets (either sparse or dense) into a sparse-matrix python file format (saved as *.npz), \n\tto facilitate later use.\n\tContain many data-manipulation functions used by other files in the package.\n\t\n\t\n===================================================================\nHow to use - Example scripts\n===================================================================\n0) Prepare input datasets\n\t[sparse datasets]\n\t* Arrange all the datasets as in the examples in the \"data_sparse\" folder.\n\t* Example #1 (a subset of three tasks from the 15 Kaggle datasets): \n\t\t - Folder name: data_sparse\n\t\t - Contains several datasets, each with a training set and a test set: \n\t\t\t\tMETAB_training.csv METAB_test.csv   \n\t\t\t\tOX1_training.csv   OX1_test.csv   \n\t\t\t\tTDI_training.csv   TDI_test.csv  \n\t* Example #2 (a single task selected from the Kaggle datasets): \n\t\t - Folder name: data_sparse_single\n\t\t - Contains one pair of training set and test set:\n\t\t\t\tMETAB_training.csv METAB_test.csv   \n\n\t[dense datasets]\n\t* Arrange all the datasets as in the examples in the \"data_dense_raw\" folder.\n\t* Example (a subsample from the CYP datasets, which has 3 tasks): \n\t\t - Folder name: data_dense\n\t\t - Contains two datasets, one training set and one test set: \n\t\t\t\ttraining.csv  test.csv  \n\n1) Pre-process data (Optional, can be skipped.)\n\t* Preprocess sparse-format datasets: this creates a new folder \"data_sparse_processed\" under the working directory to save the processed data.\n\t\tpython processData_sparse.py data_sparse data_sparse_processed\n\n\t* Preprocess dense-format datasets: this creates a new folder \"data_dense_processed\" under the working directory to save the processed data; you need to tell it how many tasks there are in the dense dataset, such as \"3\" for the example datasets. 
\n\t\tpython processData_dense.py data_dense data_dense_processed 3\n\n2) Train a single-task DNN for one QSAR task\n\n\tThe default transformation of the inputs is log; the activation function is ReLU; the minibatch size is 128....\n\n\tThe key parameters that need to be specified by the user: \n\t - seed: random seed for the program. It is optional but recommended for reproducibility. \n\t - CV: (optional) proportion of the cross-validation subset, which is randomly sampled from the training set\n\t - test: (optional) whether to use the corresponding external test set for checking performance on the test set during training.\n\t - hid: DNN structure; specifies the number of nodes at each layer. \n\t - dropouts: the dropout probability for each layer, to prevent over-fitting. \n\t - epochs: number of epochs for training\n\t - data: path to the folder which contains the data for a single QSAR task; it may contain raw csv files or processed npz files\n\t - the last argument: where you want to save the trained model; if the folder doesn't exist, it will be created automatically\n\n\t* Example: use raw .csv data to train a single-task DNN for METAB; the corresponding processed .npz files will be automatically saved to the input data path\n\t\tpython DeepNeuralNetTrain.py --seed=0 --CV=0.4 --test --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse_single models/METAB_single\n\n\t* Example: use processed .npz data to train a single-task DNN for METAB (recommended; loading processed data is faster than loading raw data)\n\tParameters are the same as above. 
The processed datasets in the folder \"data_sparse_single\" were created in the last step.\n\t\tpython DeepNeuralNetTrain.py --seed=0 --CV=0.4 --test --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse_single models/METAB_single\n\n\t* Example: Without the optional 'CV' and 'test' arguments.\n\t\tpython DeepNeuralNetTrain.py --seed=0 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse_single models/METAB_single\n\n3) Prediction with a single-task DNN\n\t\n\tThe key parameters that need to be specified by the user: \n\t - model: the path to the previously trained model folder, e.g. \"models/METAB_single\" from step 2). \n\t - data: path to the folder which contains the data for a single QSAR task; it may contain raw csv files or processed npz files\n\t - label: whether the \"test\" dataset has true labels. The default is 0, but in this example it has true labels. \n\t - rep: (optional) number of dropout prediction rounds. The default is 0, which means dropout prediction is not performed.\n\t - seed: random seed for the program, useful for dropout prediction. Optional but recommended for reproducibility. \n\t - result: (optional) specify where to save the prediction results. 
The default is the same as the model folder.\n\n\t* Example: use the previously trained single-task DNN model for METAB to perform prediction on its test data\n\t\tpython DeepNeuralNetPredict.py --seed=0 --label=1 --rep=10 --data=data_sparse_single --model=models/METAB_single --result=predictions/METAB_single\n\n\t* Example: Without the optional 'seed', 'rep' and 'result' arguments:\n\t\tpython DeepNeuralNetPredict.py --label=1 --data=data_sparse_single --model=models/METAB_single\n\n4) Train a multi-task DNN for the sparse datasets\n\tThe processed datasets must be used, not the raw datasets.\n\tParameters that are different from the single-task DNN:\n\t - data: path to the data folder that stores all the QSAR datasets\n\t (Below are optional)\n\t - mbsz: the minibatch size; the default is 20, but for multi-task training it may be modified to achieve better results\n\t - keep: the datasets to keep in the model, if you don't want to include all datasets in the 'data' folder\n\t - watch: when using an internal cross-validation set or external test set, choose which task's MSE and R-squared to monitor\n\t - reducelearnRateVis: sometimes reducing the learning rate of the first layer helps the training process converge better\n\n\t* Example: a multi-task DNN to model all three sparse datasets: METAB, OX1, TDI\n\t\tpython DeepNeuralNetTrain.py --seed=0 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=5 --data=data_sparse models/multi_sparse_1\n\t\n\t* Example: load the previously trained model and continue the training process for more epochs. 
\n\t\tpython DeepNeuralNetTrain.py --seed=0 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse --loadModel=models/multi_sparse_1 models/multi_sparse_continue\n\n\t* Example: with more optional parameters, keep only the METAB and OX1 tasks and monitor OX1 task performance\n\t\tpython DeepNeuralNetTrain.py --seed=0 --CV=0.4 --test --mbsz=30 --keep=METAB --keep=OX1 --watch=OX1 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse models/multi_sparse_2\n\n5) Prediction with a multi-task DNN for the sparse datasets\n\tThe parameter settings are the same as for the single-task DNN on a sparse dataset. See step 3).\n\tOnly difference:\n\t- data: path to the data folder that stores all the processed datasets (including test datasets).\n\n\t* Example: prediction for all three sparse datasets with the model trained in the previous step; save results to the model folder:\n\t\tpython DeepNeuralNetPredict.py --label=1 --data=data_sparse --model=models/multi_sparse_1\n\n\t* Example: prediction with the model for METAB and OX1 trained in the previous step, with dropout prediction, saving results to another folder.\n\t\tpython DeepNeuralNetPredict.py --label=1 --seed=0 --rep=10 --data=data_sparse --model=models/multi_sparse_2 --result=predictions/multi_sparse_2\n\n6) Train a multi-task DNN for the dense datasets\n\tMost of the parameter settings are the same as for the multi-task DNN on sparse datasets\n\tDifference: use integer parameters for the 'keep' and 'watch' arguments\n\tThe key parameters that need to be specified by the user: \n\t - numberOfOutputs: number of QSAR task output columns in the raw training set (.csv)\n\n\t* Example: keep only the first two output tasks and monitor the first output during the training process, with an internal cross-validation set and an external test set, using raw data\n\t\tpython DeepNeuralNetTrain_dense.py --numberOfOutputs=3 --CV=0.4 --test --keep=0_1 --watch=0 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_dense 
models/multi_dense_1\n\n\t* Example: Without the optional arguments, using pre-processed data\n\tNote: for processed data, you don't need to specify \"--numberOfOutputs=3\"\n\t\tpython DeepNeuralNetTrain_dense.py --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_dense_processed models/multi_dense_2\n\n7) Prediction with a multi-task DNN for the dense datasets\n\tParameter settings are the same as for prediction on the sparse datasets\n\n\t* Example: Prediction using the trained DNN from the previous step\n\t\tpython DeepNeuralNetPredict.py --label=1 --dense --data=data_dense --model=models/multi_dense_1 --result=predictions/multi_dense_1\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmerck%2Fdeepneuralnet-qsar","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmerck%2Fdeepneuralnet-qsar","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmerck%2Fdeepneuralnet-qsar/lists"}