{"id":15974023,"url":"https://github.com/shihchengli/baseprop","last_synced_at":"2026-06-21T14:31:53.826Z","repository":{"id":250753096,"uuid":"835103319","full_name":"shihchengli/baseprop","owner":"shihchengli","description":"Baseline models for molecular property prediction","archived":false,"fork":false,"pushed_at":"2024-07-29T15:59:23.000Z","size":248,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-05T03:49:03.621Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shihchengli.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-29T07:01:32.000Z","updated_at":"2024-07-29T16:53:32.000Z","dependencies_parsed_at":"2024-07-29T23:54:58.158Z","dependency_job_id":null,"html_url":"https://github.com/shihchengli/baseprop","commit_stats":null,"previous_names":["shihchengli/baseprop"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shihchengli%2Fbaseprop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shihchengli%2Fbaseprop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shihchengli%2Fbaseprop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shihchengli%2Fbaseprop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shihchengli","download_url":"https://codeload.github.com/s
hihchengli/baseprop/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240331362,"owners_count":19784646,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-07T21:23:12.957Z","updated_at":"2026-06-21T14:31:53.820Z","avatar_url":"https://github.com/shihchengli.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Baseprop\nBaseline models for molecular property prediction. Currently, this package includes the GNN from “[Semi-supervised Classification with Graph Convolutional Networks](https://arxiv.org/abs/1609.02907)” and a traditional MLP as model architectures.\n\n# Installing Baseprop\n```bash\nconda create -n baseprop python=3.11\nconda activate baseprop\ngit clone https://github.com/shihchengli/baseprop.git\ncd baseprop\npip install -e .\n```\n\n# Using Baseprop\nBaseprop can be used either through the CLI or as a Python module.\n\n## CLI\nFour CLI commands are supported: `train`, `predict`, `hpopt`, and `nestedCV`. Below are some examples and descriptions of the arguments for each job. 
More details about other arguments can be found in the modules under the CLI folder.\n\n### `train`: model training\n```bash\nbaseprop train \\\n--data-path tests/data/freesolv.csv \\\n--task-type regression \\\n--output-dir train_example \\\n--smiles-columns smiles \\\n--target-columns freesolv \\\n--save-smiles-splits \\\n--split-type cv \\\n--num-folds 5 \\\n--molecule-featurizers morgan_binary\n```\n* `--data-path`: Path to an input CSV file containing SMILES and the associated target values.\n* `--task-type`: Type of dataset. This determines the default loss function used during training. Defaults to regression.\n* `--output-dir`: Directory where training outputs will be saved. Defaults to 'CURRENT_DIRECTORY/chemprop_training/STEM_OF_INPUT/TIME_STAMP'.\n* `--smiles-columns`: Names of the columns in the input CSV containing SMILES strings.\n* `--target-columns`: Names of the columns containing target values.\n* `--save-smiles-splits`: Save the SMILES of each train/val/test split to make later prediction more convenient.\n* `--split-type`: Method of splitting the data into train/val/test (case insensitive).\n* `--num-folds`: Number of folds when performing cross-validation.\n* `--molecule-featurizers`: Method(s) of generating molecule features to use as extra descriptors.\n\n### `predict`: model inference\n```bash\nbaseprop predict \\\n--test-path freesolv.csv \\\n--preds-path train_example/fold_0/test_preds.csv \\\n--target-columns freesolv \\\n--model-path train_example/fold_0/model_0/best.pt \\\n--molecule-featurizers morgan_binary\n```\n* `--test-path`: Path to an input CSV file containing SMILES.\n* `--preds-path`: Path to which predictions will be saved.\n* `--model-path`: Location of checkpoint(s) or model file(s) to use for prediction.\n\n### `hpopt`: hyperparameter optimization\n```bash\nbaseprop hpopt \\\n--data-path freesolv.csv \\\n--task-type regression \\\n--smiles-columns smiles \\\n--target-columns freesolv \\\n--raytune-num-samples 5 \\\n--raytune-temp-dir 
$RAY_TEMP_DIR \\\n--raytune-num-cpus 40 \\\n--raytune-num-gpus 2 \\\n--raytune-max-concurrent-trials 2 \\\n--search-parameter-keywords depth ffn_num_layers hidden_channels ffn_hidden_dim dropout lr batch_size \\\n--hyperopt-random-state-seed 42 \\\n--hpopt-save-dir $results_dir\n```\n* `--raytune-num-samples`: Passed directly to Ray Tune TuneConfig to control the number of trials to run.\n* `--raytune-temp-dir`: Passed directly to Ray Tune init to control the temporary directory.\n* `--raytune-num-cpus`: Passed directly to Ray Tune init to control the number of CPUs to use.\n* `--raytune-num-gpus`: Passed directly to Ray Tune init to control the number of GPUs to use.\n* `--raytune-max-concurrent-trials`: Passed directly to Ray Tune TuneConfig to control the maximum number of concurrent trials.\n* `--search-parameter-keywords`: The model parameters over which to search for an optimal hyperparameter configuration.\n* `--hyperopt-random-state-seed`: Passed directly to HyperOptSearch to control the random state seed.\n* `--hpopt-save-dir`: Directory in which to save the hyperparameter optimization results.\n\n### `nestedCV`: nested cross-validation (CV)\n```bash\nbaseprop nestedCV \\\n--data-path freesolv.csv \\\n--task-type regression \\\n--smiles-columns smiles \\\n--target-columns freesolv \\\n--raytune-num-samples 20 \\\n--raytune-temp-dir $RAY_TEMP_DIR \\\n--raytune-num-cpus 40 \\\n--raytune-num-gpus 2 \\\n--raytune-max-concurrent-trials 2 \\\n--search-parameter-keywords depth ffn_num_layers hidden_channels ffn_hidden_dim dropout lr batch_size \\\n--hyperopt-random-state-seed 42 \\\n--hpopt-save-dir $results_dir \\\n--split-type cv \\\n--num-folds 5\n```\n**Note**: Both the outer and inner CV loops use the number of folds set by `--num-folds`.\n\n# Python Module\nBaseprop can also be used as a Python module to run baseline benchmarks or more complicated jobs. 
For example, the examples folder contains a [notebook](https://github.com/shihchengli/baseprop/blob/main/examples/active_learning.ipynb) on active learning.\n\n# Relationship to Chemprop\nBaseprop is very similar to Chemprop, which uses a directed message passing neural network (D-MPNN) as the GNN model for chemical property prediction. Here, the GNN from “[Semi-supervised Classification with Graph Convolutional Networks](https://arxiv.org/abs/1609.02907)” is used as the baseline instead. A traditional MLP can also be used by passing `--features-only` together with `--molecule-featurizers`, so that only fingerprints are used as input to the MLP. I ([@shihchengli](https://github.com/shihchengli)) am also a developer of Chemprop, so I adopted most of the code from Chemprop. Sharing the same code base helps ensure a fair comparison when benchmarking D-MPNN against the baselines implemented in this package.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshihchengli%2Fbaseprop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshihchengli%2Fbaseprop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshihchengli%2Fbaseprop/lists"}