{"id":15646293,"url":"https://github.com/jacksonburns/astartes","last_synced_at":"2025-08-21T06:31:22.721Z","repository":{"id":40362601,"uuid":"484260856","full_name":"JacksonBurns/astartes","owner":"JacksonBurns","description":"Better Data Splits for Machine Learning","archived":false,"fork":false,"pushed_at":"2024-04-03T17:37:11.000Z","size":21390,"stargazers_count":49,"open_issues_count":15,"forks_count":2,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-05-14T00:08:38.016Z","etag":null,"topics":["ai","data-science","machine-learning","ml","python","sampling"],"latest_commit_sha":null,"homepage":"https://jacksonburns.github.io/astartes/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JacksonBurns.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-04-22T01:48:06.000Z","updated_at":"2024-05-13T01:30:30.000Z","dependencies_parsed_at":"2023-10-11T18:44:32.064Z","dependency_job_id":"5710d820-a951-4b16-99c3-3fcd4bd2065d","html_url":"https://github.com/JacksonBurns/astartes","commit_stats":{"total_commits":560,"total_committers":7,"mean_commits":80.0,"dds":0.4267857142857143,"last_synced_commit":"09aa54bbba7063ff8a94f1dfa8f71adc7c07688f"},"previous_names":[],"tags_count":15,"template":false,"template_full_name":"JacksonBurns/blank-python-project","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JacksonBurns%2Fastartes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JacksonBurns%2Fastartes/tags","releases_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/JacksonBurns%2Fastartes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JacksonBurns%2Fastartes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JacksonBurns","download_url":"https://codeload.github.com/JacksonBurns/astartes/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230367780,"owners_count":18215325,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","data-science","machine-learning","ml","python","sampling"],"created_at":"2024-10-03T12:12:19.676Z","updated_at":"2025-08-21T06:31:22.712Z","avatar_url":"https://github.com/JacksonBurns.png","language":"Python","readme":"\u003ch1 align=\"center\"\u003eastartes\u003c/h1\u003e \n\u003ch2 align=\"center\"\u003e\u003cem\u003e(as-tar-tees)\u003c/em\u003e\u003c/h2\u003e\n\u003ch3 align=\"center\"\u003eTrain:Validation:Test Algorithmic Sampling for Molecules and Arbitrary Arrays\u003c/h3\u003e\n\n\u003cp align=\"center\"\u003e  \n  \u003cimg alt=\"astarteslogo\" src=\"https://raw.githubusercontent.com/JacksonBurns/astartes/main/astartes_logo.png\"\u003e\n\u003c/p\u003e \n\u003cdiv align=\"center\"\u003e\n  \u003ctable\u003e\n    \u003ccaption\u003e\u003cp style=\"font-weight:bold\"\u003eStatus Badges\u003c/p\u003e\u003c/caption\u003e\n    \u003ctr\u003e\n      \u003cth\u003eUsage\u003c/th\u003e\n      \u003cth\u003eContinuous Integration\u003c/th\u003e\n      \u003cth\u003eRelease\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd\u003e\u003cimg 
alt=\"PyPI - Python Version\" src=\"https://img.shields.io/pypi/pyversions/astartes?style=plastic\"\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cimg alt=\"Reproduce Paper\" src=\"https://github.com/JacksonBurns/astartes/actions/workflows/reproduce_paper.yml/badge.svg?branch=main\u0026event=schedule\"\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003ca href=\"https://github.com/pyOpenSci/software-review/issues/120\"\u003e\u003cimg src=\"https://tinyurl.com/y22nb8up\" alt=\"pyOpenSci approved\" /\u003e\u003c/a\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd\u003e\u003cimg alt=\"PyPI - License\" src=\"https://img.shields.io/github/license/JacksonBurns/astartes\"\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cimg alt=\"Test Status\" src=\"https://github.com/JacksonBurns/astartes/actions/workflows/ci.yml/badge.svg?branch=main\u0026event=schedule\"\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003ca href=\"https://doi.org/10.5281/zenodo.8147205\"\u003e\u003cimg src=\"https://zenodo.org/badge/DOI/10.5281/zenodo.8147205.svg\" alt=\"DOI\"\u003e\u003c/a\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd\u003e\u003cimg alt=\"PyPI - Total Downloads\" src=\"https://static.pepy.tech/personalized-badge/astartes?period=total\u0026units=none\u0026left_color=grey\u0026right_color=brightgreen\u0026left_text=Lifetime%20Downloads\"\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cimg alt=\"Documentation Status\" src=\"https://github.com/JacksonBurns/astartes/actions/workflows/gen_docs.yml/badge.svg\"\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cimg alt=\"PyPI\" src=\"https://img.shields.io/pypi/v/astartes\"\u003e \u003cimg alt=\"conda-forge version\" src=\"https://img.shields.io/conda/vn/conda-forge/astartes.svg\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd\u003e\u003cimg alt=\"GitHub Repo Stars\" 
src=\"https://img.shields.io/github/stars/JacksonBurns/astartes?style=social\"\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003ca href=\"https://www.repostatus.org/#active\"\u003e\u003cimg src=\"https://www.repostatus.org/badges/latest/active.svg\" alt=\"Project Status: Active – The project has reached a stable, usable state and is being actively developed.\" /\u003e\u003c/a\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003ca href=\"https://joss.theoj.org/papers/8a9cfc71d6f75410b06510a646d5f783\"\u003e\u003cimg src=\"https://joss.theoj.org/papers/8a9cfc71d6f75410b06510a646d5f783/status.svg\"\u003e\u003c/a\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/table\u003e\n\u003c/div\u003e\n\n\n## Online Documentation\nFollow [this link](https://JacksonBurns.github.io/astartes/) for a nicely-rendered version of this README along with additional tutorials for [moving from train_test_split in sklearn to astartes](https://jacksonburns.github.io/astartes/sklearn_to_astartes.html).\nKeep reading for an installation guide and links to tutorials!\n\n## Installing `astartes`\nWe recommend installing `astartes` within a virtual environment, using either `venv` or `conda` (or other tools) to simplify dependency management. Python versions 3.8, 3.9, 3.10, 3.11, and 3.12 are supported on all platforms.\n\n\u003e **Warning**\n\u003e Windows (PowerShell) and macOS Catalina or newer (zsh) require double quotes around text using the `'[]'` characters (i.e. 
`pip install \"astartes[molecules]\"`).\n\n### `pip`\n`astartes` is available on `PyPI` and can be installed using `pip`:\n\n - To include the featurization options for chemical data, use `pip install astartes[molecules]`.\n - To install only the sampling algorithms, use `pip install astartes` (this install will have fewer dependencies and may be more readily compatible in environments with existing workflows).\n\n### `conda`\nThe `astartes` package is also available on `conda-forge` with this command: `conda install -c conda-forge astartes`.\nTo install `astartes` with support for featurizing molecules, use: `conda install -c conda-forge astartes aimsim`.\nThis will download the base `astartes` package as well as `aimsim`, which is the backend used for molecular featurization.\n\nThe PyPI distribution has fewer dependencies for the `molecules` subpackage because it uses `aimsim_core` instead of `aimsim`.\nYou can achieve this on `conda` by first running `conda install -c conda-forge astartes` and then `pip install aimsim_core` (`aimsim_core` is not available on `conda-forge`).\n\n### Source\nTo install `astartes` from source for development, see the [Contributing \u0026 Developer Notes](#contributing--developer-notes) section.\n\n## Statement of Need\nMachine learning has sparked an explosion of progress in chemical kinetics, materials science, and many other fields as researchers use data-driven methods to accelerate steps in traditional workflows within some acceptable error tolerance. \nTo facilitate adoption of these models, there are two important tasks to consider:\n1. use a validation set when selecting the optimal hyperparameters for the model and separately use a held-out test set to measure performance on unseen data.\n2. 
evaluate model performance on both interpolative and extrapolative tasks so future users are informed of any potential limitations.\n\n`astartes` addresses both of these points by implementing an `sklearn`-compatible `train_val_test_split` function.\nAdditional technical detail is provided below as well as in our companion paper in the Journal of Open Source Software: [Machine Learning Validation via Rational Dataset Sampling with astartes](https://joss.theoj.org/papers/10.21105/joss.05996).\nFor a demo-based explainer using machine learning on a fast food menu, see the `astartes` Reproducible Notebook published at the United States Research Software Engineers Conference at [this page](https://jacksonburns.github.io/use-rse-23-astartes/split_comparisons.html).\n\n### Target Audience\n`astartes` is generally applicable to machine learning involving both discovery and inference _and_ model validation.\nThere are specific functions in `astartes` for applications in cheminformatics (`astartes.molecules`) but the methods implemented are general to all numerical data.\n\n## Quick Start\n`astartes` is designed as a drop-in replacement for `sklearn`'s `train_test_split` function (see the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)). 
To switch to `astartes`, change `from sklearn.model_selection import train_test_split` to `from astartes import train_test_split`.\n\nLike `sklearn`, `astartes` accepts any iterable object as `X`, `y`, and `labels`.\nEach will be converted to a `numpy` array for internal operations, and returned as a `numpy` array with limited exceptions: if `X` is a `pandas` `DataFrame`, `y` is a `Series`, or `labels` is a `Series`, `astartes` will cast it back to its original type including its index and column names.\n\n\u003e **Note**\n\u003e The developers recommend passing `X`, `y`, and `labels` as `numpy` arrays and handling the conversion to and from other types explicitly on your own. Behind-the-scenes type casting can lead to unexpected behavior!\n\nBy default, `astartes` will split data randomly. Additionally, a variety of algorithmic sampling approaches can be used by specifying the `sampler` argument to the function (see the [Table of Implemented Samplers](#implemented-sampling-algorithms) for a complete list of options and their corresponding references):\n\n```python\nfrom sklearn.datasets import load_diabetes\n\nX, y = load_diabetes(return_X_y=True)\n\nX_train, X_test, y_train, y_test = train_test_split(\n  X,  # preferably numpy arrays, but astartes will cast it for you\n  y,\n  sampler = 'kennard_stone',  # any of the supported samplers\n)\n```\n\n\u003e **Note**\n\u003e Extrapolation sampling algorithms will return an additional set of arrays (the cluster labels) which will result in a `ValueError: too many values to unpack` if not called properly. 
See the [`split_comparisons` Google colab demo](https://colab.research.google.com/github/JacksonBurns/astartes/blob/main/examples/split_comparisons/split_comparisons.ipynb) for a full explanation.\n\nThat's all you need to get started with `astartes`!\nThe next sections include more examples and some demo notebooks you can try in your browser.\n\n### Example Notebooks\n\nClick the badges in the table below to be taken to a live, interactive demo of `astartes`:\n\n| Demo | Topic | Link |\n|:---:|---|---|\n| Comparing Sampling Algorithms with Fast Food | Visual representations of how different samplers affect data partitioning | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JacksonBurns/astartes/blob/main/examples/split_comparisons/split_comparisons.ipynb) |\n| Using `train_val_test_split` with the `sklearn` example datasets | Demonstrating how withholding a test set with `train_val_test_split` can impact performance | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JacksonBurns/astartes/blob/main/examples/train_val_test_split_sklearn_example/train_val_test_split_example.ipynb) |\n| Cheminformatics sample set partitioning with `astartes` | Extrapolation vs. Interpolation impact on cheminformatics model accuracy | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JacksonBurns/astartes/blob/main/examples/barrier_prediction_with_RDB7/RDB7_barrier_prediction_example.ipynb) |\n| Comparing partitioning approaches for alkanes | Visualizing how samplers impact model performance with simple chemicals | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JacksonBurns/astartes/blob/main/examples/mlpds_2023_astartes_demonstration/mlpds_2023_demo.ipynb) |\n\nTo execute these notebooks locally, clone this repository (i.e. 
`git clone https://github.com/JacksonBurns/astartes.git`), navigate to the `astartes` directory, run `pip install .[demos]`, then open and run the notebooks in your preferred editor.\nYou do _not_ need to execute the cells prefixed with `%%capture` - they are only present for compatibility with Google Colab.\n\n#### Packages Using `astartes`\n - [Chemprop](https://github.com/chemprop/chemprop), a machine learning library for chemical property prediction, uses `astartes` in the backend for splitting molecular structures.\n - [`fastprop`](https://github.com/JacksonBurns/fastprop), a descriptor-based property prediction library, uses `astartes`.\n - [Google Scholar of articles citing the JOSS paper for `astartes`](https://scholar.google.com/scholar?cites=4693802000464819413\u0026as_sdt=40000005\u0026sciodt=0,22\u0026hl=en)\n\n### Withhold Testing Data with `train_val_test_split`\nFor rigorous ML research, it is critical to withhold some data during training to use as a `test` set.\nThe model should _never_ see this data during training (unlike the validation set) so that we can get an accurate measurement of its performance.\n\nWith `astartes`, performing this three-way data split is readily available with `train_val_test_split`:\n```python\nfrom astartes import train_val_test_split\n\nX_train, X_val, X_test = train_val_test_split(X, sampler = 'sphere_exclusion')\n```\nYou can now train your model with `X_train`, optimize your model with `X_val`, and measure its performance with `X_test`.\n\n### Evaluate the Impact of Splitting Algorithms on Regression Models\nFor data with many features, it can be difficult to visualize how different sampling algorithms change the distribution of data into training, validation, and testing as we do in some of the demo notebooks.\nTo aid in analyzing the impact of the algorithms, `astartes` provides `generate_regression_results_dict`.\nThis function allows users to quickly evaluate the impact of different splitting techniques on any 
`sklearn`-compatible model's performance.\nAll results are stored in a nested dictionary (`{sampler:{metric:{split:score}}}`) format and can be displayed in a neatly formatted table using the optional `print_results` argument.\n\n```python\nfrom sklearn.svm import LinearSVR\n\nfrom astartes.utils import generate_regression_results_dict as grrd\n\nsklearn_model = LinearSVR()\nresults_dict = grrd(\n    sklearn_model,\n    X,\n    y,\n    print_results=True,\n)\n```\n\nwhich prints a table like this:\n\n```\n         Train       Val      Test\n----  --------  --------  --------\nMAE   1.41522   3.13435   2.17091\nRMSE  2.03062   3.73721   2.40041\nR2    0.90745   0.80787   0.78412\n```\n\nAdditional metrics can be passed to `generate_regression_results_dict` via the `additional_metrics` argument, which should be a dictionary mapping the name of the metric (as a `string`) to the function itself, like this:\n\n```python\nfrom sklearn.metrics import mean_absolute_percentage_error\n\nadd_met = {\"mape\": mean_absolute_percentage_error}\n\ngrrd(sklearn_model, X, y, additional_metrics=add_met)\n```\n\nSee the docstring for `generate_regression_results_dict` (with `help(generate_regression_results_dict)`) for more information.\n\n### Using `astartes` with Categorical Data\nAny of the implemented sampling algorithms whose hyperparameters allow specifying the `metric` or `distance_metric` (effectively `1-metric`) can be co-opted to work with categorical data.\nSimply encode the data in a format compatible with the `sklearn` metric of choice and then call `astartes` with that metric specified:\n```python\nfrom sklearn.metrics import jaccard_score\n\nX_train, X_test, y_train, y_test = train_test_split(\n  X,\n  y,\n  sampler='kennard_stone',\n  hopts={\"metric\": jaccard_score},\n)\n```\n\nOther samplers which do not allow specifying a categorical distance metric did not provide a method for doing so in their original inception, though it is possible that they can be adapted for this application.\nIf you are interested in 
adding support for categorical metrics to an existing sampler, consider opening a [Feature Request](https://github.com/JacksonBurns/astartes/issues/new?assignees=\u0026labels=enhancement\u0026projects=\u0026template=feature_request.md\u0026title=%5BFEATURE%5D%3A+)!\n\n### Access Sampling Algorithms Directly\nThe sampling algorithms implemented in `astartes` can also be directly accessed and run if that is more useful for your application.\nIn the below example, we import the Kennard Stone sampler, use it to partition a simple array, and then retrieve a sample.\n```python\nfrom astartes.samplers.interpolation import KennardStone\n\nkennard_stone = KennardStone([[1, 2], [3, 4], [5, 6]])\nfirst_2_samples = kennard_stone.get_sample_idxs(2)\n```\nAll samplers in `astartes` implement a `_sample()` method that is called by the constructor (i.e. greedily) and either a `get_sample_idxs` or `get_cluster_idxs` for interpolative and extrapolative samplers, respectively.\nFor more detail on the implementation and design of samplers in `astartes`, see the [Developer Notes](#contributing--developer-notes) section.\n\n## Theory and Application of `astartes`\nThis section of the README details some of the theory behind why the algorithms implemented in `astartes` are important and some motivating examples.\nFor a comprehensive walkthrough of the theory and implementation of `astartes`, follow [this link](https://github.com/JacksonBurns/astartes/raw/joss-paper/Burns-Spiekermann-Bhattacharjee_astartes.pdf) to read the companion paper (freely available and hosted here on GitHub).\n\n\u003e **Note**\n\u003e We reference open-access publications wherever possible. 
For articles locked behind a paywall (denoted with :small_blue_diamond:), we instead suggest reading [this Wikipedia page](https://en.wikipedia.org/wiki/Sci-Hub) and absolutely __not__ attempting to bypass the paywall.\n\n### Rational Splitting Algorithms\nWhile much machine learning is done with a random choice between training/validation/test data, an alternative is the use of so-called \"rational\" splitting algorithms.\nThese approaches use some similarity-based algorithm to divide data into sets.\nSome of these algorithms include Kennard-Stone ([Kennard \u0026 Stone](https://www.tandfonline.com/doi/abs/10.1080/00401706.1969.10490666) :small_blue_diamond:), Sphere Exclusion ([Tropsha et al.](https://pubs.acs.org/doi/pdf/10.1021/ci300338w) :small_blue_diamond:), as well as OptiSim, as discussed in [Applied Chemoinformatics: Achievements and Future Opportunities](https://www.wiley.com/en-us/Applied+Chemoinformatics%3A+Achievements+and+Future+Opportunities-p-9783527806546) :small_blue_diamond:.\nSome clustering-based splitting techniques have also been incorporated, such as [DBSCAN](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1016.890\u0026rep=rep1\u0026type=pdf).\n\nThere are two broad categories of sampling algorithms implemented in `astartes`: extrapolative and interpolative.\nThe former will force your model to predict on out-of-sample data, which creates a more challenging task than interpolative sampling.\nSee the table below for all of the sampling approaches currently implemented in `astartes`, as well as the hyperparameters that each algorithm accepts (which are passed in with `hopts`) and a helpful reference for understanding how the hyperparameters work.\nNote that `random_state` is defined as a keyword argument in `train_test_split` itself, even though these algorithms will use the `random_state` in their own work.\nDo not provide a `random_state` in the `hopts` dictionary - it will be overwritten by the `random_state` you provide for 
`train_test_split` (or the default if none is provided).\n\n#### Implemented Sampling Algorithms\n\n| Sampler Name | Usage String | Type | Hyperparameters | Reference | Notes |\n|:---:|---|---|---|---|---|\n| Random | 'random' | Interpolative | `shuffle` | [sklearn train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) Documentation | This sampler is a direct passthrough to `sklearn`'s `train_test_split`. |\n| Kennard-Stone | 'kennard_stone' | Interpolative | `metric` | Original Paper by [Kennard \u0026 Stone](https://www.tandfonline.com/doi/abs/10.1080/00401706.1969.10490666) :small_blue_diamond: | Euclidean distance is used by default, as described in the original paper. |\n| Sample set Partitioning based on joint X-Y distances (SPXY) | 'spxy' | Interpolative | `distance_metric`, `distance_metric_X`, `distance_metric_y` (*) | Saldhana et al. [original paper](https://www.sciencedirect.com/science/article/abs/pii/S003991400500192X) :small_blue_diamond: | Extension of Kennard Stone that also includes the response when sampling distances. |\n| Mahalanobis Distance Kennard Stone (MDKS) | 'spxy' _(MDKS is derived from SPXY)_ | Interpolative | _none, see Notes_ | Saptoro et. 
al [original paper](https://espace.curtin.edu.au/bitstream/handle/20.500.11937/45101/217844_70585_PUB-SE-DCE-FM-71008.pdf?sequence=2\u0026isAllowed=y) | MDKS is SPXY using Mahalanobis distance and can be called by using SPXY with `distance_metric=\"mahalanobis\"` |\n| Scaffold | 'scaffold' | Extrapolative | `include_chirality` | [Bemis-Murcko Scaffold](https://pubs.acs.org/doi/full/10.1021/jm9602928) :small_blue_diamond: as implemented in RDKit | This sampler requires SMILES strings as input (use the `molecules` subpackage) |\n| Molecular Weight| 'molecular_weight' | Extrapolative | _none_ | ~ | Sorts molecules by molecular weight as calculated by RDKit |\n| Sphere Exclusion | 'sphere_exclusion' | Extrapolative | `metric`, `distance_cutoff` | _custom implementation_ | Variation on Sphere Exclusion for arbitrary-valued vectors. |\n| Time Based | 'time_based' | Extrapolative | _none_ | Papers using Time based splitting: [Chen et al.](https://pubs.acs.org/doi/full/10.1021/ci200615h) :small_blue_diamond:, [Sheridan, R. P](https://pubs.acs.org/doi/full/10.1021/ci400084k) :small_blue_diamond:, [Feinberg et al.](https://pubs.acs.org/doi/full/10.1021/acs.jmedchem.9b02187) :small_blue_diamond:, [Struble et al.](https://pubs.rsc.org/en/content/articlehtml/2020/re/d0re00071j) | This sampler requires `labels` to be an iterable of either date or datetime objects. |\n| Target Property | 'target_property' | Extrapolative | `descending` | ~ | Sorts data by regression target y |\n| Optimizable K-Dissimilarity Selection (OptiSim) | 'optisim' | Extrapolative | `n_clusters`, `max_subsample_size`, `distance_cutoff` | _custom implementation_ | Variation on [OptiSim](https://pubs.acs.org/doi/10.1021/ci025662h) for arbitrary-valued vectors. |\n| K-Means | 'kmeans' | Extrapolative | `n_clusters`, `n_init` | [`sklearn KMeans`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) | Passthrough to `sklearn`'s `KMeans`. 
|\n| Density-Based Spatial Clustering of Applications with Noise (DBSCAN) | 'dbscan' | Extrapolative | `eps`, `min_samples`, `algorithm`, `metric`, `leaf_size` | [`sklearn DBSCAN`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) Documentation| Passthrough to `sklearn`'s `DBSCAN`. |\n| Minimum Test Set Dissimilarity (MTSD) | ~ | ~ | _upcoming in_ `astartes` _v1.x_ | ~ | ~ |\n| Restricted Boltzmann Machine (RBM) | ~ | ~ | _upcoming in_ `astartes` _v1.x_ | ~ | ~ |\n| Kohonen Self-Organizing Map (SOM) | ~ | ~ | _upcoming in_ `astartes` _v1.x_ | ~ | ~ |\n| SPlit Method | ~ | ~ | _upcoming in_ `astartes` _v1.x_ | ~ | ~ |\n\n(*) specifying `distance_metric_X` or `distance_metric_y` will override the choice of `distance_metric`\n\n### Domain-Specific Applications\nBelow are some field specific applications of `astartes`. Interested in adding a new sampling algorithm or featurization approach? See [`CONTRIBUTING.md`](./CONTRIBUTING.md).\n\n#### Chemical Data and the `astartes.molecules` Subpackage\nMachine Learning is enormously useful in chemistry-related fields due to the high-dimensional feature space of chemical data.\nTo properly apply ML to chemical data for inference _or_ discovery, it is important to know a model's accuracy under the two domains.\nTo simplify the process of partitioning chemical data, `astartes` implements a pre-built featurizer for common chemistry data formats.\nAfter installing with `pip install astartes[molecules]` one can import the new train/test splitting function like this: `from astartes.molecules import train_test_split_molecules`\n\nThe usage of this function is identical to `train_test_split` but with the addition of new arguments to control how the molecules are featurized:\n\n```python\ntrain_test_split_molecules(\n    molecules=smiles,\n    y=y,\n    test_size=0.2,\n    train_size=0.8,\n    fingerprint=\"daylight_fingerprint\",\n    fprints_hopts={\n        \"fpSize\": 200,\n        
\"numBitsPerFeature\": 4,\n        \"useHs\": True,\n    },\n    sampler=\"random\",\n    random_state=42,\n    hopts={\n        \"shuffle\": True,\n    },\n)\n```\n\nTo see a complete example of using `train_test_split_molecules` with actual chemical data, take a look in the `examples` directory and the brief [companion paper](https://github.com/JacksonBurns/astartes/raw/joss-paper/Burns-Spiekermann-Bhattacharjee_astartes.pdf).\n\nConfiguration options for the featurization scheme can be found in the documentation for [AIMSim](https://vlachosgroup.github.io/AIMSim/README.html#currently-implemented-fingerprints), though most of the critical configuration options are shown above.\n\n## Reproducibility\n`astartes` aims to be completely reproducible across different platforms, Python versions, and dependency configurations - any version of `astartes` v1.x should result in the _exact_ same splits, always.\nTo that end, the default behavior of `astartes` is to use `42` as the random seed and _always_ set it.\nRunning `astartes` with the default settings will always produce the exact same results.\nWe have verified this behavior on Debian, Ubuntu, Windows, and Intel Macs with Python versions 3.7 through 3.11 (with appropriate dependencies for each version).\n\n### Known Reproducibility Limitations\nInevitably, external dependencies of `astartes` will introduce backwards-incompatible changes.\nWe continually run regression tests to catch these, and will list all _known_ limitations here:\n - `sklearn` v1.3.0 introduced backwards-incompatible changes in the `KMeans` sampler that changed how the random initialization affects the results, even given the same random seed. 
Different versions of `sklearn` will affect the performance of `astartes`, and we recommend including the exact versions of `scikit-learn` and `astartes` used, when applicable.\n\n\u003e **Note**\n\u003e We are limited in our ability to test on M1 Macs, but from our limited manual testing we achieve perfect reproducibility in all cases _except occasionally_ with `KMeans` on Apple silicon.\n`astartes` is still consistent between runs on the same platform in all cases, and other samplers are not impacted by this apparent bug.\n\n## How to Cite\nIf you use `astartes` in your work, please follow the link below to our (Open Access!) paper in the Journal of Open Source Software or use the \"Cite this repository\" button on GitHub.\n\n[Machine Learning Validation via Rational Dataset Sampling with astartes](https://joss.theoj.org/papers/10.21105/joss.05996)\n\n## Contributing \u0026 Developer Notes\nSee [CONTRIBUTING.md](./CONTRIBUTING.md) for instructions on installing `astartes` for development, making a contribution, and general guidance on the design of `astartes`.\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjacksonburns%2Fastartes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjacksonburns%2Fastartes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjacksonburns%2Fastartes/lists"}