{"id":15519010,"url":"https://github.com/ljvmiranda921/vs-split","last_synced_at":"2025-04-23T04:15:56.794Z","repository":{"id":41511121,"uuid":"508568532","full_name":"ljvmiranda921/vs-split","owner":"ljvmiranda921","description":"A Python library for creating adversarial splits ","archived":false,"fork":false,"pushed_at":"2022-07-24T07:41:37.000Z","size":2020,"stargazers_count":13,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-23T04:15:40.743Z","etag":null,"topics":["adversarial-examples","adversarial-machine-learning","machine-learning","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ljvmiranda921.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-06-29T06:12:05.000Z","updated_at":"2023-05-18T15:14:22.000Z","dependencies_parsed_at":"2022-07-12T23:20:43.067Z","dependency_job_id":null,"html_url":"https://github.com/ljvmiranda921/vs-split","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ljvmiranda921%2Fvs-split","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ljvmiranda921%2Fvs-split/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ljvmiranda921%2Fvs-split/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ljvmiranda921%2Fvs-split/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ljvmiranda921","download_url":"https://codeload.github.com/ljvmiranda921/
vs-split/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250366715,"owners_count":21418772,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adversarial-examples","adversarial-machine-learning","machine-learning","python"],"created_at":"2024-10-02T10:19:49.448Z","updated_at":"2025-04-23T04:15:56.758Z","avatar_url":"https://github.com/ljvmiranda921.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ⚔️ vs-split: a library for creating adversarial splits\n\n\u003e **Warning**\n\u003e This library is still a work in progress. 
Use at your own risk!\n\nHave you ever encountered a problem where **your model works well on your test set\nbut doesn't perform well in the wild?**  It's likely because your test set does\nnot reflect the reality of your domain, overestimating your model's performance.[^1]\n\nThis library provides **alternative ways to split and sanity-check your datasets**\nso that your models stay robust once you deploy them into production.\n\n[^1]: Check out my blog post, [*Your train-test split may be doing you a disservice*](https://ljvmiranda921.github.io/2022/08/30/adversarial-splits/), for a technical overview of this problem.\n\n## ⏳ Installation\n\nYou can install `vs-split` via `pip`:\n\n```sh\npip install vs-split\n```\n\nAlternatively, you can install from source:\n\n```sh\ngit clone https://github.com/ljvmiranda921/vs-split\ncd vs-split\npython setup.py install\n```\n\n## 👩‍💻 Usage\n\nThe library exposes two main functions: \n\n- **`train_test_split(X: Iterable, y: Iterable, split_id: str, **attrs)`** that accepts [NumPy arrays](https://numpy.org/doc/stable/reference/generated/numpy.array.html) of your features and labels. You can pass any arbitrary NumPy array or list for splitting.\n- **`spacy_train_test_split(docs: Iterable[Doc], split_id: str, **attrs)`** that accepts an iterable of [spaCy Doc objects](https://spacy.io/api/doc).[^2] [spaCy](https://spacy.io) is a Python library for natural language processing and the Doc object is one of its core data structures. This function is useful if you're working on linguistic data.  \n\nFor both functions, you can provide the type of split in the `split_id`\nparameter (cf. 
[splitters catalogue](#splitters-catalogue)) and pass custom\nkeyword arguments.\n\n```python\nfrom vs_split import train_test_split, spacy_train_test_split\n\n# For most datasets\nX_train, y_train, X_test, y_test = train_test_split(X_data, y_data, split_id=\"wasserstein.v1\")\n# For spaCy Doc objects\ndocs_train, docs_test = spacy_train_test_split(docs, split_id=\"wasserstein-spacy.v1\")\n```\n\n\u003e **Note**\n\u003e It might look like `vs-split` has an API similar to [scikit-learn's\n\u003e `train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html),\n\u003e but that's not the case.  Unlike the latter, `vs_split.train_test_split` doesn't expect\n\u003e an arbitrary number of iterables, and the keyword parameters are also different.\n\n[^2]: vs-split has first-class support for spaCy. The main reason is that I've been using this for some internal robustness experiments to test some of our [pipeline components](https://spacy.io/usage/processing-pipelines).\n\n### Registering your own splitters\n\nYou can also register custom splitters via the `splitters` catalogue. Here's an\nexample of a splitter, `random-spacy.v1`, that splits a list of spaCy Doc objects\ngiven a training set size:\n\n```python\nimport random\nfrom typing import Iterable\n\nfrom spacy.tokens import Doc\nfrom vs_split.splitters import splitters\n\n@splitters.register(\"random-spacy.v1\")\ndef random_spacy(docs: Iterable[Doc], train_size: float):\n    # Copy into a list so any iterable works and the caller's data isn't shuffled in place\n    docs = list(docs)\n    random.shuffle(docs)\n    # train_size is the fraction of documents that go to the training set\n    num_train = int(len(docs) * train_size)\n    train_docs = docs[:num_train]\n    test_docs = docs[num_train:]\n    return train_docs, test_docs\n```\n\nUnder the hood, `vs-split` uses\n[`catalogue`](https://github.com/explosion/catalogue) to manage the functions\nyou registered. You are free to return any value or object in your\nsplitter implementation; that is, nothing forces you to\nfollow the blueprint. 
However, for consistency, it's advisable to follow the\ntype signature of the other splitters.\n\n### More examples\n\nYou can find more in the\n[`examples/`](https://github.com/ljvmiranda921/vs-split/tree/main/examples)\ndirectory. It contains a sample project that runs the [English WikiNeural\ndataset](https://paperswithcode.com/dataset/wikineural) on various spaCy\nsplitters.\n\n## 🎛 API\n\n### \u003ckbd\u003efunction\u003c/kbd\u003e `train_test_split`\n\nSplit a dataset into its training and testing partitions. By default, it should\nreturn the training and testing features and labels respectively. \n\n| Argument    | Type       | Description                                            |\n|-------------|------------|--------------------------------------------------------|\n| `*X`        | Iterable   | An iterable of features, preferably a `numpy.ndarray`. |\n| `*y`        | Iterable   | An iterable of labels, preferably a `numpy.ndarray`.   |\n| `*split_id` | str        | The type of split to use.                              |\n| **RETURNS** | Tuple[Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any]] | The training and testing features and labels (i.e. `X_train`, `y_train`, `X_test`, `y_test`). |\n\n\n### \u003ckbd\u003efunction\u003c/kbd\u003e `spacy_train_test_split`\n\nSplit a list of spaCy `Doc` objects into its training and testing partitions. By default, it should return the training and test spaCy Doc objects respectively.\n\n| Argument    | Type         | Description                                            |\n|-------------|--------------|--------------------------------------------------------|\n| `*docs`     | Iterable[Doc]| An iterable of spaCy Doc objects to split.             |\n| `*split_id` | str          | The type of split to use.                              |\n| **RETURNS** | Tuple[Iterable[Doc], Iterable[Doc]] | The training and testing spaCy Doc objects. 
|\n\n\n### Splitters Catalogue\n\n### \u003ckbd\u003evs_split.splitters\u003c/kbd\u003e `wasserstein.v1`\n\nPerform adversarial splitting using a divergence maximization method involving [Wasserstein distance](https://en.wikipedia.org/wiki/Wasserstein_metric).\n\nThis method approximates the test split by performing nearest-neighbor search on\na random centroid. Based on Søgaard, Ebert et al.'s work on '[We Need to Talk\nAbout Random Splits](https://aclanthology.org/2021.eacl-main.156/)' (EACL 2021).\n\n| Argument    | Type       | Description                                            |\n|-------------|------------|--------------------------------------------------------|\n| `*X`        | Iterable   | An iterable of features, preferably a `numpy.ndarray`. |\n| `*y`        | Iterable   | An iterable of labels, preferably a `numpy.ndarray`.   |\n| `test_size` | float      | The number of neighbors to query. Defaults to `0.2`    |\n| `leaf_size` | int        | The leaf size parameter for nearest neighbor search. High values are slower. Defaults to `3`.    |\n| **RETURNS** | Tuple[Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any]] | The training and testing features and labels (i.e. `X_train`, `y_train`, `X_test`, `y_test`). |\n\n\n### \u003ckbd\u003evs_split.splitters\u003c/kbd\u003e `spacy-wasserstein.v1`\n\nspaCy-compatible version of `wasserstein.v1`. If no vectors were found in the \n`Doc` object, then TF-IDF is computed.\n\n| Argument    | Type         | Description                                            |\n|-------------|--------------|--------------------------------------------------------|\n| `*docs`     | Iterable[Doc]| An iterable of spaCy Doc objects to split.             |\n| `test_size` | float      | The number of neighbors to query. Defaults to `0.2`.    |\n| `leaf_size` | int        | The leaf size parameter for nearest neighbor search. High values are slower. Defaults to `3`.    
|\n| `use_counts`| bool       | Use count vectors instead of initialized vectors. If no vectors were found, the count vectors are automatically used. Defaults to `False`.   | \n| `min_df`    | Union[int, float] | Remove terms that appear too infrequently given a threshold. Defaults to `0.10`. | \n| `n_jobs`    | Optional[int]   | Number of parallel jobs to run for neighbor search. Defaults to `-1` (use all CPUs). |\n| **RETURNS** | Tuple[Iterable[Doc], Iterable[Doc]] | The training and testing spaCy Doc objects. |\n\n\n### \u003ckbd\u003evs_split.splitters\u003c/kbd\u003e `doc-length.v1`\n\nHeuristic split based on document length.\n\nBy default, it looks for a sentence length threshold, and puts all the longer\nsentences in the test split. The threshold is chosen so that approximately 10%\nof the data ends up in the test set. \n\n| Argument    | Type         | Description                                            |\n|-------------|--------------|--------------------------------------------------------|\n| `*docs`     | Iterable[Doc]| An iterable of spaCy Doc objects to split.             |\n| `test_size` | Optional[float]      | The size of the test set for determining the split. Defaults to `0.1`.    |\n| `length_threshold` | Optional[int] | Arbitrary length to split the dataset against. Defaults to `None`. |\n| **RETURNS** | Tuple[Iterable[Doc], Iterable[Doc]] | The training and testing spaCy Doc objects. |\n\n### \u003ckbd\u003evs_split.splitters\u003c/kbd\u003e `morph-attrs-split.v1`\n\nPerform a heuristic split based on morphological attributes.\n\nThis method is loosely based on the paper '[(Un)solving Morphological Inflection: Lemma Overlap Artificially Inflates Models' Performance](https://aclanthology.org/2022.acl-short.96/)' by Goldman\net al. (ACL 2022). However, instead of focusing solely on lemma splits, this\nmethod uses morphological attributes. 
The main motivation is that splitting\non lemma doesn't translate well to standard texts.\n\n\n| Argument    | Type         | Description                                            |\n|-------------|--------------|--------------------------------------------------------|\n| `*docs`     | Iterable[Doc]| An iterable of spaCy Doc objects to split.             |\n| `attrs`     | List[str]     | Morphological attributes to split against. Defaults to `[\"Number\", \"Person\"]`. |\n| `test_size` | Optional[float]      | The size of the test set for determining the split. Defaults to `0.1`.    |\n| **RETURNS** | Tuple[Iterable[Doc], Iterable[Doc]] | The training and testing spaCy Doc objects. |\n\n\n### \u003ckbd\u003evs_split.splitters\u003c/kbd\u003e `entity-switch.v1`\n\nManually perturb the test set by switching entities based on a given\ndictionary of patterns.\n\nThis work is based on the paper, '[Entity-Switched Datasets - An Approach to\nAuditing the In-Domain Robustness of Named Entity Recognition\nModels](https://arxiv.org/abs/2004.04123)' by Agarwal et al. You can control\nwhich entity labels are switched using a **patterns dictionary**.\n\nThe patterns dictionary should have **the entity label as the key and a list of\nstrings as its values.** For example, if we want to switch all `ORG` entities in\nthe original document with values such as `Bene Gesserit`, `Landsraad`, or\n`Spacing Guild`, then we should provide a dictionary that looks like this:\n\n```python\n# An example patterns file\npatterns = {'ORG': ['Bene Gesserit', 'Landsraad', 'Spacing Guild']}\n```\n\nYou can add as many patterns or entity labels to the dictionary as you like. The pattern\nused for substitution is chosen via\n[`random.choice`](https://docs.python.org/3/library/random.html#random.choice).\nLastly, for `PER` entities, this splitter **does not** differentiate between\nfirst and full names. 
It just performs a drop-in replacement.\n\n\u003e **Note**\n\u003e Implementation-wise, the entity switching is done by recreating the spaCy\n\u003e Doc object.  Note that the resulting Docs will only include the text and the\n\u003e entity annotations. Any information from the previous pipeline (MORPHS,\n\u003e etc.) will be lost.\n\n\n| Argument    | Type         | Description                                            |\n|-------------|--------------|--------------------------------------------------------|\n| `*docs`     | Iterable[Doc]| An iterable of spaCy Doc objects to split.             |\n| `*patterns` | Dict[str, List[str]] | Dictionary of patterns for substitution.             |\n| `test_size` | Optional[float]      | If provided, then the docs will be split further. Since entity-switching is only needed for the test set, you can just pass the test documents in this function. Defaults to `None`.    |\n| **RETURNS** | Tuple[Iterable[Doc], Iterable[Doc]] | The training and testing spaCy Doc objects. |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fljvmiranda921%2Fvs-split","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fljvmiranda921%2Fvs-split","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fljvmiranda921%2Fvs-split/lists"}