# ⚔️ vs-split: a library for creating adversarial splits

> **Warning**
> This library is still a work in progress. Use at your own risk!

Have you ever encountered a problem where **your model works well on your test set
but doesn't perform well in the wild?** It's likely because your test set doesn't
reflect the reality of your domain, so it overestimates your model's performance.[^1]

This library provides **alternative ways to split and sanity-check your datasets**
so that your models remain robust once you deploy them to production.

[^1]: Check out my blog post, [*Your train-test split may be doing you a disservice*](https://ljvmiranda921.github.io/2022/08/30/adversarial-splits/), for a technical overview of this problem.

## ⏳ Installation

You can install `vs-split` via `pip`:

```sh
pip install vs-split
```

Alternatively, you can install from source:

```sh
git clone https://github.com/ljvmiranda921/vs-split
cd vs-split
python setup.py install
```

## 👩‍💻 Usage

The library exposes two main functions:

- **`train_test_split(X: Iterable, y: Iterable, split_id: str, **attrs)`** that accepts [NumPy arrays](https://numpy.org/doc/stable/reference/generated/numpy.array.html) of your features and labels. You can pass any arbitrary NumPy array or list for splitting.
- **`spacy_train_test_split(docs: Iterable[Doc], split_id: str, **attrs)`** that accepts an iterable of [spaCy Doc objects](https://spacy.io/api/doc).[^2] [spaCy](https://spacy.io) is a Python library for natural language processing and the Doc object is one of its core data structures. This function is useful if you're working on linguistic data.

For both functions, you can choose the type of split via the `split_id`
parameter (cf. the [splitters catalogue](#splitters-catalogue)) and pass custom
keyword arguments.

```python
from vs_split import train_test_split, spacy_train_test_split

# For most datasets
X_train, y_train, X_test, y_test = train_test_split(X_data, y_data, split_id="wasserstein.v1")
# For spaCy Doc objects
docs_train, docs_test = spacy_train_test_split(docs, split_id="wasserstein-spacy.v1")
```

> **Note**
> It might look like `vs-split` has a similar API to [scikit-learn's
> `train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html),
> but that's not the case. Unlike the latter, `vs_split.train_test_split` doesn't accept
> an arbitrary number of iterables, and the keyword parameters are also different.

[^2]: vs-split has first-class support for spaCy. The main reason is that I've been using this for some internal robustness experiments to test some of our [pipeline components](https://spacy.io/usage/processing-pipelines).

### Registering your own splitters

You can also register custom splitters via the `splitters` catalogue. Here's an
example of a splitter, `random-spacy.v1`, that splits a list of spaCy Doc objects
given a training set size:

```python
import random
from typing import Iterable

from spacy.tokens import Doc
from vs_split.splitters import splitters

@splitters.register("random-spacy.v1")
def random_spacy(docs: Iterable[Doc], train_size: float):
    # Copy into a list so shuffling doesn't mutate the caller's data
    docs = list(docs)
    random.shuffle(docs)
    num_train = int(len(docs) * train_size)
    train_docs = docs[:num_train]
    test_docs = docs[num_train:]
    return train_docs, test_docs
```

Under the hood, `vs-split` uses
[`catalogue`](https://github.com/explosion/catalogue) to manage the functions
you register. You're free to return any value or object from your splitter
implementation; nothing enforces a particular blueprint. For consistency,
however, it's advisable to follow the type signatures of the other splitters.
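
Once registered, the splitter can be selected by its name. Below is a minimal
sketch of calling it; it assumes that the extra keyword arguments passed to
`spacy_train_test_split` (here, `train_size`) are forwarded to the registered
function:

```python
# Sketch: select the custom splitter by the name it was registered under.
# Assumes extra keyword arguments are forwarded to the splitter function.
docs_train, docs_test = spacy_train_test_split(
    docs, split_id="random-spacy.v1", train_size=0.8
)
```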

### More examples

You can find more examples in the
[`examples/`](https://github.com/ljvmiranda921/vs-split/tree/main/examples)
directory. It contains a sample project that runs various spaCy splitters on
the [English WikiNeural dataset](https://paperswithcode.com/dataset/wikineural).

## 🎛 API

### function `train_test_split`

Split a dataset into training and test partitions. By default, it returns the
training and test features and labels.

| Argument | Type | Description |
|-------------|------------|--------------------------------------------------------|
| `*X` | Iterable | An iterable of features, preferably a `numpy.ndarray`. |
| `*y` | Iterable | An iterable of labels, preferably a `numpy.ndarray`. |
| `*split_id` | str | The type of split to use. |
| **RETURNS** | Tuple[Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any]] | The training and testing features and labels (i.e. `X_train`, `y_train`, `X_test`, `y_test`). |

### function `spacy_train_test_split`

Split a list of spaCy `Doc` objects into training and test partitions. By default, it returns the training and test `Doc` objects.

| Argument | Type | Description |
|-------------|--------------|--------------------------------------------------------|
| `*docs` | Iterable[Doc]| An iterable of spaCy Doc objects to split. |
| `*split_id` | str | The type of split to use. |
| **RETURNS** | Tuple[Iterable[Doc], Iterable[Doc]] | The training and testing spaCy Doc objects. |

### Splitters Catalogue

### vs_split.splitters `wasserstein.v1`

Perform adversarial splitting using a divergence maximization method involving [Wasserstein distance](https://en.wikipedia.org/wiki/Wasserstein_metric).

This method approximates the test split by performing nearest-neighbor search on
a random centroid. Based on Søgaard, Ebert et al.'s work on '[We Need to Talk
About Random Splits](https://aclanthology.org/2021.eacl-main.156/)' (EACL 2021).

| Argument | Type | Description |
|-------------|------------|--------------------------------------------------------|
| `*X` | Iterable | An iterable of features, preferably a `numpy.ndarray`. |
| `*y` | Iterable | An iterable of labels, preferably a `numpy.ndarray`. |
| `test_size` | float | The fraction of the dataset to reserve for the test split; this determines the number of neighbors to query. Defaults to `0.2`. |
| `leaf_size` | int | The leaf size parameter for nearest neighbor search. High values are slower. Defaults to `3`. |
| **RETURNS** | Tuple[Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any]] | The training and testing features and labels (i.e. `X_train`, `y_train`, `X_test`, `y_test`). |
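
For illustration, here's a minimal sketch of calling this splitter on synthetic
NumPy data (the arrays are made up; the keyword arguments follow the table
above):

```python
import numpy as np
from vs_split import train_test_split

# Synthetic features and binary labels, for illustration only
X = np.random.rand(1000, 64)
y = np.random.randint(0, 2, size=1000)

X_train, y_train, X_test, y_test = train_test_split(
    X, y, split_id="wasserstein.v1", test_size=0.2, leaf_size=3
)
```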

### vs_split.splitters `wasserstein-spacy.v1`

A spaCy-compatible version of `wasserstein.v1`. If no vectors are found in the
`Doc` objects, TF-IDF vectors are computed instead.

| Argument | Type | Description |
|-------------|--------------|--------------------------------------------------------|
| `*docs` | Iterable[Doc]| An iterable of spaCy Doc objects to split. |
| `test_size` | float | The fraction of the dataset to reserve for the test split; this determines the number of neighbors to query. Defaults to `0.2`. |
| `leaf_size` | int | The leaf size parameter for nearest neighbor search. High values are slower. Defaults to `3`. |
| `use_counts`| bool | Use count vectors instead of initialized vectors. If no vectors were found, the count vectors are automatically used. Defaults to `False`. |
| `min_df` | Union[int, float] | Remove terms that appear too infrequently, given a threshold. Defaults to `0.10`. |
| `n_jobs` | Optional[int] | Number of parallel jobs to run for neighbor search. Defaults to `-1` (use all CPUs). |
| **RETURNS** | Tuple[Iterable[Doc], Iterable[Doc]] | The training and testing spaCy Doc objects. |
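
A minimal sketch, assuming you build the `Doc` objects with an existing spaCy
pipeline (`corpus_texts` below is a placeholder for your own raw texts):

```python
import spacy
from vs_split import spacy_train_test_split

# Assumes a pipeline with vectors (e.g. en_core_web_md) is installed;
# `corpus_texts` is a placeholder for your own list of raw texts.
nlp = spacy.load("en_core_web_md")
docs = list(nlp.pipe(corpus_texts))

docs_train, docs_test = spacy_train_test_split(
    docs, split_id="wasserstein-spacy.v1", test_size=0.2
)
```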

### vs_split.splitters `doc-length.v1`

Heuristic split based on document length.

By default, it looks for a sentence-length threshold and puts all longer
sentences in the test split. The threshold is chosen so that approximately 10%
of the data ends up in the test set.

| Argument | Type | Description |
|-------------|--------------|--------------------------------------------------------|
| `*docs` | Iterable[Doc]| An iterable of spaCy Doc objects to split. |
| `test_size` | Optional[float] | The size of the test set for determining the split. Defaults to `0.1`. |
| `length_threshold` | Optional[int] | A fixed length threshold to split against, overriding the automatic one. Defaults to `None`. |
| **RETURNS** | Tuple[Iterable[Doc], Iterable[Doc]] | The training and testing spaCy Doc objects. |
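
For example, a sketch that forces every document longer than a fixed threshold
into the test split (`length_threshold` here overrides the automatic threshold,
per the table above):

```python
# Sketch: documents longer than the threshold land in the test split
docs_train, docs_test = spacy_train_test_split(
    docs, split_id="doc-length.v1", length_threshold=30
)
```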

### vs_split.splitters `morph-attrs-split.v1`

Perform a heuristic split based on morphological attributes.

This method is loosely based on the paper '[(Un)solving Morphological Inflection: Lemma Overlap Artificially Inflates Models' Performance](https://aclanthology.org/2022.acl-short.96/)' by Goldman
et al. (ACL 2022). However, instead of focusing solely on lemma splits, this
method uses morphological attributes. The main motivation is that splitting
on lemmas doesn't translate well to standard texts.

| Argument | Type | Description |
|-------------|--------------|--------------------------------------------------------|
| `*docs` | Iterable[Doc]| An iterable of spaCy Doc objects to split. |
| `attrs` | List[str] | Morphological attributes to split against. Defaults to `["Number", "Person"]`. |
| `test_size` | Optional[float] | The size of the test set for determining the split. Defaults to `0.1`. |
| **RETURNS** | Tuple[Iterable[Doc], Iterable[Doc]] | The training and testing spaCy Doc objects. |
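
A minimal sketch of calling this splitter with the default attributes (the
keyword arguments follow the table above):

```python
# Sketch: split on the "Number" and "Person" morphological attributes
docs_train, docs_test = spacy_train_test_split(
    docs,
    split_id="morph-attrs-split.v1",
    attrs=["Number", "Person"],
    test_size=0.1,
)
```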

### vs_split.splitters `entity-switch.v1`

Manually perturb the test set by switching entities based on a given
dictionary of patterns.

This work is based on the paper, '[Entity-Switched Datasets - An Approach to
Auditing the In-Domain Robustness of Named Entity Recognition
Models](https://arxiv.org/abs/2004.04123)' by Agarwal et al. You can control
which entity labels are switched using a **patterns dictionary**.

The patterns dictionary should have **the entity label as the key and a list of
strings as its values.** For example, if we want to switch all `ORG` entities in
the original document with values such as `Bene Gesserit`, `Landsraad`, or
`Spacing Guild`, then we should provide a dictionary that looks like this:

```python
# An example patterns file
patterns = {'ORG': ['Bene Gesserit', 'Landsraad', 'Spacing Guild']}
```

You can add as many patterns or entity labels to the dictionary as you like.
The pattern used for substitution is chosen via
[`random.choice`](https://docs.python.org/3/library/random.html#random.choice).
Lastly, for `PER` entities, this splitter **does not** differentiate between
first and full names; it simply performs a drop-in replacement.

> **Note**
> Implementation-wise, the entity switching is done by recreating the spaCy
> Doc object. Note that the resulting Docs will only include the text and the
> entity annotations. Any information from the previous pipeline (MORPHS,
> etc.) will be lost.

| Argument | Type | Description |
|-------------|--------------|--------------------------------------------------------|
| `*docs` | Iterable[Doc]| An iterable of spaCy Doc objects to split. |
| `*patterns` | Dict[str, List[str]] | Dictionary of patterns for substitution. |
| `test_size` | Optional[float] | If provided, then the docs will be split further. Since entity-switching is only needed for the test set, you can just pass the test documents in this function. Defaults to `None`. |
| **RETURNS** | Tuple[Iterable[Doc], Iterable[Doc]] | The training and testing spaCy Doc objects. |
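
Putting it together, here's a sketch that switches `ORG` entities using the
patterns dictionary from the example above (the keyword arguments follow the
table):

```python
from vs_split import spacy_train_test_split

# Patterns dictionary from the example above
patterns = {"ORG": ["Bene Gesserit", "Landsraad", "Spacing Guild"]}

# Per the table, passing test_size splits the docs further; entity-switching
# itself is only applied to the test portion.
docs_train, docs_test = spacy_train_test_split(
    docs, split_id="entity-switch.v1", patterns=patterns, test_size=0.2
)
```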