https://github.com/trailofbits/datasig

Dataset fingerprinting for AIBOM

Host: GitHub
URL: https://github.com/trailofbits/datasig
Owner: trailofbits
License: agpl-3.0
Created: 2024-11-06T08:26:11.000Z (8 months ago)
Default Branch: main
Last Pushed: 2025-04-04T15:00:01.000Z (3 months ago)
Last Synced: 2025-04-04T16:27:48.292Z (3 months ago)
Language: Python
Homepage:
Size: 157 KB
Stars: 1
Watchers: 3
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-ai-security - datasig - _Dataset fingerprinting for AIBOM_ (Defensive tools and frameworks / Data security and governance)

README

# Dataset fingerprinting

This repository contains our proof-of-concept for fingerprinting a dataset.

## Local installation

```bash
git clone https://github.com/trailofbits/datasig && cd datasig
python3 -m pip install .
```

## Usage

The code below shows experimental usage of the library. The API is subject to frequent change during early development.

```python
from torchvision.datasets import MNIST
from datasig.dataset import TorchVisionDataset, CanonicalDataset

# Download (if needed) and load the MNIST training split.
torch_dataset = MNIST(root="/tmp/data", train=True, download=True)

# Wrap the torchvision dataset, then build its canonical form.
dataset = TorchVisionDataset(torch_dataset)
canonical = CanonicalDataset(dataset)

print("Dataset UID: ", canonical.uid)
print("Dataset fingerprint: ", canonical.fingerprint)
```

## Development

### Unit tests

Tests are in the `datasig/test` directory. You can run the tests with:

```bash
python3 -m pytest # Run all tests
python3 -m pytest -s datasig/test/test_csv.py # Run only one test file
python3 -m pytest -s datasig/test/test_csv.py -k test_similarity # Run only one specific test function
```

### Profiling

The profiling script uses cProfile to generate a profile of dataset processing and fingerprint generation. To profile the MNIST dataset from the torch framework, run:

```bash
python3 profiling.py torch_mnist --full
```

The `--full` argument tells the script to include dataset canonization, UID generation, and fingerprint generation in the profile. To profile only some of these steps, pass just the corresponding arguments instead of `--full`:

```bash
python3 profiling.py torch_mnist --canonical --uid --fingerprint
```
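For readers unfamiliar with cProfile, the sketch below shows how individual steps can be profiled separately, which is roughly what selecting `--canonical`, `--uid`, or `--fingerprint` implies. It is not taken from `profiling.py`: `canonicalize`, `make_uid`, and `profile_call` are hypothetical stand-ins, and the hashing shown is not datasig's algorithm.

```python
import cProfile
import hashlib
import io
import pstats


def canonicalize(rows):
    # Hypothetical stand-in for dataset canonization: serialize rows to bytes.
    return [repr(row).encode("utf-8") for row in rows]


def make_uid(canonical_rows):
    # Hypothetical stand-in for UID generation: hash rows in sorted order.
    digest = hashlib.sha256()
    for row in sorted(canonical_rows):
        digest.update(row)
    return digest.hexdigest()


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the five most expensive entries by cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


rows = [(i, i * i) for i in range(1000)]
# Each step gets its own profile, so steps can be included or skipped.
canonical_rows, canonical_report = profile_call(canonicalize, rows)
uid, uid_report = profile_call(make_uid, canonical_rows)
print(uid_report)
```

Profiling each step in its own `cProfile.Profile` keeps the reports separate, so the cost of one step does not hide the cost of another.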

You can optionally specify which datasig config version to use (at the time of writing, only v0 exists) with:

```bash
python3 profiling.py torch_mnist -v 0 --full
```

Currently we support only one target dataset: `torch_mnist`. To add another dataset, add a class to `profiling.py`, similar to `TorchMNISTV0`, that implements the `_setup()` method. This method is responsible for loading the dataset.
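The pattern described above can be sketched as follows. This is only an illustration of a class that loads its data in `_setup()`. The base class `ProfilingTarget`, the `dataset` attribute, the `InMemoryRowsV0` class, and the `TARGETS` registry are all hypothetical names that do not come from `profiling.py`, so check `TorchMNISTV0` for the real base class and attributes before copying anything.

```python
from abc import ABC, abstractmethod


class ProfilingTarget(ABC):
    """Hypothetical stand-in for the base class used in profiling.py."""

    def __init__(self):
        self.dataset = None
        # Subclasses load their data here.
        self._setup()

    @abstractmethod
    def _setup(self):
        """Load the dataset and store it on self.dataset."""


class InMemoryRowsV0(ProfilingTarget):
    """Hypothetical target that 'loads' a small in-memory dataset."""

    def _setup(self):
        self.dataset = [(i, str(i)) for i in range(10)]


# Hypothetical registry mapping a command-line target name to its class.
TARGETS = {"in_memory_rows": InMemoryRowsV0}

target = TARGETS["in_memory_rows"]()
print(len(target.dataset))
```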
