https://github.com/ruifilipecampos/git-datasets
Declaratively create, transform, manage and version ML datasets.
- Host: GitHub
- URL: https://github.com/ruifilipecampos/git-datasets
- Owner: RuiFilipeCampos
- Created: 2023-10-09T21:48:29.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-11-03T08:32:16.000Z (over 2 years ago)
- Last Synced: 2025-02-02T01:51:16.475Z (about 1 year ago)
- Topics: ai, data-version-control, datasets, git, machine-learning
- Language: Python
- Homepage:
- Size: 120 KB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 7
Metadata Files:
- Readme: README.md
README
## What are you trying to do?
**At its core, git-datasets is an attempt at introducing a "data as code" paradigm.** Imagine being able to commit, revert, restore, pull, push, merge, resolve conflicts, open a PR and review data just as you do with code. All right from git and with minimal setup.
## How are you planning to do that?
Every dataset has an `index.py` file. The promise? A committed `index.py` always tells the truth. The consequence is that every transformation occurs on `git commit index.py`.
Let's say you want a dataset of images with classification labels. Your `index.py` could look like this:
```python
from typing import Literal

from git_datasets import dataset
from git_datasets.files import File, jpg


@dataset
class ImageClassificationDataset:
    image: File[jpg]
    label: Literal["cat", "dog", "person"]
```
First, you set this up as a dataset with `git datasets new index.py`. When you save it with `git commit index.py`, a parquet file with that schema is created. This file is kept out of view inside the `.git` folder, and when you run `git push` it is uploaded to your chosen cloud provider ([apache-libcloud](https://libcloud.apache.org) will be used to support all providers).
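In short, the intended flow looks like this (these are the commands described above; the tool itself is still a proposal):

```
git datasets new index.py   # register index.py as a dataset definition
git commit index.py         # apply the schema; a parquet file is (re)built under .git
git push                    # upload the parquet file and its blobs to the chosen cloud provider
```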
## How would you add data?
Declare a method whose return type is `Action.Insert`:
```python
from typing import Literal

from git_datasets import dataset, Action
from git_datasets.files import File, jpg


@dataset
class SegmentationDataset:
    image: File[jpg]
    label: Literal["cat", "dog", "person"]

    def get_data_from_web() -> Action.Insert[{
        "image": File[jpg],
        "label": Literal["cat", "dog", "person"],
    }]:
        ...  # perform some requests, massage data into the correct form
        return [
            (image_1, label_1),
            (image_2, label_2),
            (image_3, label_3),
            ...
        ]
```
This method is called only once, the first time it is committed.
## How would you transform data?
All data transformations will happen on commit, leaving a traceable history of everything that happened to the dataset.
For example, I might want to resize the original images and encode the label:
```python
from typing import Literal

from git_datasets import dataset, File, jpg, png


@dataset
class SegmentationDataset:
    image: File[jpg]
    label: Literal["cat", "dog", "person"]

    def image_resized_512x512(image: File[jpg]) -> File[png]:
        ...  # perform resize
        # return an instance of `File[png]`
        return file

    def encoded_label(label: Literal["cat", "dog", "person"]) -> Literal[0, 1, 2]:
        if label == "cat":
            return 0
        elif label == "dog":
            return 1
        elif label == "person":
            return 2
        else:
            raise ValueError("Not a cat, dog or person!")
```
Committing this creates a new field, `image_resized_512x512`, with type `File[png]`, and applies the transformation to populate it. The transform is only applied again if a value happens to be missing.
Additionally, multi-stage transformations are possible:
```python
from typing import Literal

from git_datasets import dataset, File, jpg, png


@dataset
class SegmentationDataset:
    image: File[jpg]
    label: Literal["cat", "dog", "person"]

    def image_resized_512x512(image: File[jpg]) -> File[png]:
        ...  # perform resize
        return file

    def encoded_label(label: Literal["cat", "dog", "person"]) -> Literal[0, 1, 2]:
        if label == "cat":
            return 0
        elif label == "dog":
            return 1
        elif label == "person":
            return 2
        else:
            raise ValueError("Not a cat, dog or person!")

    def example_field(
        image_resized_512x512: File[png],
        encoded_label: Literal[0, 1, 2],
    ) -> File[png]:
        ...  # do stuff
        return file
```
## Transformations on every commit? Sounds like it could get annoying.
A lot of transformations will be blocking if you have a large dataset. These are the attenuating factors:
1. Committing will only lock changes to the `index.py` file, letting you keep working while the processing takes place. (Live inspection during transformations will be possible via `python index.py --sql-shell`, `python index.py --python-shell`, `python index.py --jupyter-notebook`, etc.)
2. It will be possible to mark certain transformations to be skipped (with `@skip`).
3. It will also be possible to mark transformations to be consumed by a GitHub workflow (with `@cicd`).
4. For truly large datasets, an integration with Spark will be available (possibly with `@spark`, but probably a deeper integration; still researching).
```python
from typing import Literal

# the decorators below are part of the planned API described above
from git_datasets import dataset, File, jpg, png, skip, cicd, spark, parallel, thread


@dataset
class SegmentationDataset:
    image: File[jpg]
    label: Literal["cat", "dog", "person"]

    @skip
    def field_to_skip(image: File[jpg]) -> File[png]:
        ...  # perform resize
        return file

    @cicd
    def field_for_cicd(image: File[jpg]) -> File[png]:
        ...  # perform resize
        return file

    @spark
    def field_for_spark(image: File[jpg]) -> File[png]:
        ...  # perform resize
        return file

    @parallel(n=10)
    def field_in_parallel(image: File[jpg]) -> File[png]:
        ...  # perform resize
        return file

    @thread(n=10)
    def field_in_threads(image: File[jpg]) -> File[png]:
        ...  # perform resize
        return file
```
Furthermore, commits only trigger processing when there is a "delta" in the file that requires it. Adding a new transformation will trigger processing; adding a docstring will not.
Processing only occurs when:
- a new transformation is added
- new data is added
- results from a current transformation are missing
And finally, it will be possible to invert control so that only fields decorated by `@run` are processed. My only requirement is that the `index.py` file never lies.
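For illustration, the inverted-control mode might look like the sketch below. The `@run` decorator is only planned, so its import path and exact behaviour are assumptions here:

```python
from typing import Literal

from git_datasets import dataset, File, jpg, png, run  # `run` is assumed, part of the planned API


@dataset
class SegmentationDataset:
    image: File[jpg]
    label: Literal["cat", "dog", "person"]

    # with inverted control, only fields marked with @run are processed on commit
    @run
    def image_resized_512x512(image: File[jpg]) -> File[png]:
        ...  # perform resize and return a File[png]

    # stays in the file as documentation, but is not processed on commit
    def encoded_label(label: Literal["cat", "dog", "person"]) -> Literal[0, 1, 2]:
        return {"cat": 0, "dog": 1, "person": 2}[label]
```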
## Still, is the transformation on commit thing really necessary?
There are two parts to this:
1. The code that is used to execute the transformation
2. The result of a *successful* transformation
By tying these two together with a commit, **we have turned the commit into an immutable snapshot of the dataset**.
Each commit is tied to the resulting (versioned) parquet file, which itself points to any resulting files.
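One way to picture this (a hypothetical layout, not something the project has specified):

```
.git/
  datasets/
    <commit-sha>.parquet   # dataset snapshot for that commit: schema plus rows
                           # File-typed columns hold references to blobs in the bucket
```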
## Wait, but what if I want to run some code without committing?
You can run `python index.py` just fine. It will run the transformations; it just won't save any results. For example, you can make a plot:
```python
from typing import Literal

import matplotlib.pyplot as plt

from git_datasets import dataset, File, jpg, index  # `index` is part of the planned API


@dataset
class SegmentationDataset:
    image: File[jpg]
    label: Literal["cat", "dog", "person"]

    @index(12)
    def plot_some_image(image: File[jpg]) -> None:
        image_array = image.to_numpy_array()
        plt.imshow(image_array)
        plt.show()
```
## What about row transformations?
For editing individual rows you can use `Action` again:
```python
from git_datasets import dataset, File, jpg, png, Action


@dataset
class SegmentationDataset:
    image: File[jpg]
    segmentation: File[png]
    label: str

    def delete_corrupted_files(image: File[jpg]) -> Action.Delete:
        ...  # perform some checks, get image_is_corrupted: bool
        return image_is_corrupted
```
Transformations always occur once, the first time they are committed.
For more control over which rows you are iterating:
```python
# `range` and `index` are part of the planned API
from git_datasets import dataset, File, jpg, png, Action, range, index


@dataset
class SegmentationDataset:
    image: File[jpg]
    segmentation: File[png]
    label: str

    @range(1, 10, 3)
    def transformation_2(image: File[jpg]) -> Action.Delete:
        ...  # perform some checks, get image_checks_out: bool
        return not image_checks_out

    @index(10)
    def transformation_3() -> Action.Delete:
        ...  # perform some checks
        return True

    @index(11)
    def transformation_4(image: File[jpg], segmentation: File[png]) -> Action.Alter:
        ...  # get data
        return new_image, new_segmentation
```
You can also declare `None` as the return type for no action. This is useful if you want to implement some check:
```python
from git_datasets import dataset, File, jpg, png


@dataset
class SegmentationDataset:
    image: File[jpg]
    image_segmentation: File[png]

    def ensure_rgb(image: File[jpg]) -> None:
        ...  # load image
        assert image.size[2] == 3
```
## Where are files going?
Files go into the `.git` folder, but they are uploaded to your chosen cloud provider.
For example, you'd link your repository to a bucket via
```
git datasets link --provider "AWS" --bucket $AWS_BUCKET
```
You'd do this once, but credentials would be provided on a per-user basis.
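Per-user credentials could then live outside the repository, for example as environment variables that the tool reads and hands to the apache-libcloud driver; the exact mechanism is an assumption at this point:

```
# hypothetical: each collaborator supplies their own credentials locally
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
```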
## But my dataset is large, I can't fit it on my computer.
You set up a memory limit. Once that limit is reached, only a snapshot of the dataset is kept locally, and files are cycled on demand as needed.
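As a sketch of what "cycled on demand" could mean (an assumption, not the project's confirmed mechanism): files are fetched lazily, and the least recently used local copies are dropped once the byte budget is exceeded.

```python
from collections import OrderedDict
from pathlib import Path
from typing import Callable


class FileCache:
    """LRU cache of local file copies kept under a byte budget (illustrative only)."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit = limit_bytes
        self.files: OrderedDict[str, Path] = OrderedDict()
        self.used = 0

    def get(self, key: str, download: Callable[[str], Path]) -> Path:
        # reuse the local copy if present, otherwise fetch it from the cloud provider
        if key in self.files:
            self.files.move_to_end(key)
            return self.files[key]
        path = download(key)
        self.files[key] = path
        self.used += path.stat().st_size
        self._evict()
        return path

    def _evict(self) -> None:
        # drop the least recently used local copies until we are back under budget
        while self.used > self.limit and len(self.files) > 1:
            _, path = self.files.popitem(last=False)
            self.used -= path.stat().st_size
            path.unlink(missing_ok=True)
```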
## How important are the type hints?
Critical. When you commit a transformation, you get to keep it in the code without it being run again on each new commit. This is good: it serves as documentation. But if you alter the schema in such a way that the transformation no longer makes sense, the `index.py` file would be lying to you. You would end up with something like:
```python
# `index` is part of the planned API
from git_datasets import dataset, File, png, Action, index


@dataset
class SegmentationDataset:
    image_segmentation: File[png]

    @index(11)
    def transformation_4(image: File[png]) -> Action.Alter:
        ...  # get data
        return new_image
```
Note how the type hints of `transformation_4` clearly state that the transformation is applied to a field that no longer exists.
By having the type hints, I know that I need to throw an error and prevent the commit from happening.
## Is git handling the files?
No, large files are uploaded to your chosen cloud provider. Git will version the `index.py` file.
## What about scale and integrity?
I'm planning data deduplication schemes and data integrity guarantees via checksums.
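For context, a common way to get both is content-addressable storage: hash each file, use the digest as its storage key (so identical files are stored once), and re-check the digest on download. A minimal sketch of that general idea, not the project's confirmed design:

```python
import hashlib
from pathlib import Path


def content_address(path: Path) -> str:
    """SHA-256 digest of a file; identical files map to the same storage key."""
    digest = hashlib.sha256()
    with path.open("rb") as fp:
        for chunk in iter(lambda: fp.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_digest: str) -> bool:
    """Integrity check: recompute the checksum after downloading a file."""
    return content_address(path) == expected_digest
```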
## What happens when there is a merge conflict?
Merge conflicts are resolved directly in the `index.py` file.
During a merge, the state of the schema is decided (via `index.py`), the data from both commits is merged, and transformations are applied to fill any empty fields.
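As an illustration (an assumed example, not prescribed by the project), a conflict would surface as the usual git conflict markers around the disputed part of the schema, and you resolve it exactly as you would in code:

```
@dataset
class SegmentationDataset:
    image: File[jpg]
<<<<<<< HEAD
    label: Literal["cat", "dog", "person"]
=======
    label: Literal["cat", "dog", "person", "bird"]
>>>>>>> add-bird-label
```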
## What happens if someone commits without `git-datasets`?
This will happen. A merge conflict might get resolved in GitHub's interface, someone might commit without having git-datasets installed, and so on.
If the commits make sense (that is, the dataset can still be constructed as usual by following the transformations on each commit), that is what will happen.
If a commit does not make sense and generates integrity issues, it will be marked as corrupt and the `index.py` file will simply be reverted to the last non-corrupt commit via a revert commit.
`git status` will always indicate whether the `index.py` files are truthful or not. Checking out commits and branches will always issue a warning if an `index.py` file is not truthful. The last non-corrupt commit is noted in the revert commit message.
## Dependency on git? Isn't it a large learning curve, especially for someone not familiar with git?
Yes. I love git, and this is a git extension.
For someone who already uses git, this will be second nature. That is my objective, at least.
Since this is a git extension, anyone not familiar with git must learn it first!
## Examples
### Medical dataset
```python
from git_datasets import dataset, File, txt, dicom, Action


@dataset
class MedicalDiagnosisDataset:
    patient_id: str
    age: int
    weight: float
    height: float
    mri_scan: File[dicom]
    radiologist_note: File[txt]
    diagnosis: str

    # Initial method to populate the dataset from the hospital database
    def fetch_initial_data() -> Action.Insert[{
        "patient_id": str,
        "age": int,
        "weight": float,
        "height": float,
        "mri_scan": File[dicom],
        "radiologist_note": File[txt],
        "diagnosis": str,
    }]:
        ...  # fetch from a medical DB, ensuring data privacy and de-identification
        return [
            ("patient_001", 45, 70.5, 175.0, mri_1, note_1, "Benign"),
            ("patient_002", 56, 80.2, 180.0, mri_2, note_2, "Malignant"),
            ...
        ]

    # Field representing normalized MRI scans
    def normalized_mri(mri_scan: File[dicom]) -> File[dicom]:
        ...  # apply some normalization techniques on the MRI scan
        return processed_mri

    # Field representing the summarized points from the radiologist's notes
    def radiologist_key_findings(radiologist_note: File[txt]) -> str:
        ...  # use NLP techniques to extract essential points
        return findings_summary

    # Row transformation to exclude patients below a certain age
    def filter_by_age(age: int) -> Action.Delete:
        return age < 18

    # Verification that MRI scans meet certain quality criteria
    def ensure_mri_quality(mri_scan: File[dicom]) -> None:
        ...  # load the dicom file and check its properties
        assert quality_check(mri_scan)
```
## Previous Work
- https://github.com/iterative/dvc
- https://github.com/dolthub/dolt
## Important stuff
- https://spark.apache.org/
- https://parquet.apache.org/
- https://delta.io/
- https://libcloud.apache.org