{"id":16602660,"url":"https://github.com/ruifilipecampos/git-datasets","last_synced_at":"2025-10-29T13:32:11.422Z","repository":{"id":199362053,"uuid":"702719427","full_name":"RuiFilipeCampos/git-datasets","owner":"RuiFilipeCampos","description":"Declaratively create, transform, manage and version ML datasets.","archived":false,"fork":false,"pushed_at":"2023-11-03T08:32:16.000Z","size":123,"stargazers_count":4,"open_issues_count":7,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-02T01:51:16.475Z","etag":null,"topics":["ai","data-version-control","datasets","git","machine-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RuiFilipeCampos.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-10-09T21:48:29.000Z","updated_at":"2024-04-04T12:22:25.000Z","dependencies_parsed_at":"2023-11-01T14:38:15.555Z","dependency_job_id":null,"html_url":"https://github.com/RuiFilipeCampos/git-datasets","commit_stats":null,"previous_names":["ruifilipecampos/datasets"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RuiFilipeCampos%2Fgit-datasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RuiFilipeCampos%2Fgit-datasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RuiFilipeCampos%2Fgit-datasets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RuiFilipeCampos%2Fgit-datasets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
owners/RuiFilipeCampos","download_url":"https://codeload.github.com/RuiFilipeCampos/git-datasets/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238832360,"owners_count":19538273,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","data-version-control","datasets","git","machine-learning"],"created_at":"2024-10-12T00:23:10.117Z","updated_at":"2025-10-29T13:32:06.079Z","avatar_url":"https://github.com/RuiFilipeCampos.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\n\n\n\u003c!-- Improved compatibility of back to top link: See: https://github.com/othneildrew/Best-README-Template/pull/73 --\u003e\n\u003ca name=\"readme-top\"\u003e\u003c/a\u003e\n\n\n\n\u003c!-- PROJECT LOGO --\u003e\n\u003cbr/\u003e\n\u003cdiv align=\"center\"\u003e\n  \u003ca href=\"https://github.com/RuiFilipeCampos/git-datasets\"\u003e\n    \u003cimg src=\"https://github.com/RuiFilipeCampos/git-datasets/assets/63464503/e0885e59-865e-48f2-bdb5-3113102522fc\" alt=\"Logo\" width=\"260\" height=\"260\"\u003e\n  \u003c/a\u003e\n\n\n\n  \u003cp align=\"center\"\u003e\n    Declaratively create, transform, and manage ML datasets.\n    \u003cbr /\u003e\n\n  \u003c/p\u003e\n\u003c/div\u003e\n\n\u003c!-- ABOUT THE PROJECT --\u003e\n## What are you trying to do?\n\n**At its core, git-datasets is an attempt at introducing a \"data as code\" paradigm.** Imagine being able to commit, revert, restore, pull, push, merge, resolve conflicts, open a PR and review data just as you do with code. 
All right from git and with minimal setup.\n\n## How are you planning to do that ?\n\nEvery dataset has an `index.py` file. The promise? A committed `index.py` always tells the truth. The consequence is that every transformation occurs on `git commit index.py`. \n\nLet's say you want a dataset with images and object segmentations. Your `index.py` could look like this:\n\n```python\nfrom typing import Literal\nfrom git_datasets import dataset\nfrom git_datasets.files import File, jpg\n\n@dataset\nclass ImageClassificationDataset:\n    image: File[jpg]\n    label: Literal[\"cat\", \"dog\", \"person\"]\n```\n\nFirst, you set this up as a dataset using `git datasets new index.py`. When you save it with `git commit index.py`, a parquet file is created with that schema. This file is hidden out of view in the `.git` folder, and when you run `git push` it is uploaded to a chosen cloud provider ([apache-libcloud](https://libcloud.apache.org) will be used to support all providers).\n\n## How would you add data ?\n\nDeclare a method with a return type of `Action.Insert`:\n\n```python\nfrom typing import Literal\nfrom git_datasets import dataset, Action\nfrom git_datasets.files import File, jpg\n\n@dataset\nclass SegmentationDataset:\n    image: File[jpg]\n    label: Literal[\"cat\", \"dog\", \"person\"]\n\n    def get_data_from_web() -\u003e Action.Insert[{\n        \"image\": File[jpg],\n        \"label\": Literal[\"cat\", \"dog\", \"person\"],\n    }]:\n\n        ... # perform some requests, massage data into the correct form\n\n        return [\n            (image_1, label_1),\n            (image_2, label_2),\n            (image_3, label_3),\n            ...\n        ]\n```\n\nThis method is called once, the first time it is committed.\n\n\n## How would you transform data ?\n\nAll data transformations will happen on commit, leaving a traceable history of everything that happened to the dataset. 
\n\nFor example, I might want to resize the original images and encode the label:\n\n```python\nfrom typing import Literal\nfrom git_datasets import dataset, File, jpg, png\n\n\n@dataset\nclass SegmentationDataset:\n    image: File[jpg]\n    label: Literal[\"cat\", \"dog\", \"person\"]\n\n    def image_resized_512x512(image: File[jpg]) -\u003e File[png]:\n\n        ... # perform resize\n\n        # return an instance of `File[png]`\n        return file\n\n    def encoded_label(label: Literal[\"cat\", \"dog\", \"person\"]) -\u003e Literal[0, 1, 2]:\n        if label == \"cat\":\n            return 0\n        elif label == \"dog\":\n            return 1\n        elif label == \"person\":\n            return 2\n        else:\n            raise ValueError(\"Not a cat, dog or person !!\")\n\n```\n\nCommitting this results in the creation of a new field, `image_resized_512x512` with type `File[png]`, and in the application of the transformation to populate that field. This transform is only applied again if a value happens to be missing.\n\nAdditionally, multi-stage transformations are possible:\n\n\n```python\nfrom typing import Literal\nfrom git_datasets import dataset\nfrom git_datasets import File, jpg, png\n\n@dataset\nclass SegmentationDataset:\n    image: File[jpg]\n    label: Literal[\"cat\", \"dog\", \"person\"]\n\n    def image_resized_512x512(image: File[jpg]) -\u003e File[png]:\n\n        ... # perform resize\n\n        return file\n\n    def encoded_label(label: Literal[\"cat\", \"dog\", \"person\"]) -\u003e Literal[0, 1, 2]:\n        if label == \"cat\":\n            return 0\n        elif label == \"dog\":\n            return 1\n        elif label == \"person\":\n            return 2\n        else:\n            raise ValueError(\"Not a cat, dog or person !!\")\n\n    def example_field(\n        image_resized_512x512: File[png],\n        encoded_label: Literal[0, 1, 2],\n    ) -\u003e File[png]:\n\n        ... 
# do stuff\n\n        return file\n\n```\n\n## Transformations on every commit, sounds like it could get annoying.\n\n\nA lot of transformations will be blocking if you have a large dataset. These are the mitigating factors:\n\n1. Committing will only lock changes to the `index.py` file, letting you work while you wait for the processing to take place. (Live inspection during transformations will be possible via `python index.py --sql-shell` or `python index.py --python-shell` or `python index.py --jupyter-notebook`, etc.)\n2. It will be possible to mark certain transformations to be skipped (with `@skip`).\n3. It will also be possible to mark transformations to be consumed by a GitHub workflow (with `@cicd`).\n4. For truly large datasets, an integration with Spark will be available (possibly with `@spark`, but probably a deeper integration; still researching).\n\n\n\n```python\nfrom typing import Literal\nfrom git_datasets import dataset, File, jpg, png\n\n@dataset\nclass SegmentationDataset:\n    image: File[jpg]\n    label: Literal[\"cat\", \"dog\", \"person\"]\n\n    @skip\n    def field_to_skip(image: File[jpg]) -\u003e File[png]:\n\n        ... # perform resize\n\n        return file\n\n    @cicd\n    def field_for_cicd(image: File[jpg]) -\u003e File[png]:\n\n        ... # perform resize\n\n        return file\n\n    @spark\n    def field_for_spark(image: File[jpg]) -\u003e File[png]:\n\n        ... # perform resize\n\n        return file\n\n    @parallel(n=10)\n    def field_in_parallel(image: File[jpg]) -\u003e File[png]:\n\n        ... # perform resize\n\n        return file\n\n    @thread(n=10)\n    def field_in_threads(image: File[jpg]) -\u003e File[png]:\n\n        ... # perform resize\n\n        return file\n\n```\n\nFurthermore, commits only cause a transformation when there is a \"delta\" in the file that requires it.\n\nAdding a new transformation will trigger processing. Adding a docstring will not cause any transformation to occur. 
\n\nProcessing only occurs when:\n\n- a new transformation is added\n- new data is added\n- results from a current transformation are missing\n\nAnd finally, it will be possible to invert control so that only fields decorated by `@run` are processed. My only requirement is that the `index.py` file never lies. \n\n## Still, is the transformation on commit thing really necessary ? \n\nThere are two parts to this:\n\n1. The code that is used to execute the transformation\n2. The result of a *successful* transformation\n\nBy tying these two together with a commit, **we have now turned the commit into an immutable snapshot of the dataset**.\n\nEach commit is tied to the resulting (versioned) parquet file, which itself points to any resulting files.\n\n\n## Wait, but what if I want to run some code without committing ?\n\nYou can run `python index.py` just fine. It will run the transformations; it just won't save any results. For example, you can make a plot:\n\n```python\nfrom typing import Literal\nimport matplotlib.pyplot as plt\nfrom git_datasets import dataset, File, jpg\n\n@dataset\nclass SegmentationDataset:\n    image: File[jpg]\n    label: Literal[\"cat\", \"dog\", \"person\"]\n\n    @index(12)\n    def plot_some_image(image: File[jpg]) -\u003e None:\n        image_array = image.to_numpy_array()\n        plt.imshow(image_array)\n        plt.show()\n```\n\n\n## What about row transformations ?\n\nFor editing individual rows you can use `Action` again:\n\n```python\nfrom typing import Literal\nfrom git_datasets import dataset, File, jpg, png, Action\n\n@dataset\nclass SegmentationDataset:\n    image: File[jpg]\n    segmentation: File[png]\n    label: str\n\n    def delete_corrupted_files(image: File[jpg]) -\u003e Action.Delete:\n\n        ... 
# perform some checks, get image_is_corrupted: bool\n\n        return image_is_corrupted\n```\n\nTransformations always occur once, the first time they are committed.\n\nFor more control over which rows you are iterating:\n\n```python\nfrom typing import Literal\nfrom git_datasets import dataset, File, jpg, png, Action\n\n@dataset\nclass SegmentationDataset:\n    image: File[jpg]\n    segmentation: File[png]\n    label: str\n\n    @range(1, 10, 3)\n    def transformation_2(image: File[jpg]) -\u003e Action.Delete:\n\n        ... # perform some checks\n\n        return not image_checks_out\n\n    @index(10)\n    def transformation_3() -\u003e Action.Delete:\n\n        ... # perform some checks\n\n        return True\n\n    @index(11)\n    def transformation_4(image: File[jpg], segmentation: File[png]) -\u003e Action.Alter:\n\n        ... # get data\n\n        return new_image, new_segmentation\n```\n\nYou can also declare `None` as the return type for no action. This is useful if you want to implement some check:\n\n```python\nfrom typing import Literal\nfrom git_datasets import dataset, File, jpg, png\n\n@dataset\nclass SegmentationDataset:\n    image: File[jpg]\n    image_segmentation: File[png]\n\n    def ensure_rgb(image: File[jpg]) -\u003e None:\n\n        ... # load image\n\n        assert image.size[2] == 3\n\n```\n\n\n## Where are files going ?\n\nFiles go into the `.git` folder, but they are uploaded to your chosen cloud provider.\n\nFor example, you'd link your repository to a bucket via\n\n```\ngit datasets link --provider \"AWS\" --bucket $AWS_BUCKET\n```\n\nYou'd do this once, but credentials would be provided on a per-user basis.\n\n\n## But my dataset is large, I can't fit it into my computer.\n\nYou set up a memory limit. Once that limit is reached, only a snapshot of the dataset is kept. Files are cycled on demand as needed.\n\n\n## How important are the type hints ?\n\nCritical. 
When you commit a transformation you get to keep it in the code without it being run again on each new commit. This is good: it serves as documentation. But if you alter the schema in such a way that the transformation no longer makes sense, the `index.py` file would now be lying to you. You would end up with something like:\n\n```python\nfrom typing import Literal\nfrom git_datasets import dataset, File, jpg, png, Action\n\n@dataset\nclass SegmentationDataset:\n\n    image_segmentation: File[png]\n\n    @index(11)\n    def transformation_4(image: File[png]) -\u003e Action.Alter:\n\n        ... # get data\n\n        return new_image\n\n```\n\nNote how the type hints of `transformation_4` clearly state that the transformation is applied to a field that does not exist anymore.\n\nBy having the type hints, I know that I need to throw an error and prevent the commit from happening.\n\n## Is git handling the files ?\n\nNo, large files are uploaded to your chosen cloud provider. Git will version the `index.py` file. \n\n\n## What about scale and integrity? \n\nI'm planning data deduplication schemes and data integrity guarantees via checksums. \n\n\n\n\n## What happens when there is a merge conflict ?\n\nMerge conflicts are resolved directly in the `index.py` file. \n\nDuring a merge, the state of the schema is decided (via `index.py`), the data from both commits is merged, and transformations are applied to fill any empty fields.\n\n## What happens if someone commits without `git-datasets` ?\n\nThis will happen. A merge conflict might get resolved on GitHub's interface. Someone might commit without having `git-datasets` installed. Etc.\n\nIf the commits make sense, that is, the dataset can be constructed as usual by following the transformations on each commit, that's what will happen.\n\nSome commits, however, will not make sense and will generate integrity issues. 
The commits will be marked as corrupt and the `index.py` file is simply reverted to the last non-corrupt commit via a revert commit.\n\nGit status will always indicate whether the `index.py` files are truthful or not. Checking out commits and branches will always issue a warning if an `index.py` file is not truthful. The last non-corrupted commit is included in the large commit message.\n\n\n\n\n\n## Dependency on git ? Isn't it a large learning curve, especially for someone not familiar with git ?\n\nYes. I love git, and this is a git extension. \n\nFor someone who uses git, this will be second nature. That is my objective at least.\n\nSince this is a git extension, anyone not familiar with git must learn it first !\n\n## Examples\n\n### Medical dataset\n\n```python\nfrom typing import Literal\nfrom git_datasets import dataset, File, jpg, png, txt, dicom, Action\n\n@dataset\nclass MedicalDiagnosisDataset:\n    patient_id: str\n    age: int\n    weight: float\n    height: float\n    mri_scan: File[dicom]\n    radiologist_note: File[txt]\n    diagnosis: str\n\n    # Initial method to populate the dataset from the hospital database\n    def fetch_initial_data() -\u003e Action.Insert[{\n        \"patient_id\": str,\n        \"age\": int,\n        \"weight\": float,\n        \"height\": float,\n        \"mri_scan\": File[dicom],\n        \"radiologist_note\": File[txt],\n        \"diagnosis\": str,\n    }]:\n        ... # fetch from a medical DB, ensuring data privacy and de-identification\n        return [\n            (\"patient_001\", 45, 70.5, 175.0, mri_1, note_1, \"Benign\"),\n            (\"patient_002\", 56, 80.2, 180.0, mri_2, note_2, \"Malignant\"),\n            ...\n        ]\n\n    # Field representing normalized MRI scans\n    def normalized_mri(mri_scan: File[dicom]) -\u003e File[dicom]:\n        ... 
# apply some normalization techniques on the MRI scan\n        return processed_mri\n\n    # Field representing the summarized points from radiologist's notes\n    def radiologist_key_findings(radiologist_note: File[txt]) -\u003e str:\n        ... # use NLP techniques to extract essential points\n        return findings_summary\n\n    # Vertical transformation to exclude patients below a certain age\n    def filter_by_age(age: int) -\u003e Action.Delete:\n        return age \u003c 18\n\n    # Verification that MRI scans meet certain quality criteria\n    def ensure_mri_quality(mri_scan: File[dicom]) -\u003e None:\n        ... # load the dicom file and check its properties\n        assert quality_check(mri_scan)\n```\n\n\n## Previous Work\n\n- https://github.com/iterative/dvc\n- https://github.com/dolthub/dolt\n\n## Important stuff \n\n- https://spark.apache.org/\n- https://parquet.apache.org/\n- https://delta.io/\n- https://libcloud.apache.org\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fruifilipecampos%2Fgit-datasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fruifilipecampos%2Fgit-datasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fruifilipecampos%2Fgit-datasets/lists"}