{"id":34113723,"url":"https://github.com/aria-ml/dataeval","last_synced_at":"2026-04-11T01:23:32.952Z","repository":{"id":208900380,"uuid":"722722703","full_name":"aria-ml/dataeval","owner":"aria-ml","description":"Python library for analyzing data quality and its impact on model performance across classification and object-detection tasks.","archived":false,"fork":false,"pushed_at":"2026-02-10T21:39:52.000Z","size":1212447,"stargazers_count":13,"open_issues_count":1,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-02-10T22:46:33.227Z","etag":null,"topics":["ai","bias-detection","data","linter","metrics","out-of-distribution-detection","outlier-detection","sufficiency"],"latest_commit_sha":null,"homepage":"https://dataeval.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aria-ml.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-11-23T19:41:03.000Z","updated_at":"2026-02-10T21:31:15.000Z","dependencies_parsed_at":"2025-12-17T06:00:52.224Z","dependency_job_id":null,"html_url":"https://github.com/aria-ml/dataeval","commit_stats":null,"previous_names":["aria-ml/daml","aria-ml/dataeval"],"tags_count":169,"template":false,"template_full_name":null,"purl":"pkg:github/aria-ml/dataeval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aria-ml%2Fdataeval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/aria-ml%2Fdataeval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aria-ml%2Fdataeval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aria-ml%2Fdataeval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aria-ml","download_url":"https://codeload.github.com/aria-ml/dataeval/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aria-ml%2Fdataeval/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29528215,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-17T00:57:22.232Z","status":"ssl_error","status_checked_at":"2026-02-17T00:54:25.811Z","response_time":115,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","bias-detection","data","linter","metrics","out-of-distribution-detection","outlier-detection","sufficiency"],"created_at":"2025-12-14T19:18:05.156Z","updated_at":"2026-04-03T00:26:28.220Z","avatar_url":"https://github.com/aria-ml.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- markdownlint-disable MD041 --\u003e\n\n![dataeval-logo](docs/source/_static/images/DataEval_ImageText.png)\n\n\u003c!-- :auto badges: --\u003e\n\n[![PyPI - Python 
Version](https://img.shields.io/pypi/v/dataeval)](https://pypi.org/project/dataeval/)\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/dataeval)\n[![Documentation Status](https://readthedocs.org/projects/dataeval/badge/?version=latest)](https://dataeval.readthedocs.io/en/latest/?badge=latest)\n\n\u003c!-- :auto badges: --\u003e\n\n# DataEval\n\n\u003e DataEval analyzes datasets and models to give users the ability to train and\n\u003e test performant, unbiased, and reliable AI models and monitor data for\n\u003e impactful shifts to deployed models.\n\nThe `dataeval` package provides a rigorous and reliable set of tools for developing\nand analyzing computer vision datasets and the resulting impact on models.\n\nTo view our extensive collection of tutorials, how-tos, explanation guides,\nand reference material, please visit our documentation on\n**[Read the Docs](https://dataeval.readthedocs.io/)**.\n\n## Why DataEval?\n\n\u003c!-- start needs --\u003e\n\nDataEval addresses the critical need underlying every AI model -- the data.\nThe difference between a great dataset and a poor dataset can have drastic\nconsequences on AI model performance. Data collected in the wild is noisy,\noften imbalanced, and doesn't always cover the entire spectrum of conditions\nneeded for deployment. DataEval provides AI practitioners with a library of\nrigorous, algorithm-backed metrics for performance estimation, bias analysis,\ndataset cleaning and assessment, and data distribution shifts. 
Throughout\nall stages of the machine learning lifecycle -- from initial data collection\nthrough operational monitoring -- DataEval identifies data problems before\nthey become model failures.\n\nDataEval is easy to install, supports a wide range of Python versions, and is\ncompatible with many of the most popular packages in the scientific and T\u0026E\ncommunities.\n\n\u003c!-- end needs --\u003e\n\n### Target Audience\n\n\u003c!-- start JATIC interop --\u003e\n\nDataEval is intended to help data scientists, developers, and T\u0026E engineers\nwho want to evaluate and enhance their datasets for optimum performance. For\nusers of the JATIC product suite, DataEval has native interoperability when\nusing MAITE-compliant datasets and models.\n\n\u003c!-- end JATIC interop --\u003e\n\n---\n\n## Getting Started\n\n**Python versions:** 3.10 - 3.14\n\nChoose your preferred method of installation below or follow our\n[installation guide](docs/source/getting-started/installation.md).\n\n- [Installing with pip](#installing-with-pip)\n- [Installing with conda/mamba](#installing-with-conda)\n- [Installing from GitHub](#installing-from-github)\n\n### **Installing with pip**\n\nYou can install DataEval directly from pypi.org using the following command.\n\n```bash\npip install dataeval\n```\n\n### **Installing with conda**\n\nDataEval can be installed in a Conda/Mamba environment using the provided\n`environment.yml` file. 
As some dependencies are installed from the `pytorch`\nchannel, the channel is specified in the example below.\n\n```bash\nmicromamba create -f environment/environment.yml -c pytorch\n```\n\n### **Installing from GitHub**\n\nTo install DataEval from source locally on Ubuntu, pull the source down and\nchange to the DataEval project directory.\n\n```bash\ngit clone https://github.com/aria-ml/dataeval.git\ncd dataeval\n```\n\n#### **Using Poetry**\n\nInstall DataEval.\n\n```bash\npoetry install\n```\n\nEnable Poetry's virtual environment.\n\n```bash\npoetry env activate\n```\n\n#### **Using uv**\n\nInstall DataEval with dependencies for development.\n\n```bash\nuv sync\n```\n\nEnable uv's virtual environment.\n\n```bash\nsource .venv/bin/activate\n```\n\n### Working with data\n\nDataEval has two input paths depending on which part of the library you are using.\n\n**`dataeval.core`** provides stateless functions that operate directly on NumPy\narrays: embeddings, labels, image hashes, and statistics. No dataset object is\nrequired. Call these functions with arrays and get results back directly. Examples\ninclude `compute_stats`, `label_errors`, `divergence_mst`, and `ber_knn`.\n\n**`dataeval.quality`, `dataeval.bias`, `dataeval.shift`, and `dataeval.performance`**\nprovide stateful evaluator classes (`Duplicates`, `Outliers`,\n`Prioritize`, `Balance`, drift detectors, and so on). These\naccept either NumPy arrays or [Modular AI Trustworthy Engineering\n(MAITE)](https://github.com/mit-ll-ai-technology/maite)-compliant datasets depending on the evaluator.\n\nIf your data is not yet in MAITE format, the sections below show what is\nrequired and how to wrap a common format, for both image classification and\nobject detection tasks.\n\n#### Image classification dataset\n\nA MAITE-compliant image classification dataset implements `__len__` and\n`__getitem__`, where each item is a tuple of `(image, label, metadata)`.\nImages must be NumPy arrays of shape `(H, W, C)`. 
Labels must be one-hot\nencoded arrays of shape `(num_classes,)`. Metadata must be a `DatumMetadata`\nobject with at minimum an `id` field.\n\n```python\nimport maite.protocols as mp\nimport maite.protocols.image_classification as ic\nimport numpy as np\n\n\nclass MyImageClassificationDataset(ic.Dataset):\n    metadata: mp.DatasetMetadata\n\n    def __init__(self, images: list[np.ndarray], labels: list[int], num_classes: int) -\u003e None:\n        # images: list of np.ndarray, each shape (H, W, C)\n        # labels: list of int (class indices)\n        self._images = images\n        self._labels = labels\n        self._num_classes = num_classes\n\n        self.metadata = mp.DatasetMetadata(\n            id=\"my_image_classification_dataset\",\n            index2label={i: f\"class_{i}\" for i in np.unique(labels)},  # example mapping\n        )\n\n    def __len__(self) -\u003e int:\n        return len(self._images)\n\n    def __getitem__(self, idx: int) -\u003e tuple[ic.InputType, ic.TargetType, ic.DatumMetadataType]:\n        return (\n            self._images[idx],  # np.ndarray (H, W, C)\n            np.eye(self._num_classes, dtype=np.float32)[self._labels[idx]],  # np.ndarray (num_classes,)\n            ic.DatumMetadataType(id=idx),\n        )\n```\n\n#### Object detection dataset\n\nA MAITE-compliant object detection dataset follows the same three-tuple\nstructure, but the label element is replaced by a detection target object\ncarrying per-box labels, bounding boxes, and scores. Bounding boxes use\n`(x0, y0, x1, y1)` format. 
Labels and scores are per-box, not per-image.\n\n```python\nimport maite.protocols as mp\nimport maite.protocols.object_detection as od\nimport numpy as np\n\n\nclass DetectionTarget(od.TargetType):\n    \"\"\"Holds per-box labels, boxes, and one-hot scores for one image.\"\"\"\n\n    def __init__(self, labels: list[int], boxes: list[list[float]], num_classes: int):\n        # labels: list of int, one per box\n        # boxes:  list of [x0, y0, x1, y1], one per box\n        self._labels = labels\n        self._boxes = boxes\n        self._scores = np.eye(num_classes)[labels]\n\n    @property\n    def labels(self) -\u003e mp.ArrayLike:\n        return self._labels\n\n    @property\n    def boxes(self) -\u003e mp.ArrayLike:\n        return self._boxes\n\n    @property\n    def scores(self) -\u003e mp.ArrayLike:\n        return self._scores\n\n\nclass MyObjectDetectionDataset(od.Dataset):\n    def __init__(\n        self, images: list[np.ndarray], labels: list[list[int]], boxes: list[list[list[float]]], num_classes: int\n    ) -\u003e None:\n        # images: list of np.ndarray, each shape (H, W, C)\n        # labels: list of list[int], per-box class indices, one list per image\n        # boxes:  list of list[[x0,y0,x1,y1]], one list per image\n        self._images = images\n        self._labels = labels\n        self._boxes = boxes\n        self._num_classes = num_classes\n\n        self.metadata = mp.DatasetMetadata(\n            id=\"my_object_detection_dataset\",\n            # example mapping; range() is used because per-image label lists can differ in length\n            index2label={i: f\"class_{i}\" for i in range(num_classes)},\n        )\n\n    def __len__(self) -\u003e int:\n        return len(self._images)\n\n    def __getitem__(self, idx: int) -\u003e tuple[od.InputType, od.TargetType, od.DatumMetadataType]:\n        return (\n            self._images[idx],  # np.ndarray (H, W, C)\n            DetectionTarget(self._labels[idx], self._boxes[idx], self._num_classes),\n            od.DatumMetadataType(id=idx),\n        )\n```\n\n#### 
Wrapping a PyTorch dataset\n\nIf your data is in a PyTorch `Dataset`, wrap it to conform to the MAITE\nprotocol. Note that `torchvision` tensors are `(C, H, W)`; permute to\n`(H, W, C)` before passing to DataEval.\n\n```python\nimport maite.protocols as mp\nimport maite.protocols.image_classification as ic\nimport numpy as np\nimport torch\nfrom torchvision import transforms\nfrom torchvision.datasets import CIFAR10\n\ntv_cifar10 = CIFAR10(root=\"./data\", train=True, download=True, transform=transforms.ToTensor())\n\n\nclass MyCIFAR10Wrapper(ic.Dataset):\n    def __init__(self, source: CIFAR10) -\u003e None:\n        self._source = source\n        self.metadata = mp.DatasetMetadata(\n            id=\"tv_cifar10\",\n            index2label={\n                0: \"airplane\",\n                1: \"automobile\",\n                2: \"bird\",\n                3: \"cat\",\n                4: \"deer\",\n                5: \"dog\",\n                6: \"frog\",\n                7: \"horse\",\n                8: \"ship\",\n                9: \"truck\",\n            },\n        )\n\n    def __len__(self) -\u003e int:\n        return len(self._source)  # Use the wrapped dataset, not the module-level variable\n\n    def __getitem__(self, idx: int) -\u003e tuple[ic.InputType, ic.TargetType, ic.DatumMetadataType]:\n        tv_datum: tuple[torch.Tensor, int] = self._source[idx]\n        image = tv_datum[0].permute(1, 2, 0).numpy()  # Permute image from (C, H, W) to (H, W, C)\n        label = np.eye(10, dtype=np.float32)[tv_datum[1]]  # Convert label to one-hot encoding\n        return image, label, mp.DatumMetadata(id=idx)\n\n\ndataset: ic.Dataset = MyCIFAR10Wrapper(tv_cifar10)\n```\n\n### Run your first evaluation\n\nThe example below uses `Duplicates` from `dataeval.quality` to detect\nnear-duplicate images by grouping images whose embeddings lie close together in\nembedding space. 
Duplicates inflate benchmark scores and cause models to overfit\nto repeated collection events rather than generalizing to new conditions.\n\n```python\nfrom torch.nn import Flatten\n\nfrom dataeval.extractors import TorchExtractor\nfrom dataeval.flags import ImageStats\nfrom dataeval.quality import Duplicates\n\n# Configure a feature extractor using a pre-trained PyTorch model.\n# Here we use a simple Flatten layer for demonstration, but in practice\n# you would use a more powerful model like a pre-trained ResNet or ViT.\nextractor = TorchExtractor(Flatten())\n\n# Find near-duplicates using only embedding-based clustering.\n# An aggressive cluster_threshold of 1.5 should produce detections\n# of near duplicates even with a simple Flatten extractor.\nevaluator = Duplicates(\n    flags=ImageStats.NONE,\n    cluster_algorithm=\"hdbscan\",\n    cluster_threshold=1.5,\n    extractor=extractor,\n    batch_size=64,\n)\nresult = evaluator.evaluate(dataset)\n\n# Near duplicates are grouped into sets of indices that are within\n# the specified cluster_threshold in embedding space.\nprint(result)\n```\n\n```text\nshape: (3, 5)\n┌──────────┬───────┬──────────┬────────────────┬─────────────┐\n│ group_id ┆ level ┆ dup_type ┆ item_indices   ┆ methods     │\n│ ---      ┆ ---   ┆ ---      ┆ ---            ┆ ---         │\n│ i64      ┆ str   ┆ str      ┆ list[i64]      ┆ list[str]   │\n╞══════════╪═══════╪══════════╪════════════════╪═════════════╡\n│ 0        ┆ item  ┆ near     ┆ [18586, 39942] ┆ [\"cluster\"] │\n│ 1        ┆ item  ┆ near     ┆ [23157, 31426] ┆ [\"cluster\"] │\n│ 2        ┆ item  ┆ near     ┆ [32024, 49135] ┆ [\"cluster\"] │\n└──────────┴───────┴──────────┴────────────────┴─────────────┘\n```\n\nA result with many large groups is a signal that your dataset contains\nrepeated collection events. Before training, remove all but one sample from\neach group. 
See the [deduplication how-to guide](./docs/source/notebooks/h2_deduplicate.py)\nfor a complete walkthrough, including how to choose which sample to keep.\n\n### Where to go next\n\nNot sure what to evaluate first? Use the [Which tool should I use?](./docs/source/getting-started/which-tool.md)\nguide to find the right evaluator for your situation.\n\nAlready know which tool to use? Check out the [Functional Overview](./docs/source/reference/FunctionalOverview.md)\nfor a quick-reference table of each algorithm's inputs, outputs, and task applicability.\n\nWant to just explore the documentation? The [Where to go next](./docs/source/getting-started/where-to-go-next.md)\npage lets you jump between the different areas of the documentation, with a short summary of what each page covers.\n\n---\n\n## Contact Us\n\nIf you have any questions, feel free to reach out to [us](mailto:dataeval@ariacoustics.com)!\n\n## Acknowledgement\n\n### CDAO Funding Acknowledgement\n\n\u003c!-- start acknowledgement --\u003e\n\nThis material is based upon work supported by the Chief Digital and Artificial\nIntelligence Office under Contract No. W519TC-23-9-2033. The views and\nconclusions contained herein are those of the author(s) and should not be\ninterpreted as necessarily representing the official policies or endorsements,\neither expressed or implied, of the U.S. Government.\n\n\u003c!-- end acknowledgement --\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faria-ml%2Fdataeval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faria-ml%2Fdataeval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faria-ml%2Fdataeval/lists"}