https://github.com/cleanlab/cleanvision
Automatically find issues in image datasets and practice data-centric computer vision.
- Host: GitHub
- URL: https://github.com/cleanlab/cleanvision
- Owner: cleanlab
- License: agpl-3.0
- Created: 2022-05-26T07:14:11.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2025-04-03T05:19:12.000Z (12 months ago)
- Last Synced: 2025-04-09T22:09:26.453Z (12 months ago)
- Topics: computer-vision, data-centric-ai, data-exploration, data-profiling, data-quality, data-science, data-validation, deep-learning, exploratory-data-analysis, image-analysis, image-classification, image-generation, image-quality, image-segmentation
- Language: Python
- Homepage: https://cleanvision.readthedocs.io/
- Size: 2.12 MB
- Stars: 1,068
- Watchers: 16
- Forks: 73
- Open Issues: 30
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

CleanVision automatically detects potential issues in image datasets, such as images that are blurry, under/over-exposed, (near) duplicates, etc.
This data-centric AI package is a quick first step for any computer vision project to find problems in the dataset, which you want to address before applying machine learning.
CleanVision is super simple -- run the same couple of lines of Python code to audit any image dataset!
## Installation
```shell
pip install cleanvision
```
## Quickstart
Download an example dataset (optional), or just use any collection of image files you have. The download is a zip archive, so extract it before pointing CleanVision at the folder.
```shell
wget -nc 'https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip'
```
1. Run CleanVision to audit the images.
```python
from cleanvision import Imagelab
# Specify path to folder containing the image files in your dataset
imagelab = Imagelab(data_path="FOLDER_WITH_IMAGES/")
# Automatically check for a predefined list of issues within your dataset
imagelab.find_issues()
# Produce a neat report of the issues found in your dataset
imagelab.report()
```
2. CleanVision diagnoses many types of issues, but you can also check for only specific issues.
```python
issue_types = {"dark": {}, "blurry": {}}
imagelab.find_issues(issue_types=issue_types)
# Produce a report with only the specified issue_types
imagelab.report(issue_types=issue_types)
```
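The two steps above can be wrapped in a small helper. The sketch below is illustrative rather than part of CleanVision: `build_issue_types` and `audit_folder` are hypothetical names, the issue keys are copied from the issue table in this README, and the `issue_summary` and `issues` attributes are assumed to be available on `Imagelab` after `find_issues()` (check the documentation for your installed version).

```python
# Issue keys as listed in the issue table of this README
KNOWN_ISSUE_KEYS = {
    "exact_duplicates", "near_duplicates", "blurry", "low_information",
    "dark", "light", "grayscale", "odd_aspect_ratio", "odd_size",
}


def build_issue_types(keys):
    """Build the {issue_key: {}} dict passed to find_issues(), rejecting typos."""
    unknown = sorted(set(keys) - KNOWN_ISSUE_KEYS)
    if unknown:
        raise ValueError(f"Unknown issue keys: {unknown}")
    # One empty dict per key, mirroring the form used in step 2
    return {key: {} for key in keys}


def audit_folder(data_path, keys=None):
    """Run the selected checks on a folder and return the result tables (hypothetical helper)."""
    # Imported here so the helpers above can be defined even where cleanvision is not installed
    from cleanvision import Imagelab

    imagelab = Imagelab(data_path=data_path)
    # With no keys, pass None; assumed equivalent to calling find_issues() with no arguments
    issue_types = build_issue_types(keys) if keys else None
    imagelab.find_issues(issue_types=issue_types)
    # Assumed attributes: a per-issue-type summary and per-image results
    return imagelab.issue_summary, imagelab.issues
```

For example, `summary, per_image = audit_folder("FOLDER_WITH_IMAGES/", ["dark", "blurry"])` would run only the two checks from step 2. Validating the keys up front means a misspelled key such as `"blury"` fails with a clear error before any images are processed.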
## More resources
- [Tutorial](https://cleanvision.readthedocs.io/en/latest/tutorials/tutorial.html)
- [Documentation](https://cleanvision.readthedocs.io/)
- [Blog](https://cleanlab.ai/blog/cleanvision/)
- [Run CleanVision on a HuggingFace dataset](https://cleanvision.readthedocs.io/en/latest/tutorials/huggingface_dataset.html)
- [Run CleanVision on a Torchvision dataset](https://cleanvision.readthedocs.io/en/latest/tutorials/torchvision_dataset.html)
- [Example script](https://github.com/cleanlab/cleanvision/blob/main/docs/source/tutorials/run.py) that can be run with: `python examples/run.py --path `
- [Additional example notebooks](https://github.com/cleanlab/cleanvision-examples)
- [FAQ](https://cleanvision.readthedocs.io/en/latest/faq.html)
## *Clean* your data for better Computer *Vision*
The quality of machine learning models hinges on the quality of the data used to train them, but it is hard to manually identify all of the low-quality data in a big dataset. CleanVision helps you automatically identify common types of data issues lurking in image datasets.
This package currently detects issues in the raw images themselves, making it a useful tool for any computer vision
task such as: classification, segmentation, object detection, pose estimation, keypoint detection, [generative modeling](https://openai.com/research/dall-e-2-pre-training-mitigations), etc.
To detect issues in the labels of your image data, you can instead
use the [cleanlab](https://github.com/cleanlab/cleanlab/) package.
In any collection of image files (most [formats](https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html) supported), CleanVision can detect the following types of issues:
| | Issue Type | Description | Issue Key | Example |
|---|------------------|-----------------------------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| 1 | Exact Duplicates | Images that are identical to each other | exact_duplicates |  |
| 2 | Near Duplicates | Images that are visually almost identical | near_duplicates |  |
| 3 | Blurry | Images where details are fuzzy (out of focus) | blurry |  |
| 4 | Low Information | Images lacking content (little entropy in pixel values) | low_information |  |
| 5 | Dark | Irregularly dark images (*under*exposed) | dark |  |
| 6 | Light | Irregularly bright images (*over*exposed) | light |  |
| 7 | Grayscale | Images lacking color | grayscale |  |
| 8 | Odd Aspect Ratio | Images with an unusual aspect ratio (overly skinny/wide) | odd_aspect_ratio |  |
| 9 | Odd Size | Images that are abnormally large or small compared to the rest of the dataset | odd_size | |
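
To make two of the descriptions in the table concrete, here is a minimal standard-library sketch. It is not CleanVision's implementation, and both function names are hypothetical. `group_exact_duplicates` compares raw file bytes (CleanVision may compare decoded image content instead), and `pixel_entropy` computes the Shannon entropy of a flat sequence of pixel intensities, the quantity that is low for "low information" images.

```python
import hashlib
import math
from collections import Counter, defaultdict
from pathlib import Path


def group_exact_duplicates(paths):
    """Group files whose raw bytes are identical (illustration only)."""
    groups = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        groups[digest].append(str(path))
    # Keep only digests shared by two or more files
    return [sorted(group) for group in groups.values() if len(group) > 1]


def pixel_entropy(pixel_values):
    """Shannon entropy, in bits, of a flat sequence of pixel intensities."""
    counts = Counter(pixel_values)
    total = sum(counts.values())
    # A constant image has a single intensity with probability 1, so its entropy is zero
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A single-color image gives an entropy of zero, while an image whose pixels are split evenly between two intensities gives one bit. Any cutoff for flagging an image as "low information" would be a separate choice that this sketch does not make.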
CleanVision supports Linux, macOS, and Windows and runs on Python 3.10+. Learn more from our [blog](https://cleanlab.ai/blog/cleanvision/).
## Community
* Interested in contributing? See the [contributing guide](CONTRIBUTING.md). An easy starting point is to
consider [issues](https://github.com/cleanlab/cleanvision/labels/good%20first%20issue) marked `good first issue`.
* Ready to start adding your own code? See the [development guide](DEVELOPMENT.md).
* Have an issue? [Search existing issues](https://github.com/cleanlab/cleanvision/issues?q=is%3Aissue)
or [submit a new issue](https://github.com/cleanlab/cleanvision/issues/new/choose).
[issue]: https://github.com/cleanlab/cleanvision/issues/new