https://github.com/cleanlab/cleanvision
Automatically find issues in image datasets and practice data-centric computer vision.
computer-vision data-centric-ai data-exploration data-profiling data-quality data-science data-validation deep-learning exploratory-data-analysis image-analysis image-classification image-generation image-quality image-segmentation
- Host: GitHub
- URL: https://github.com/cleanlab/cleanvision
- Owner: cleanlab
- License: agpl-3.0
- Created: 2022-05-26T07:14:11.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-04-23T22:56:02.000Z (7 months ago)
- Last Synced: 2024-10-01T20:52:02.615Z (about 1 month ago)
- Topics: computer-vision, data-centric-ai, data-exploration, data-profiling, data-quality, data-science, data-validation, deep-learning, exploratory-data-analysis, image-analysis, image-classification, image-generation, image-quality, image-segmentation
- Language: Python
- Homepage: https://cleanvision.readthedocs.io/
- Size: 2.11 MB
- Stars: 1,009
- Watchers: 16
- Forks: 68
- Open Issues: 30
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
CleanVision automatically detects potential issues in image datasets, such as images that are blurry, under/over-exposed, or (near) duplicates.
This data-centric AI package is a quick first step for any computer vision project to find problems in the dataset, which you want to address before applying machine learning.
CleanVision is super simple -- run the same couple lines of Python code to audit any image dataset!

[![Read the Docs](https://readthedocs.org/projects/cleanvision/badge/?version=latest)](https://cleanvision.readthedocs.io/en/latest/)
[![pypi](https://img.shields.io/pypi/v/cleanvision?color=blue)](https://pypi.org/pypi/cleanvision/)
[![os](https://img.shields.io/badge/platform-noarch-lightgrey)](https://pypi.org/pypi/cleanvision/)
[![py\_versions](https://img.shields.io/badge/python-3.7%2B-blue)](https://pypi.org/pypi/cleanvision/)
[![codecov](https://codecov.io/github/cleanlab/cleanvision/branch/main/graph/badge.svg?token=y1N6MluN9H)](https://codecov.io/gh/cleanlab/cleanvision)
[![Slack Community](https://img.shields.io/static/v1?logo=slack&style=flat&color=white&label=slack&message=community)](https://cleanlab.ai/slack)
[![Twitter](https://img.shields.io/twitter/follow/CleanlabAI?style=social)](https://twitter.com/CleanlabAI)
[![Cleanlab Studio](https://raw.githubusercontent.com/cleanlab/assets/master/shields/cl-studio-shield.svg)](https://cleanlab.ai/studio/?utm_source=github&utm_medium=readme&utm_campaign=clostostudio)

## Installation

```shell
pip install cleanvision
```

## Quickstart

Download an example dataset (optional). Or just use any collection of image files you have.

```shell
wget -nc 'https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip'
```

1. Run CleanVision to audit the images.

```python
from cleanvision import Imagelab

# Specify path to folder containing the image files in your dataset
imagelab = Imagelab(data_path="FOLDER_WITH_IMAGES/")

# Automatically check for a predefined list of issues within your dataset
imagelab.find_issues()

# Produce a neat report of the issues found in your dataset
imagelab.report()
```

2. CleanVision diagnoses many types of issues, but you can also check for only specific issues.

```python
issue_types = {"dark": {}, "blurry": {}}

imagelab.find_issues(issue_types=issue_types)

# Produce a report with only the specified issue_types
imagelab.report(issue_types=issue_types)
```

## More resources on how to use CleanVision

- [Tutorial](https://cleanvision.readthedocs.io/en/latest/tutorials/tutorial.html)
- [Run CleanVision on a HuggingFace dataset](https://cleanvision.readthedocs.io/en/latest/tutorials/huggingface_dataset.html)
- [Run CleanVision on a Torchvision dataset](https://cleanvision.readthedocs.io/en/latest/tutorials/torchvision_dataset.html)
- [Example script](https://github.com/cleanlab/cleanvision/blob/main/docs/source/tutorials/run.py) that can be run with: `python examples/run.py --path `
- [Additional example notebooks](https://github.com/cleanlab/cleanvision-examples)
- [Documentation](https://cleanvision.readthedocs.io/)
- [Blog Post](https://cleanlab.ai/blog/cleanvision/)
- [FAQ](https://cleanvision.readthedocs.io/en/latest/faq.html)

## *Clean* your data for better Computer *Vision*

The quality of machine learning models hinges on the quality of the data used to train them, but it is hard to manually identify all of the low-quality data in a big dataset. CleanVision helps you automatically identify common types of data issues lurking in image datasets.
This package currently detects issues in the raw images themselves, making it a useful tool for any computer vision
task such as: classification, segmentation, object detection, pose estimation, keypoint detection, [generative modeling](https://openai.com/research/dall-e-2-pre-training-mitigations), etc.
To detect issues in the labels of your image data, you can instead
use the [cleanlab](https://github.com/cleanlab/cleanlab/) package.

In any collection of image files (most [formats](https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html) supported), CleanVision can detect the following types of issues:

| | Issue Type | Description | Issue Key | Example |
|---|------------------|-----------------------------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| 1 | Exact Duplicates | Images that are identical to each other | exact_duplicates | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/exact_duplicates.png) |
| 2 | Near Duplicates | Images that are visually almost identical | near_duplicates | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/near_duplicates.png) |
| 3 | Blurry | Images where details are fuzzy (out of focus) | blurry | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/blurry.png) |
| 4 | Low Information | Images lacking content (little entropy in pixel values) | low_information | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/low_information.png) |
| 5 | Dark | Irregularly dark images (*under*exposed) | dark | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/dark.jpg) |
| 6 | Light | Irregularly bright images (*over*exposed) | light | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/light.jpg) |
| 7 | Grayscale | Images lacking color | grayscale | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/grayscale.jpg) |
| 8 | Odd Aspect Ratio | Images with an unusual aspect ratio (overly skinny/wide) | odd_aspect_ratio | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/odd_aspect_ratio.jpg) |
| 9 | Odd Size | Images that are abnormally large or small compared to the rest of the dataset | odd_size | |

CleanVision supports Linux, macOS, and Windows and runs on Python 3.7+.

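To give a feel for what a few of the checks in the table above look for, here is a minimal Python sketch of three of them using only the standard library. It is an illustration under stated assumptions, not CleanVision's actual implementation: the helper names (`exact_duplicate_groups`, `pixel_entropy`, `aspect_ratio_score`) are hypothetical, duplicates are found by hashing raw file bytes, and pixels are passed in as plain lists of integers rather than decoded image files.

```python
import hashlib
import math
from collections import Counter, defaultdict


def exact_duplicate_groups(files):
    """Group names whose raw bytes share a SHA-256 digest.

    `files` maps a name to that file's bytes. Hypothetical helper for
    illustration only; this is not part of the CleanVision API.
    """
    groups = defaultdict(list)
    for name, data in files.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    # Only digests shared by two or more files indicate duplicates.
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


def pixel_entropy(pixels):
    """Shannon entropy (in bits) of a flat, non-empty sequence of pixel values.

    Low entropy means few distinct values, i.e. little information.
    """
    total = len(pixels)
    counts = Counter(pixels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def aspect_ratio_score(width, height):
    """Score in (0, 1]: 1.0 is square, values near 0 are very skinny or wide."""
    return min(width / height, height / width)


# Tiny made-up inputs, standing in for real image files.
files = {"a.png": b"\x00\x01", "b.png": b"\x00\x01", "c.png": b"\xff"}
print(exact_duplicate_groups(files))  # [['a.png', 'b.png']]
print(pixel_entropy([0] * 16))        # 0.0 (a single flat color)
print(aspect_ratio_score(1000, 50))   # 0.05 (a very wide image)
```

A byte-level hash only flags files whose contents match exactly, so the same picture saved in two formats would not be grouped. Any cutoff applied to the entropy or aspect-ratio score is a choice left to the reader; the sketch does not assume the thresholds CleanVision uses.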
## Join our community
* The best place to learn is [our Slack community](https://cleanlab.ai/slack). Join the discussion there to see how
folks are using this library, discuss upcoming features, or ask for private support.
* Need professional help with CleanVision? Join our [\#help Slack channel](https://cleanlab.ai/slack) and message us there, or reach out via email: [email protected]
* Interested in contributing? See the [contributing guide](CONTRIBUTING.md). An easy starting point is to
consider [issues](https://github.com/cleanlab/cleanvision/labels/good%20first%20issue) marked `good first issue` or
simply reach out in [Slack](https://cleanlab.ai/slack). We welcome your help building a standard open-source library
for data-centric computer vision!
* Ready to start adding your own code? See the [development guide](DEVELOPMENT.md).
* Have an issue? [Search existing issues](https://github.com/cleanlab/cleanvision/issues?q=is%3Aissue)
or [submit a new issue](https://github.com/cleanlab/cleanvision/issues/new/choose).
* Have ideas for the future of data-centric computer vision? Check
out [our active/planned Projects and what we could use your help with](https://github.com/cleanlab/cleanvision/projects).

## License

Copyright (c) 2022 Cleanlab Inc.
cleanvision is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public
License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later
version.

cleanvision is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See [GNU Affero General Public LICENSE](https://github.com/cleanlab/cleanvision/blob/main/LICENSE) for details.

Commercial licensing is available for enterprise teams that want to use CleanVision in production workflows, but are unable to open-source their code [as is required by the current license](https://github.com/cleanlab/cleanvision/blob/main/LICENSE). Please email us: [email protected]
[issue]: https://github.com/cleanlab/cleanvision/issues/new