https://github.com/visual-layer/visuallayer

Simplify Your Visual Data Ops. Find and visualize issues with your computer vision datasets such as duplicates, anomalies, data leakage, mislabels and others.

cleaning computer computer-vision data data-science dataset datasets-preparation generative machine-learning python vision

Last synced: 10 months ago

Host: GitHub
URL: https://github.com/visual-layer/visuallayer
Owner: visual-layer
License: apache-2.0
Created: 2023-04-04T06:30:34.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2023-09-27T09:50:45.000Z (over 2 years ago)
Last Synced: 2025-03-29T08:11:33.324Z (11 months ago)
Topics: cleaning, computer, computer-vision, data, data-science, dataset, datasets-preparation, generative, machine-learning, python, vision
Language: Jupyter Notebook
Homepage: https://www.visual-layer.com/
Size: 90.1 MB
Stars: 67
Watchers: 7
Forks: 2
Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE

README

[![PyPi][pypi-shield]][pypi-url]
[![PyPi][pypiversion-shield]][pypi-url]
[![PyPi][downloads-shield]][downloads-url]
[![License][license-shield]][license-url]
[![TestedOn][testedon-shield]][pypi-url]

[pypi-shield]: https://img.shields.io/badge/Python-3.7%20--%203.11-blue?style=for-the-badge
[pypi-url]: https://pypi.org/project/visuallayer/
[pypiversion-shield]: https://img.shields.io/pypi/v/visuallayer?style=for-the-badge
[downloads-shield]: https://img.shields.io/badge/dynamic/json?style=for-the-badge&label=downloads&query=%24.total_downloads&url=https%3A%2F%2Fapi.pepy.tech%2Fapi%2Fv2%2Fprojects%2Fvisuallayer&color=lightblue
[downloads-url]: https://pypi.org/project/visuallayer/

[license-shield]: https://img.shields.io/badge/License-Apache%202.0-purple.svg?style=for-the-badge
[license-url]: https://github.com/visual-layer/visuallayer/blob/main/LICENSE
[testedon-shield]: https://img.shields.io/badge/Tested%20on-Ubuntu--22.04%20%7C%20MacOS--10.16%20Intel%20%7C%20Windows%2010-brightgreen?style=for-the-badge

Unleash the Full Power of Your Visual Data

Explore the docs »

Report Issues · Read Blog · Get In Touch · About Us

💫 Check out VL Datasets release blog post

## Description
`visuallayer SDK` is an open-source Python package that offers access and extensibility to the Visual Layer platform from your code.

While the platform offers a high-level overview and visualization of your data, the SDK affords you the flexibility to integrate into your favorite machine learning frameworks and environments (e.g. Jupyter Notebook) using Python.

## Installation

The easiest way to use the `visuallayer SDK` is to install it from [PyPI](https://pypi.org/project/visuallayer/). On your machine, run:

```shell
pip install visuallayer
```

Optionally, you can install the bleeding-edge version from [GitHub](https://github.com/visual-layer/visuallayer) by running:

```shell
pip install git+https://github.com/visual-layer/visuallayer.git@main --upgrade
```
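To confirm which version ended up in your environment, you can query the package metadata from Python. This is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is ours and is not part of the SDK.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version if the SDK is installed, otherwise a short notice.
print(installed_version("visuallayer") or "visuallayer is not installed")
```

The helper returns `None` instead of raising, so it can be used in a notebook cell without interrupting execution when the package is missing.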

## VL Datasets
The `visuallayer SDK` also lets you access [VL Datasets](https://docs.visual-layer.com/docs/what-are-vl-datasets) - a collection of clean versions of widely used computer vision datasets.

For example, with only two lines of code, you can load the clean VL Datasets version of the [ImageNet-1k](https://www.image-net.org/) dataset:
```python
import visuallayer as vl
dataset = vl.datasets.zoo.load('vl-imagenet-1k')

# Export to PyTorch
train_dataset = dataset.export(output_format='pytorch', split='train')

# PyTorch training loop
```

> **Note**: `visuallayer` does not automatically download the ImageNet dataset. Make sure you obtain usage rights to the dataset and download it into your current working directory first.

When we say "clean", we mean that the datasets loaded by the `visuallayer SDK` have had common issues flagged and removed, such as [duplicates](https://docs.visual-layer.com/docs/duplicate-imagesobjects), [mislabels](https://docs.visual-layer.com/docs/mislabeled-imagesobjects), [outliers](https://docs.visual-layer.com/docs/outlier-imagesobjects),
[dark](https://docs.visual-layer.com/docs/blurry-imagesobjects-copy)/[bright](https://docs.visual-layer.com/docs/dark-imagesobjects-copy)/
[blurry](https://docs.visual-layer.com/docs/outlier-imagesobjects-copy) images and data leakage.
See the full description of the supported issue types in our [documentation](https://docs.visual-layer.com/docs/mislabeled-imagesobjects).
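To make two of these issue types concrete, here is a deliberately naive sketch that finds exact duplicates and train/test leakage by hashing file contents. It is not how Visual Layer detects issues. It only catches byte-identical files, and it assumes the common reading of "leakage" as the same image appearing in both the train and test splits. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    """SHA-256 digest of the raw bytes, used as an exact-match fingerprint."""
    return hashlib.sha256(data).hexdigest()


def find_exact_duplicates(images):
    """Group image ids whose bytes are identical. `images` maps id -> bytes."""
    groups = defaultdict(list)
    for image_id, data in images.items():
        groups[content_hash(data)].append(image_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


def find_leakage(train, test):
    """Return the test ids whose bytes also appear somewhere in the train split."""
    train_hashes = {content_hash(data) for data in train.values()}
    return sorted(i for i, data in test.items() if content_hash(data) in train_hashes)


# Toy byte strings stand in for image files.
train = {"train/a.jpg": b"cat", "train/b.jpg": b"dog", "train/c.jpg": b"cat"}
test = {"test/x.jpg": b"dog", "test/y.jpg": b"bird"}

print(find_exact_duplicates(train))  # [['train/a.jpg', 'train/c.jpg']]
print(find_leakage(train, test))     # ['test/x.jpg']
```

Near-duplicates (resized, re-encoded or cropped copies) would slip past this check, which is one reason to rely on the published VL Datasets issue lists rather than a hash comparison.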

## Dataset Zoo
We provide a [Dataset Zoo](https://docs.visual-layer.com/docs/available-datasets) where you can find all information for each VL Dataset.

For each dataset in the zoo, we ran an analysis using [VL Profiler](https://app.visual-layer.com) and found issues pertaining to the original dataset.
The following table is a detailed breakdown of the issues for each dataset.

The per-issue column lists values in the order Duplicates, Outliers, Blur, Dark, Bright, Mislabels, Leakage, each as a percentage with its count in parentheses where a count is given. Issue types with no value for a dataset are skipped.

| Dataset Name | Total Images | Total Issues (%) | Total Issues (Count) | Per-issue breakdown: % (count) |
|---|---|---|---|---|
| ImageNet-21K | 13,153,500 | 14.58% | 1,917,948 | 10.53% (1,385,074), 0.09% (11,119), 0.29% (38,463), 0.18% (23,575), 0.43% (56,754), 3.06% (402,963) |
| ImageNet-1K | 1,431,167 | 1.31% | 17,492 | 0.57% (7,522), 0.09% (1,199), 0.19% (2,478), 0.24% (3,174), 0.06% (770), 0.11% (1,480), 0.07% (869) |
| LAION-1B | 1,000,000,000 | 10.40% | 104,942,474 | 8.89% (89,349,899), 0.63% (6,350,368), 0.77% (7,763,266), 0.02% (242,333), 0.12% (1,236,608) |
| KITTI | 12,919 | 18.32% | 2,748 | 15.29% (2,294), 0.01%, 3.01% (452) |
| COCO | 330,000 | 0.31% | 508 | 0.12% (201), 0.09% (143), 0.03%, 0.05%, 0.01% |
| DeepFashion | 800,000 | 7.89% | 22,824 | 5.11% (14,773), 0.04% (108), 2.75% (7,943) |
| CelebA-HQ | 30,000 | 2.36% | 4,786 | 1.67% (3,389), 0.08% (157), 0.51% (1,037), 0.00%, 0.01%, 0.09% (188) |
| Places365 | 1,800,000 | 2.09% | 37,644 | 1.53% (27,520), 0.40% (7,168), 0.16% (2,956) |
| Food-101 | 101,000 | 0.62% | 627 | 0.23% (235), 0.08%, 0.18% (185), 0.04% |
| Oxford-IIIT Pet | 7,349 | 1.48% | 132 | 1.01%, 0.10%, 0.05%, 0.31% |
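As a quick sanity check on how to read the percentage columns, the sketch below assumes a percentage is simply the issue count divided by the total number of images. That assumption reproduces the ImageNet-21K total and duplicate figures; it is not a statement of how VL Profiler computes its numbers, and the function name is ours.

```python
def issue_percentage(issue_count: int, total_images: int) -> float:
    """Share of images with an issue, as a percentage rounded to two decimals."""
    if total_images <= 0:
        raise ValueError("total_images must be positive")
    return round(100 * issue_count / total_images, 2)


# ImageNet-21K figures taken from the breakdown above.
total_pct = issue_percentage(1_917_948, 13_153_500)
duplicate_pct = issue_percentage(1_385_074, 13_153_500)
print(total_pct, duplicate_pct)  # 14.58 10.53
```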

Each VL Dataset has a dataset card with full details on the issues removed from it.
The clean version of a dataset is prefixed with `vl-` to differentiate it from the original dataset.
You can also freely download a CSV of all the issues found.

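| VL Dataset Card | Original Dataset | Explore | Issues CSV | Hugging Face Dataset |
|---|---|---|---|---|
| vl-imagenet-21k | ImageNet-21K | | | |
| vl-imagenet-1k | ImageNet-1K | | | |
| vl-laion-1b | LAION-1B | | | |
| vl-kitti | KITTI | | | |
| vl-coco | COCO | | | |
| vl-deepfashion | DeepFashion | | | |
| vl-celeba-hq | CelebA-HQ | | | |
| vl-places365 | Places365 | | | |
| vl-food-101 | Food-101 | | | |
| vl-oxford-iiit-pet | Oxford-IIIT Pet | | | |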

We will continue to add support for more datasets. Here are a few currently on our roadmap:
+ EuroSAT
+ Flickr30k
+ iNaturalist
+ SVHN
+ Cityscapes
+ RVL-CDIP
+ DocLayNet

[Let us know](https://forms.gle/8jxPkyzeKj82kPed8) if you would like us to support a specific dataset.

> **Note**: If you'd like to use our cloud tool and discover issues with your own dataset, [sign up](https://app.visual-layer.com/) to use our cloud platform for free.

## Usage
The following sections show how to use the `visuallayer` SDK to load, inspect and export a VL Dataset.

### Loading a dataset
We offer handy functions to load datasets from the Dataset Zoo.
First, let's list the datasets in the zoo with:

```python
import visuallayer as vl
vl.datasets.zoo.list_datasets()
```

which currently outputs:

```shell
['vl-oxford-iiit-pets',
'vl-imagenet-21k',
'vl-imagenet-1k',
'vl-food101',
'oxford-iiit-pets',
'imagenet-21k',
'imagenet-1k',
'food101']
```

To load the dataset:

```python
my_pets = vl.datasets.zoo.load('vl-oxford-iiit-pets')
```

This loads the clean version of the Oxford IIIT Pets dataset where all of the problematic images are excluded from the dataset.

To load the original Oxford IIIT Pets dataset, simply drop the `vl-` prefix:

```python
original_pets_dataset = vl.datasets.zoo.load('oxford-iiit-pets')
```

This loads the original dataset with no modifications.
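The `vl-` naming convention makes it easy to pair each clean dataset with its original. The sketch below works on a plain list of names copied from the `list_datasets()` output shown earlier, so it does not call the SDK; the helper `pair_clean_with_original` is a hypothetical name of ours.

```python
# Names copied from the list_datasets() output shown earlier in this section.
zoo_names = [
    "vl-oxford-iiit-pets", "vl-imagenet-21k", "vl-imagenet-1k", "vl-food101",
    "oxford-iiit-pets", "imagenet-21k", "imagenet-1k", "food101",
]

CLEAN_PREFIX = "vl-"


def pair_clean_with_original(names):
    """Map each original dataset name to its `vl-` prefixed clean counterpart."""
    available = set(names)
    pairs = {}
    for name in names:
        if name.startswith(CLEAN_PREFIX):
            original = name[len(CLEAN_PREFIX):]
            if original in available:  # keep only pairs where both versions exist
                pairs[original] = name
    return pairs


pairs = pair_clean_with_original(zoo_names)
for original, clean in pairs.items():
    print(f"{original:>18} -> {clean}")
```

Either name in a pair can then be passed to `vl.datasets.zoo.load`, as shown in the examples above.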

### Inspecting a dataset
Now that you have a dataset loaded, you can view information pertaining to that dataset with:

```python
my_pets.info
```

This prints out high-level information about the loaded dataset. In this example, we used the clean version of the Oxford IIIT Pets dataset.

```shell
Metadata:
--> Name - vl-oxford-iiit-pets
--> Description - A modified version of the original Oxford IIIT Pets Dataset removing dataset issues.
--> License - Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
--> Homepage URL - https://www.robots.ox.ac.uk/~vgg/data/pets/
--> Number of Images - 7349
--> Number of Images with Issues - 109
```

If you'd like to view the issues related to the dataset, run:

```python
my_pets.report
```

which outputs:

```shell
| Reason | Count | Pct |
|-----------|-------|-------|
| Duplicate | 75 | 1.016 |
| Outlier | 7 | 0.095 |
| Dark | 4 | 0.054 |
| Leakage | 23 | 0.627 |
| Total | 109 | 1.792 |

```
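If you want to work with these numbers programmatically, one option is to parse the printed text. The sketch below assumes you have copied the report output as a string; whether `.report` itself returns a string, a DataFrame or only prints is not stated here, so `parse_report` is a hypothetical helper that operates on the text alone.

```python
REPORT_TEXT = """\
| Reason | Count | Pct |
|-----------|-------|-------|
| Duplicate | 75 | 1.016 |
| Outlier | 7 | 0.095 |
| Dark | 4 | 0.054 |
| Leakage | 23 | 0.627 |
| Total | 109 | 1.792 |
"""


def parse_report(text):
    """Parse the pipe-delimited report text into {reason: (count, pct)}."""
    rows = {}
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 3 or cells[0] == "Reason" or set(cells[0]) <= {"-"}:
            continue  # skip the header row and the dashed separator row
        reason, count, pct = cells
        rows[reason] = (int(count), float(pct))
    return rows


report = parse_report(REPORT_TEXT)
issue_counts = {reason: count for reason, (count, _) in report.items() if reason != "Total"}

# The per-issue counts should add up to the reported total.
print(sum(issue_counts.values()) == report["Total"][0])
# The most frequent issue type in this report.
print(max(issue_counts, key=issue_counts.get))
```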

Now that you've seen the issues with the dataset, you can visualize them on screen. There are two options to visualize the dataset issues.

> **Option 1** - Using the Visual Layer Profiler (VL Profiler) - Provides an extensive capability to view, group, sort, and filter the dataset issues. [Sign-up](https://app.visual-layer.com) for free.

Here's the visualization using the VL Profiler:

![profiler](./imgs/vl_profiler.gif)

> **Option 2** - In Jupyter Notebook - Provides a limited but convenient way to view the dataset without leaving your notebook.

To visualize the issues using **Option 2** in your notebook, run:

```python
my_pets.explore()
```

This should output an interactive table in your Jupyter Notebook like the following.

![explore](./imgs/explore.gif)

In the interactive table, you can view the issues, sort, filter, search, and compare the images side by side.

By default, `.explore()` loads the top 50 issues from the dataset, covering all issue types. If you'd like more granular control, you can change the `num_images` and `issue` arguments.

For example:

```python
my_pets.explore(num_images=100, issue='Duplicate')
```

The interactive table provides a convenient but limited way to visualize dataset issues.
For a more extensive visualization, view the issues using the Visual Layer Profiler.

Check out the [documentation](https://docs.visual-layer.com/docs/introduction) and blog page for the Visual Layer Profiler for more info.

### Exporting a dataset
If you'd like to use a loaded dataset to train a model, you can conveniently export the dataset with:

```python
test_dataset = my_pets.export(output_format="pytorch", split="test")
```
This exports the dataset into a PyTorch `Dataset` object that can be used readily with a PyTorch training loop.

Alternatively, you can export the Dataset to a DataFrame with:

```python
test_dataset = my_pets.export(output_format="csv", split="test")
```

## Learn from Examples
In this section, we show end-to-end examples of how to load, inspect and export a dataset, and then train a model using the PyTorch and fastai frameworks.

| Dataset | Framework | Description |
|---|---|---|
| vl-food101 | PyTorch | Load a dataset and train a PyTorch model. |
| vl-food101 | PyTorch + Hugging Face Datasets | Load VL Datasets using Hugging Face Datasets and train a PyTorch model. |
| vl-oxford-iiit-pet | fast.ai | Finetune a pretrained TIMM model using fastai. |
| vl-imagenet-1k | PyTorch | Load the cleaned ImageNet dataset and train a PyTorch model. |

## License
`visuallayer SDK` is licensed under the Apache 2.0 License. See [LICENSE](./LICENSE).

However, you are bound to the usage license of the original dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license. We provide no warranty or guarantee of accuracy or completeness.

## Telemetry

Usage Tracking

This repository incorporates usage tracking using [Sentry.io](https://sentry.io) to monitor and collect valuable information about the usage of the application.

Usage tracking allows us to gain insights into how the application is being used in real-world scenarios. It provides us with valuable information that helps in understanding user behavior, identifying potential issues, and making informed decisions to improve the application.

We DO NOT collect folder names, user names, image names, image content, or other personally identifiable information.

**What data is tracked?**

- Errors and Exceptions: Sentry captures errors and exceptions that occur in the application, providing detailed stack traces and relevant information to help diagnose and fix issues.
- Performance Metrics: Sentry collects performance metrics, such as response times, latency, and resource usage, enabling us to monitor and optimize the application's performance.

To opt out, define an environment variable named `SENTRY_OPT_OUT`.

On Linux/macOS, run the following:
```shell
export SENTRY_OPT_OUT=True
```
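In a notebook, or on a platform where `export` is not available, the same variable can be defined from Python. This is a sketch under one assumption: that the SDK reads `SENTRY_OPT_OUT` when it is imported, which is why the variable is set before the import.

```python
import os

# Define the flag before importing the SDK. That the variable is read at import
# time is an assumption, so setting it first is the conservative order.
os.environ["SENTRY_OPT_OUT"] = "True"

# import visuallayer as vl  # import only after the variable is defined

print("SENTRY_OPT_OUT =", os.environ["SENTRY_OPT_OUT"])
```

A variable set this way lasts only for the current Python process, so it needs to be repeated in each new session or notebook kernel.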

Read more on [Sentry's official webpage](https://sentry.io).

## Getting Help
Get help from the Visual Layer team or community members via the following channels:
+ [Slack](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email).
+ GitHub [issues](https://github.com/visual-layer/visuallayer/issues).
+ Discussion [forum](https://visual-layer.readme.io/discuss).

## About Visual-Layer

Visual Layer was founded by the authors of [XGBoost](https://github.com/dmlc/xgboost), [Apache TVM](https://github.com/apache/tvm) & [Turi Create](https://github.com/apple/turicreate): [Danny Bickson](https://www.linkedin.com/in/dr-danny-bickson-835b32), [Carlos Guestrin](https://www.linkedin.com/in/carlos-guestrin-5352a869) and [Amir Alush](https://www.linkedin.com/in/amiralush).

Learn more about Visual Layer [here](https://visual-layer.com).
