# Awesome open data-centric AI

Open source tooling for data-centric AI on unstructured data

**Data-centric AI (DCAI)** is a development paradigm for ML-based solutions. The term was coined by Andrew Ng who gave the following definition:

> Data-centric AI is the practice of systematically engineering the data used to build AI systems.

At [Renumics](https://renumics.com), we believe DCAI is an important puzzle piece for building real-world AI systems that generate value. We like the following definition:
> Data-centric AI means to improve training datasets systematically and iteratively by leveraging information from trained ML models.

**Tools that can be efficiently used in day-to-day applications** are the most important ingredient for the DCAI paradigm. This curated link collection is intended to help you discover useful open source tools for your data-centric AI workflows.

## 🔎 Scope

This collection includes useful tools that have an **open-source license** and are **actively maintained**. All tools mentioned are useful for building DCAI workflows on **unstructured data** (e.g. images, audio, video, time series, text).

We also collect **workflow snippets** into a **data-centric AI playbook** that show how typical tasks can be solved with open source tooling.

To keep a useful focus and to prevent duplicate work, we exclude some topics from this list, such as tooling for tabular data, dedicated labeling tools, MLOps tooling, and research papers. Please check out the [further reading](#further-reading) section to find awesome lists for these topics.

## :open_hands: Contributing
Do you think something is missing? Please help improve this list by contacting us or opening a pull request.

# 🧰 Tooling

## 📒 Categories

- [Data versioning](#data-versioning)
- [Embeddings and pre-trained models](#embeddings-and-pre-trained-models)
- [Visualization and interaction](#visualization-and-interaction)
- [Outlier and noise detection](#outlier-and-noise-detection)
- [Explainability](#explainability)
- [Active learning](#active-learning)
- [Uncertainty quantification](#uncertainty-quantification)
- [Bias and fairness](#bias-and-fairness)
- [Observability and monitoring](#observability-and-monitoring)
- [Augmentation and synthetic data](#augmentation-and-synthetic-data)
- [Security and robustness](#security-and-robustness)

## Data versioning

| Logo | Name | Description | Popularity | License |
| ------- | ---- | ----------- | ---------- | -------- |
| | [Data version control (DVC)](https://github.com/iterative/dvc) | Data Version Control or DVC is a command line tool and VS Code Extension to help you develop reproducible machine learning projects. | ![GitHub stars](https://img.shields.io/github/stars/iterative/dvc?style=social) | |
| | [deeplake](https://github.com/activeloopai/deeplake) | Data Lake for Deep Learning. Build, manage, query, version, & visualize datasets. | ![GitHub stars](https://img.shields.io/github/stars/activeloopai/deeplake?style=social) | |
| | [Pachyderm](https://github.com/pachyderm/pachyderm) | Pachyderm – Automate data transformations with data versioning and lineage. | ![GitHub stars](https://img.shields.io/github/stars/pachyderm/pachyderm?style=social) | |
| | [Delta Lake](https://github.com/delta-io/delta) | An open-source storage framework that enables building a Lakehouse architecture. | ![GitHub stars](https://img.shields.io/github/stars/delta-io/delta?style=social) | |
| | [lakeFS](https://github.com/treeverse/lakeFS) | lakeFS is an open-source tool that transforms your object storage into a Git-like repository. | ![GitHub stars](https://img.shields.io/github/stars/treeverse/lakeFS?style=social) | |

## Embeddings and pre-trained models

| Logo | Name | Description | Popularity | License |
| ------- | ---- | ----------- | ---------- | -------- |
| | [towhee](https://github.com/towhee-io/towhee) | Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast. | ![GitHub stars](https://img.shields.io/github/stars/towhee-io/towhee?style=social) | |
| | [Tensorflow Hub](https://github.com/tensorflow/hub) | TensorFlow Hub is a repository of reusable assets for machine learning with TensorFlow. | ![GitHub stars](https://img.shields.io/github/stars/tensorflow/hub?style=social) | |
| | [Huggingface transformers](https://github.com/huggingface/transformers) | State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX. | ![GitHub stars](https://img.shields.io/github/stars/huggingface/transformers?style=social) | |
| | [Lightly](https://github.com/lightly-ai/lightly) | Lightly is a computer vision framework for self-supervised learning. | ![GitHub stars](https://img.shields.io/github/stars/lightly-ai/lightly?style=social) | |

## Visualization and interaction

| Logo | Name | Description | Popularity | License |
| ------- | ---- | ----------- | ---------- | -------- |
| | [Renumics Spotlight](https://github.com/renumics/spotlight) | Curation tool for unstructured data that connects your stack to the data-centric AI ecosystem. | ![GitHub stars](https://img.shields.io/github/stars/renumics/spotlight?style=social) | |
| | [FiftyOne](https://github.com/voxel51/fiftyone) | The open-source tool for building high-quality datasets and computer vision models. | ![GitHub stars](https://img.shields.io/github/stars/voxel51/fiftyone?style=social) | |
| | [refinery](https://github.com/code-kern-ai/refinery) | The data scientist's open-source choice to scale, assess and maintain natural language data. | ![GitHub stars](https://img.shields.io/github/stars/code-kern-ai/refinery?style=social) | |
| | [Argilla](https://github.com/argilla-io/argilla) | Argilla helps domain experts and data teams to build better NLP datasets in less time. | ![GitHub stars](https://img.shields.io/github/stars/argilla-io/argilla?style=social) | |
| | [Xtreme1](https://github.com/xtreme1-io/xtreme1) | Xtreme1 is the world's first open-source platform for multisensory training data. | ![GitHub stars](https://img.shields.io/github/stars/xtreme1-io/xtreme1?style=social) | |
| | [YData Profiling](https://github.com/ydataai/ydata-profiling) | YData Profiling is a python package to perform Exploratory Data Analysis (EDA) for tabular and time-series data. | ![GitHub stars](https://img.shields.io/github/stars/ydataai/ydata-profiling?style=social) | |

## Outlier and noise detection

| Logo | Name | Description | Popularity | License |
| ------- | ---- | ----------- | ---------- | -------- |
| | [Cleanlab](https://github.com/cleanlab/cleanlab) | Cleanlab facilitates machine learning with messy, real-world data by providing clean labels for robust training and flagging errors in your data. | ![GitHub stars](https://img.shields.io/github/stars/cleanlab/cleanlab?style=social) | |
| | [PyOD](https://github.com/yzhao062/pyod) | A comprehensive and scalable Python library for outlier detection (anomaly detection); see the sketch below the table. | ![GitHub stars](https://img.shields.io/github/stars/yzhao062/pyod?style=social) | |
| | [TODS](https://github.com/datamllab/tods) | A full-stack automated time-series outlier detection system. | ![GitHub stars](https://img.shields.io/github/stars/datamllab/tods?style=social) | |
| | [Alibi Detect](https://github.com/SeldonIO/alibi-detect) | Algorithms for outlier, adversarial and drift detection. | ![GitHub stars](https://img.shields.io/github/stars/SeldonIO/alibi-detect?style=social) | |
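
As a rough illustration of how such detectors fit into a DCAI workflow, here is a minimal PyOD sketch; the random feature matrix is a stand-in for embeddings or engineered features of your own data, and the contamination value is an assumption you would tune.

```python
import numpy as np
from pyod.models.iforest import IForest  # Isolation Forest detector from PyOD

# Stand-in features; in practice use embeddings of your images/audio/text.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))

detector = IForest(contamination=0.01, random_state=0)
detector.fit(X)

scores = detector.decision_scores_   # outlier score per training sample (higher = more anomalous)
suspects = np.argsort(scores)[-20:]  # the 20 most anomalous samples for manual inspection
print(suspects)
```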

## Explainability

| Logo | Name | Description | Popularity | License |
| ------- | ---- | ----------- | ---------- | -------- |
| | [SHAP](https://github.com/slundberg/shap) | A game theoretic approach to explain the output of any machine learning model (see the sketch below the table). | ![GitHub stars](https://img.shields.io/github/stars/slundberg/shap?style=social) | |
| | [Alibi](https://github.com/SeldonIO/alibi) | Alibi is an open source Python library aimed at machine learning model inspection and interpretation. | ![GitHub stars](https://img.shields.io/github/stars/SeldonIO/alibi?style=social) | |
| | [LIME](https://github.com/marcotcr/lime) | Explaining the predictions of any machine learning classifier. | ![GitHub stars](https://img.shields.io/github/stars/marcotcr/lime?style=social) | |
| | [Captum](https://github.com/pytorch/captum) | Model interpretability and understanding for PyTorch. | ![GitHub stars](https://img.shields.io/github/stars/pytorch/captum?style=social) | |
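
To give a feel for how these libraries are typically called, here is a minimal SHAP sketch on a toy scikit-learn regressor; the dataset and model are placeholders for your own.

```python
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Train a small model on a toy dataset; in a DCAI workflow this would be your own model and data.
X, y = load_diabetes(return_X_y=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes SHAP values (per-feature attributions) for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:100])

print(shap_values.shape)  # (100, n_features): one attribution per sample and feature
```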

## Active learning

| Logo | Name | Description | Popularity | License |
| ------- | ---- | ----------- | ---------- | -------- |
| | [modAL](https://github.com/modAL-python/modAL) | A modular active learning framework for Python (see the sketch below the table). | ![GitHub stars](https://img.shields.io/github/stars/modAL-python/modAL?style=social) | |
| | [Bayesian Active Learning (Baal)](https://github.com/baal-org/baal) | Library to enable Bayesian active learning in your research or labeling work. | ![GitHub stars](https://img.shields.io/github/stars/baal-org/baal?style=social) | |
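
A minimal pool-based active learning loop with modAL might look roughly like this; the seed/pool split and the estimator are illustrative assumptions.

```python
from modAL.models import ActiveLearner
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# Toy pool-based setup: a few labeled seeds, the rest acts as the unlabeled pool.
X, y = load_digits(return_X_y=True)
X_seed, y_seed, X_pool, y_pool = X[:50], y[:50], X[50:], y[50:]

learner = ActiveLearner(
    estimator=RandomForestClassifier(random_state=0),
    X_training=X_seed,
    y_training=y_seed,
)

# Default query strategy is uncertainty sampling: ask for the most uncertain pool sample.
query_idx, query_sample = learner.query(X_pool)
print(query_idx)  # index into X_pool to send to a human annotator

# Here we look up the true label to simulate the annotator, then teach the learner and repeat.
learner.teach(X_pool[query_idx], y_pool[query_idx])
```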

## Uncertainty quantification

| Logo | Name | Description | Popularity | License |
| ------- | ---- | ----------- | ---------- | -------- |
| | [Uncertainty Toolbox](https://github.com/uncertainty-toolbox/uncertainty-toolbox/) | A Python toolbox for predictive uncertainty quantification, calibration, metrics, and visualization. | ![GitHub stars](https://img.shields.io/github/stars/uncertainty-toolbox/uncertainty-toolbox?style=social) | |
| | [MAPIE](https://github.com/scikit-learn-contrib/MAPIE) | A scikit-learn-compatible module for estimating prediction intervals (see the sketch below the table). | ![GitHub stars](https://img.shields.io/github/stars/scikit-learn-contrib/MAPIE?style=social) | |
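
A rough sketch of conformal prediction intervals, assuming the `MapieRegressor` interface of the MAPIE 0.x releases (newer versions expose a revised API); the toy dataset and linear model are placeholders.

```python
from mapie.regression import MapieRegressor
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrap any scikit-learn regressor; MAPIE calibrates prediction intervals via conformal prediction.
mapie = MapieRegressor(LinearRegression())
mapie.fit(X_train, y_train)

# alpha=0.1 requests 90% prediction intervals.
y_pred, y_intervals = mapie.predict(X_test, alpha=0.1)
print(y_intervals.shape)  # (n_samples, 2, 1): lower and upper bound per sample
```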

## Bias and fairness

| Logo | Name | Description | Popularity | License |
| ------- | ---- | ----------- | ---------- | -------- |
| | [AIF360](https://github.com/Trusted-AI/AIF360) | The AI Fairness 360 toolkit helps to detect and mitigate bias in machine learning models throughout the AI application lifecycle. | ![GitHub stars](https://img.shields.io/github/stars/Trusted-AI/AIF360?style=social) | |
| | [Fairlearn](https://github.com/fairlearn/fairlearn) | A Python package to assess and improve fairness of machine learning models (see the sketch below the table). | ![GitHub stars](https://img.shields.io/github/stars/fairlearn/fairlearn?style=social) | |
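
A minimal Fairlearn sketch for per-group evaluation; the toy labels and the binary sensitive feature are made-up stand-ins for your own predictions and metadata.

```python
import pandas as pd
from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score

# Toy predictions with a binary sensitive feature; replace with your model outputs and metadata.
y_true = pd.Series([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = pd.Series([1, 0, 0, 1, 0, 1, 1, 0])
sensitive = pd.Series(["a", "a", "a", "a", "b", "b", "b", "b"])

# MetricFrame evaluates a metric overall and per group of the sensitive feature.
mf = MetricFrame(
    metrics=accuracy_score,
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sensitive,
)
print(mf.overall)       # overall accuracy
print(mf.by_group)      # accuracy per sensitive group
print(mf.difference())  # largest gap between groups
```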

## Observability and monitoring

| Logo | Name | Description | Popularity | License |
| ------- | ---- | ----------- | ---------- | -------- |
| | [Arize-Phoenix](https://github.com/Arize-ai/phoenix) | Arize-Phoenix is a Python library for ML observability (monitoring + root-cause analysis) for tabular, CV, NLP, and LLM models. | ![GitHub stars](https://img.shields.io/github/stars/Arize-AI/phoenix?style=social) | |
| | [Deepchecks](https://github.com/deepchecks/deepchecks) | Deepchecks is a Python package for comprehensively validating your machine learning models and data with minimal effort. | ![GitHub stars](https://img.shields.io/github/stars/deepchecks/deepchecks?style=social) | |
| | [Evidently](https://github.com/evidentlyai/evidently) | An open-source framework to evaluate, test and monitor ML models in production. | ![GitHub stars](https://img.shields.io/github/stars/evidentlyai/evidently?style=social) | |
| | [langfuse](https://github.com/langfuse/langfuse) | Open source observability and analytics for LLM applications. | ![GitHub stars](https://img.shields.io/github/stars/langfuse/langfuse?style=social) | |
| | [langkit](https://github.com/whylabs/langkit) | An open-source toolkit for monitoring Large Language Models (LLMs). | ![GitHub stars](https://img.shields.io/github/stars/whylabs/langkit?style=social) | |

## Augmentation and synthetic data

| Logo | Name | Description | Popularity | License |
| ------- | ---- | ----------- | ---------- | -------- |
| | [Albumentations](https://github.com/albumentations-team/albumentations) | Fast image augmentation library and an easy-to-use wrapper around other libraries (see the sketch below the table). | ![GitHub stars](https://img.shields.io/github/stars/albumentations-team/albumentations?style=social) | |
| | [Gretel Synthetics](https://github.com/gretelai/gretel-synthetics) | Synthetic data generators for structured and unstructured text, featuring differentially private learning. | ![GitHub stars](https://img.shields.io/github/stars/gretelai/gretel-synthetics?style=social) | |
| | [SDV](https://github.com/sdv-dev/SDV) | Synthetic Data Generation for tabular, relational and time series data. | ![GitHub stars](https://img.shields.io/github/stars/sdv-dev/SDV?style=social) | |
| | [YData Synthetic](https://github.com/ydataai/ydata-synthetic) | YData Synthetic is a python package to generate synthetic tabular and time-series data by leveraging state-of-the-art generative models. | ![GitHub stars](https://img.shields.io/github/stars/ydataai/ydata-synthetic?style=social) | |
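
A minimal Albumentations sketch; the chosen transforms and probabilities are illustrative, not a recommended recipe.

```python
import numpy as np
import albumentations as A

# A typical augmentation pipeline for image data; tune transforms and probabilities for your task.
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.GaussNoise(p=0.2),
])

image = np.random.randint(0, 255, size=(256, 256, 3), dtype=np.uint8)  # stand-in for a real image
augmented = transform(image=image)["image"]
print(augmented.shape)
```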

## Security and robustness

| Logo | Name | Description | Popularity | License |
| ------- | ---- | ----------- | ---------- | -------- |
| | [CleverHans](https://github.com/cleverhans-lab/cleverhans) | An adversarial example library for constructing attacks, building defenses, and benchmarking both. | ![GitHub stars](https://img.shields.io/github/stars/cleverhans-lab/cleverhans?style=social) | |
| | [Adversarial Robustness Toolbox](https://github.com/Trusted-AI/adversarial-robustness-toolbox) | Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams. | ![GitHub stars](https://img.shields.io/github/stars/Trusted-AI/adversarial-robustness-toolbox?style=social) | |
| | [Foolbox](https://github.com/bethgelab/foolbox) | Foolbox is a Python library that lets you easily run adversarial attacks against machine learning models like deep neural networks. | ![GitHub stars](https://img.shields.io/github/stars/bethgelab/foolbox?style=social) | |
| | [Giskard](https://github.com/Giskard-AI/giskard) | The testing framework for ML models, from tabular to LLMs. | ![GitHub stars](https://img.shields.io/github/stars/Giskard-AI/giskard?style=social) | |
| | [LLM-Guard](https://github.com/laiyer-ai/llm-guard) | The Security Toolkit for LLM Interactions. | ![GitHub stars](https://img.shields.io/github/stars/laiyer-ai/llm-guard?style=social) | |
| | [guardrails](https://github.com/guardrails-ai/guardrails) | Adding guardrails to large language models. | ![GitHub stars](https://img.shields.io/github/stars/guardrails-ai/guardrails?style=social) | |

# 🏀 Data-centric AI playbook

## Exploratory data analysis (EDA)

| Name | Data type | Description | Notebook |
| ---- | ---- | ---- | ----------- |
| [Understand distributions](https://renumics.com/docs/playbook/huggingface-embedding) | image | Use the Huggingface transformers library to compute image embeddings and explore the dataset based on the similarity map and additional metadata (see the sketch below the table). | Open In Colab |
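
A minimal sketch of the embedding step with the Huggingface `datasets` and `transformers` libraries; the dataset slice and the ViT checkpoint are example choices, not part of the linked playbook entry.

```python
import torch
from datasets import load_dataset
from transformers import AutoImageProcessor, AutoModel

# Small image dataset and a pre-trained vision transformer (example names, not prescriptions).
dataset = load_dataset("cifar10", split="train[:100]")
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = AutoModel.from_pretrained("google/vit-base-patch16-224-in21k")

def embed(batch):
    inputs = processor(images=batch["img"], return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token as a per-image embedding vector.
    batch["embedding"] = outputs.last_hidden_state[:, 0].numpy()
    return batch

dataset = dataset.map(embed, batched=True, batch_size=16)
print(len(dataset["embedding"][0]))  # 768-dimensional embedding per image
```

These embeddings can then be reduced (e.g. with UMAP) to a similarity map and explored together with the metadata columns.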

## Cleaning

| Name | Data type | Description | Notebook |
| ---- | ---- | ---- | ----------- |
| [Detect duplicates](https://renumics.com/docs/playbook/duplicates-annoy/) | agnostic | Use the Annoy library to detect nearest neighbors in the embedding space and inspect data points that are duplicates / near duplicates (see the sketch below the table). | Open In Colab |
| [Detect outliers](https://renumics.com/docs/playbook/outliers-cleanlab/) | agnostic | Use the Cleanlab library to compute outlier scores based on model output (embeddings, probabilities) and inspect outlier candidates. | Open In Colab |
| [Detect image issues](https://renumics.com/docs/playbook/cv-issues/) | image | Use the Cleanvision library to extract typical image issues (brightness, blur, aspect ratio, SNR and duplicates) and identify critical segments through manual inspection. | Open In Colab |
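
For the duplicate detection entry, a minimal Annoy sketch could look like this; the random embeddings and the 0.1 distance threshold are stand-ins you would replace with your own vectors and a threshold tuned by inspection.

```python
import numpy as np
from annoy import AnnoyIndex

# One embedding vector per sample, e.g. from the EDA sketch above (stand-in data here).
embeddings = np.random.default_rng(0).normal(size=(1000, 768)).astype("float32")

dim = embeddings.shape[1]
index = AnnoyIndex(dim, "angular")  # angular distance ~ cosine distance
for i, vector in enumerate(embeddings):
    index.add_item(i, vector)
index.build(20)  # number of trees

# Flag pairs whose nearest neighbor is closer than a small threshold as duplicate candidates.
duplicate_candidates = []
for i in range(len(embeddings)):
    neighbors, distances = index.get_nns_by_item(i, 2, include_distances=True)
    j, distance = neighbors[1], distances[1]  # neighbors[0] is the item itself
    if distance < 0.1 and i < j:
        duplicate_candidates.append((i, j, distance))
print(duplicate_candidates[:10])
```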

## Annotation

| Name | Data type | Description | Notebook |
| ---- | ---- | ---- | ----------- |
| [Find label inconsistencies](https://renumics.com/docs/playbook/label-errors-cleanlab/) | agnostic | Use the Cleanlab library to compute label error flags based on model probabilities and manually inspect critical data segments (see the sketch below the table). | Open In Colab |
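
A minimal Cleanlab sketch for the label inconsistency entry; the random labels and probabilities are placeholders for your own labels and out-of-sample predicted probabilities (e.g. from cross-validation).

```python
import numpy as np
from cleanlab.filter import find_label_issues

# labels: given (possibly noisy) class labels; pred_probs: out-of-sample predicted probabilities,
# shape (n_samples, n_classes). Stand-in values shown here.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=500)
pred_probs = rng.dirichlet(np.ones(3), size=500)

issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issue_indices[:20])  # most likely label errors first; review these segments manually
```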

## Modeling
| Name | Data type | Description | Notebook |
| ---- | ---- | ---- | ----------- |
| [Detect leakage](https://renumics.com/docs/playbook/leakage-annoy/) | agnostic | Use nearest neighbor distances to identify candidates for data leakage and manually inspect them (see the sketch below the table). | Open In Colab |
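
A minimal sketch of the leakage check using scikit-learn nearest neighbors; the embeddings and the 0.05 distance threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Embeddings of the training and test split (stand-in data here).
rng = np.random.default_rng(0)
train_emb = rng.normal(size=(2000, 128))
test_emb = rng.normal(size=(500, 128))

# For every test sample, find its closest training sample.
nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(train_emb)
distances, indices = nn.kneighbors(test_emb)

# Very small distances suggest the test sample (or a near copy) also occurs in the training set.
leak_candidates = np.where(distances[:, 0] < 0.05)[0]
print(leak_candidates)
```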

## Validation
| Name | Data type | Description | Notebook |
| ---- | ---- | ---- | ----------- |
| [Inspect decision boundaries](https://renumics.com/docs/playbook/decision-boundary/) | agnostic | Compute a decision boundary score based on certainty ratios and inspect the results in a scatter plot (see the sketch below the table). | Open In Colab |
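
A minimal sketch of one possible certainty-ratio score; the random probabilities stand in for your model's softmax outputs, and the exact score definition in the linked playbook entry may differ.

```python
import numpy as np

# Softmax outputs of the model, shape (n_samples, n_classes); stand-in data here.
pred_probs = np.random.default_rng(0).dirichlet(np.ones(5), size=1000)

# Certainty ratio: second-highest probability divided by the highest.
# Values close to 1 mean the model is torn between two classes, i.e. near the decision boundary.
sorted_probs = np.sort(pred_probs, axis=1)
boundary_score = sorted_probs[:, -2] / sorted_probs[:, -1]

near_boundary = np.argsort(boundary_score)[-20:]  # samples to inspect, e.g. in a scatter plot
print(near_boundary)
```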

## Monitoring
| Name | Data type | Description | Notebook |
| ---- | ---- | ---- | ----------- |
| [Detect data drift](https://renumics.com/docs/playbook/label-errors-cleanlab/) | agnostic | Compute the cosine distance of the k-nearest neighbor in the embedding space as the drift distance and inspect critical segments (see the sketch below the table). | Open In Colab |
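
A minimal sketch of the drift distance with scikit-learn nearest neighbors; the embeddings, the choice of k, and the reference/production split are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# reference_emb: embeddings of the training/reference data; production_emb: new incoming data.
rng = np.random.default_rng(0)
reference_emb = rng.normal(size=(5000, 128))
production_emb = rng.normal(size=(1000, 128))

# Drift distance per production sample: cosine distance to its k-th nearest reference neighbor.
k = 5
nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(reference_emb)
distances, _ = nn.kneighbors(production_emb)
drift_distance = distances[:, -1]

# Samples far away from everything seen during training are drift candidates worth inspecting.
drift_candidates = np.argsort(drift_distance)[-20:]
print(drift_distance.mean(), drift_candidates)
```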

# 📖 Further reading

In order to keep a useful focus and to prevent duplicate work, we excluded some topics from this list. Read more about them here:

1. DCAI tools for tabular data. There is an [awesome list](https://github.com/Data-Centric-AI-Community/awesome-data-centric-ai) for that maintained by the [YData team](https://github.com/Data-Centric-AI-Community).
2. Labeling tools. Although labeling is part of the DCAI workflow, we refer to the [awesome list](https://github.com/zenml-io/awesome-open-data-annotation) of the [ZenML team](https://github.com/zenml-io) on that topic.
3. MLOps tooling. We exclude all topics that are clearly out of the DCAI scope and refer to established [MLOps awesome lists](https://github.com/EthicalML/awesome-production-machine-learning) for these tools.
4. Research papers. We focus on industrial-ready open source tooling, check out [this list](https://github.com/HazyResearch/data-centric-ai) for a research-oriented view on DCAI.