awesome-open-data-centric-ai

Curated list of open source tooling for data-centric AI on unstructured data.
https://github.com/Renumics/awesome-open-data-centric-ai

Data versioning
- Data version control (DVC)
- deeplake
- Pachyderm
- Delta Lake - Open-source storage framework that enables building a Lakehouse architecture.
- lakeFS - Open-source tool that transforms your object storage into a Git-like repository.
Embeddings and pre-trained models
- towhee
- TensorFlow Hub
- Huggingface transformers - State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
- Lightly - ... self-supervised learning.
Visualization and Interaction
- FiftyOne - Open-source tool for building high-quality datasets and computer vision models.
- refinery - Open-source choice to scale, assess and maintain natural language data.
- Argilla
- Xtreme1 - Open-source platform for multisensory training data.
- YData Profiling - ... time-series data.
- Renumics Spotlight - ... data-centric AI ecosystem.
Outlier and noise detection
- Cleanlab - ... real-world data by providing clean labels for robust training and flagging errors in your data.
- PyOD
- TODS - Full-stack automated time-series outlier detection system.
- Alibi Detect
Explainability
- Alibi
- LIME
- Captum
- SHAP
Active learning
- modAL
- Bayesian Active Learning (Baal)
Uncertainty quantification
- MAPIE - scikit-learn-compatible module for estimating prediction intervals.
- Uncertainty Toolbox
Bias and fairness
- AIF360
- Fairlearn
Observability and Monitoring
- Arize-Phoenix - Phoenix is a Python library for ML observability (monitoring + root-cause analysis) for tabular, CV, NLP, and LLM models.
- Deepchecks
- Evidently - Open-source framework to evaluate, test and monitor ML models in production.
- langfuse
- langkit - Open-source toolkit for monitoring Large Language Models (LLMs).
Augmentation and synthetic data
- Albumentations - ... easy-to-use wrapper around other libraries.
- Gretel Synthetics
- SDV
- YData Synthetic - ... time-series data by leveraging state-of-the-art generative models.
Security and robustness
- CleverHans
- Adversarial Robustness Toolbox - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams.
- Foolbox
- Giskard
- guardrails
- LLM-Guard
Monitoring
- awesome list - ... Centric-AI-Community).
- awesome list - ... io) on that topic.
- MLOps awesome lists
- this list - ... oriented view on DCAI.
- Detect data drift - ... k-nearest neighbor in the embedding space as the drift distance and inspect critical segments. Open in Colab: https://colab.research.google.com/github/Renumics/spotlight/blob/main/playbook/rookie/drift_kcore.ipynb
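
The drift entry above uses a nearest-neighbor distance in embedding space as the drift signal. A minimal sketch of that idea follows. It is not taken from the linked notebook: the function name `knn_drift_distance`, the helper `sample_embeddings`, the choice `k=5`, and the synthetic Gaussian "embeddings" are all illustrative assumptions.

```python
import math
import random


def knn_drift_distance(reference, candidates, k=5):
    """For each candidate embedding, return the Euclidean distance
    to its k-th nearest neighbor among the reference embeddings."""
    scores = []
    for point in candidates:
        dists = sorted(math.dist(point, ref) for ref in reference)
        scores.append(dists[k - 1])  # k is 1-based
    return scores


def sample_embeddings(center, n, dim, rng):
    """Hypothetical stand-in for real embeddings: Gaussian points around `center`."""
    return [[rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
reference = sample_embeddings(0.0, 300, 8, rng)  # data the model was trained on
in_dist = sample_embeddings(0.0, 50, 8, rng)     # new data, same distribution
shifted = sample_embeddings(3.0, 50, 8, rng)     # new data, drifted

in_scores = knn_drift_distance(reference, in_dist)
shifted_scores = knn_drift_distance(reference, shifted)

mean_in = sum(in_scores) / len(in_scores)
mean_shifted = sum(shifted_scores) / len(shifted_scores)
print(f"mean drift distance, same distribution: {mean_in:.2f}")
print(f"mean drift distance, shifted data:      {mean_shifted:.2f}")
```

Samples with a large distance lie far from anything in the reference set and are candidates for closer inspection. The brute-force search here is only meant for small examples; larger datasets would likely need an approximate nearest-neighbor index.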
Exploratory data analysis (EDA)
- Understand distributions
Cleaning
- Detect duplicates
- Detect outliers
- Detect image issues
Modeling
- Detect leakage
Validation
- Inspect decision boundaries

Programming Languages

Python 27 Jupyter Notebook 9 TypeScript 3 Go 2 TeX 1 JavaScript 1 Scala 1
