Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-open-data-centric-ai
Curated list of open source tooling for data-centric AI on unstructured data.
https://github.com/Renumics/awesome-open-data-centric-ai
Last synced: 5 days ago
-
Data versioning
- Data version control (DVC)
- deeplake
- Pachyderm
- Delta Lake - open-source storage framework that enables building a Lakehouse architecture.
- lakeFS - open-source tool that transforms your object storage into a Git-like repository.
-
Embeddings and pre-trained models
- towhee
- Tensorflow Hub
- Huggingface transformers - State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
- Lightly - ... self-supervised learning.
-
Visualization and Interaction
- FiftyOne - open-source tool for building high-quality datasets and computer vision models.
- refinery - open-source choice to scale, assess and maintain natural language data.
- Argilla
- Xtreme1 - open-source platform for multisensory training data.
- YData Profiling - ... time-series data.
- Renumics Spotlight - ... data-centric AI ecosystem.
-
Outlier and noise detection
- Cleanlab - ... real-world data by providing clean labels for robust training and flagging errors in your data.
- PyOD
- TODS - full-stack automated time-series outlier detection system.
- Alibi Detect
-
Explainability
-
Active learning
- modAL
- Bayesian Active Learning (Baal)
-
Uncertainty quantification
- MAPIE - scikit-learn-compatible module for estimating prediction intervals.
-
Bias and fairness
-
Observability and Monitoring
- Arize-Phoenix - Phoenix is a Python library for ML observability (monitoring + root-cause analysis) for tabular, CV, NLP, and LLM models.
- Deepchecks
- Evidently - open-source framework to evaluate, test and monitor ML models in production.
- langfuse
- langkit - open-source toolkit for monitoring Large Language Models (LLMs).
-
Augmentation and synthetic data
- Albumentations - ... easy-to-use wrapper around other libraries.
- Gretel Synthetics
- SDV
- YData Synthetic - ... time-series data by leveraging state-of-the-art generative models.
-
Security and robustness
- CleverHans
- Adversarial Robustness Toolbox - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams.
- Foolbox
- Giskard
- guardrails
-
Monitoring
- awesome list - ... Centric-AI-Community
- awesome list - ... on that topic.
- MLOps awesome lists
- this list - ... oriented view on DCAI.
- Detect data drift - ... nearest neighbor in the embedding space as the drift distance and inspect critical segments. (Colab notebook: https://colab.research.google.com/github/Renumics/spotlight/blob/main/playbook/rookie/drift_kcore.ipynb)
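
The drift entry above describes using a nearest-neighbor distance in embedding space as a drift score. The following is a minimal sketch of that general idea, not the method of the linked notebook: the function name `knn_drift_distances`, the choice of `k`, the Euclidean metric, and the toy 2-D "embeddings" are all assumptions made for illustration. It uses only the Python standard library.

```python
import math
import random


def knn_drift_distances(reference, candidates, k=1):
    """For each candidate embedding, return the Euclidean distance
    to its k-th nearest neighbor among the reference embeddings."""
    if not 1 <= k <= len(reference):
        raise ValueError("k must be between 1 and the number of reference embeddings")
    distances = []
    for point in candidates:
        # Brute-force search: fine for a sketch, too slow for large datasets.
        ordered = sorted(math.dist(point, ref) for ref in reference)
        distances.append(ordered[k - 1])
    return distances


# Toy 2-D "embeddings" (hypothetical data): reference samples near the origin,
# one candidate batch from the same distribution and one shifted batch.
rng = random.Random(0)
reference = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
in_distribution = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
shifted = [(rng.gauss(6, 1), rng.gauss(6, 1)) for _ in range(20)]

scores_in = knn_drift_distances(reference, in_distribution, k=3)
scores_shifted = knn_drift_distances(reference, shifted, k=3)

mean_in = sum(scores_in) / len(scores_in)
mean_shifted = sum(scores_shifted) / len(scores_shifted)
print(f"mean drift distance (in-distribution): {mean_in:.2f}")
print(f"mean drift distance (shifted):         {mean_shifted:.2f}")
```

Samples with the largest distances are the natural candidates for the "critical segments" to inspect. In practice the embeddings would come from a pre-trained model, and an approximate nearest-neighbor index would likely replace the brute-force search.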
-
Exploratory data analysis (EDA)
- Understand distributions
-
Cleaning
- Detect duplicates
- Detect outliers
- Detect image issues
-
Modeling
- Detect leakage
-
Validation
- Inspect decision boundaries
Categories
- Visualization and Interaction (6)
- Security and robustness (5)
- Monitoring (5)
- Data versioning (5)
- Observability and Monitoring (5)
- Explainability (4)
- Outlier and noise detection (4)
- Embeddings and pre-trained models (4)
- Augmentation and synthetic data (4)
- Cleaning (3)
- Active learning (2)
- Bias and fairness (2)
- Exploratory data analysis (EDA) (1)
- Modeling (1)
- Validation (1)
- Uncertainty quantification (1)
Keywords
- machine-learning (32)
- python (17)
- deep-learning (15)
- data-science (12)
- ai (11)
- mlops (8)
- artificial-intelligence (8)
- pytorch (7)
- llm (6)
- active-learning (6)
- computer-vision (6)
- data-centric-ai (5)
- tensorflow (5)
- data-quality (5)
- synthetic-data (4)
- nlp (4)
- image-classification (4)
- time-series (4)
- interpretability (4)
- data-analysis (4)
- annotation (3)
- developer-tools (3)
- fairness-ai (3)
- html-report (3)
- jupyter-notebook (3)
- responsible-ai (3)
- data-drift (3)
- llmops (3)
- data-curation (3)
- pandas-dataframe (3)
- datasets (3)
- image-processing (3)
- langchain (3)
- large-language-models (3)
- natural-language-processing (3)
- data-version-control (3)
- ml (3)
- unstructured-data (3)
- embeddings (3)
- outlier-detection (3)
- analytics (3)
- unsupervised-learning (2)
- python3 (2)
- detection (2)
- images (2)
- data-mining (2)
- explainability (2)
- fairness (2)
- prompt-engineering (2)
- trusted-ai (2)