Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-open-data-centric-ai

Curated list of open source tooling for data-centric AI on unstructured data.
https://github.com/Renumics/awesome-open-data-centric-ai

  • Data versioning

    • Data version control (DVC)
    • deeplake
    • Pachyderm
    • Delta Lake - Open-source storage framework that enables building a Lakehouse architecture.
    • lakeFS - Open-source tool that transforms your object storage into a Git-like repository.
  • Embeddings and pre-trained models

    • towhee
    • Tensorflow Hub
    • Huggingface transformers - State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
    • Lightly - self-supervised learning.
  • Visualization and Interaction

    • FiftyOne - Open-source tool for building high-quality datasets and computer vision models.
    • refinery - Open-source choice to scale, assess and maintain natural language data.
    • Argilla
    • Xtreme1 - Open-source platform for multisensory training data.
    • YData Profiling - time-series data.
    • Renumics Spotlight - data-centric AI ecosystem.
  • Outlier and noise detection

    • Cleanlab - real-world data by providing clean labels for robust training and flagging errors in your data.
    • PyOD
    • TODS - full-stack automated time-series outlier detection system.
    • Alibi Detect
  • Explainability

  • Active learning

    • modAL
    • Bayesian Active Learning (Baal)
  • Uncertainty quantification

    • MAPIE - scikit-learn-compatible module for estimating prediction intervals.
  • Bias and fairness

    • AIF360
    • Fairlearn
  • Observability and Monitoring

    • Arize-Phoenix - Phoenix is a Python library for ML observability (monitoring + root-cause analysis) for tabular, CV, NLP, and LLM models.
    • Deepchecks
    • Evidently - Open-source framework to evaluate, test and monitor ML models in production.
    • langfuse
    • langkit - Open-source toolkit for monitoring Large Language Models (LLMs).
  • Augmentation and synthetic data

    • Albumentations - easy-to-use wrapper around other libraries.
    • Gretel Synthetics
    • SDV
    • YData Synthetic - time-series data by leveraging state-of-the-art generative models.
  • Security and robustness

    • CleverHans
    • Adversarial Robustness Toolbox - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams.
    • Foolbox
    • Giskard
    • guardrails
  • Monitoring

    • awesome list - Data-Centric-AI-Community
    • awesome list
    • MLOps awesome lists
    • this list - oriented view on DCAI.
    • Detect data drift - nearest neighbor in the embedding space as the drift distance and inspect critical segments. (Colab: https://colab.research.google.com/github/Renumics/spotlight/blob/main/playbook/rookie/drift_kcore.ipynb)
  • Exploratory data analysis (EDA)

  • Cleaning

  • Modeling

  • Validation