Projects in Awesome Lists tagged with data-curation
A curated list of projects in awesome lists tagged with data-curation.
https://github.com/cleanlab/cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
active-learning annotation data-centric-ai data-cleaning data-curation data-labeling data-profiling data-quality data-science data-validation dataops dataquality datasets exploratory-data-analysis labeling llms noisy-labels out-of-distribution-detection outlier-detection weak-supervision
Last synced: 08 Jan 2026
https://github.com/voxel51/fiftyone
Refine high-quality datasets and visual AI models
active-learning artificial-intelligence computer-vision data-centric-ai data-cleaning data-curation data-quality data-science deep-learning developer-tools image-classification machine-learning object-detection python unstructured-data vector-search visualization
Last synced: 19 Feb 2026
https://github.com/docta-ai/docta
A Doctor for your data
data data-centric-ai data-centric-machine-learning data-curation data-diagnosis language-model rlhf
Last synced: 13 May 2025
https://github.com/visual-layer/fastdup
fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.
data-augmentation data-curation dataset deep-learning image image-analysis image-classfication image-classification image-duplicate-detection image-processing image-similarity machine-learning novelty-detection object-detection outlier-detection python visual-search visualization visualization-tools
Last synced: 14 May 2025
https://github.com/renumics/spotlight
Interactively explore unstructured datasets from your dataframe.
audio computer-vision data-centric-ai data-curation data-visualization exploratory-data-analysis hacktoberfest images machine-learning meshes timeseries unstructured-data video
Last synced: 14 May 2025
https://github.com/daochenzha/data-centric-ai
A curated, but incomplete, list of data-centric AI resources.
ai artificial-intelligence data-centric data-centric-ai data-centric-machine-learning data-curation data-engineering data-quality data-science machine-learning
Last synced: 05 Feb 2026
https://github.com/NVIDIA/NeMo-Curator
Scalable data preprocessing and curation toolkit for LLMs
data data-curation data-prep data-preparation data-processing data-processing-pipelines data-quality datacuration datarecipes deduplication fast-data-processing fine-tuning large-language-models large-scale-data-processing llm llm-data-quality llmapps python semantic-deduplication
Last synced: 29 Jul 2025
https://github.com/NVIDIA-NeMo/Curator
Scalable data preprocessing and curation toolkit for LLMs
data data-curation data-prep data-preparation data-processing data-processing-pipelines data-quality datacuration datarecipes deduplication fast-data-processing fine-tuning large-language-models large-scale-data-processing llm llm-data-quality llmapps python semantic-deduplication
Last synced: 20 Jul 2025
https://github.com/renumics/sliceguard
A library for detecting problematic data segments in structured and unstructured data with a few lines of code.
data-analysis data-cleaning data-curation data-exploration data-science data-visualization deep-learning eda exploratory-data-analysis machine-learning python visualization
Last synced: 16 Mar 2025
https://github.com/laureberti/learn2clean
Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning
automated data-cleaning data-cleaning-pipeline data-curation data-preprocessing reinforcement-learning
Last synced: 11 Sep 2025
https://github.com/brainlife/ezbids
A web service for semi-automated conversion of raw imaging data to BIDS
bids bids-converter brain-imaging data-curation dicom interoperability mri neuroimaging nifti web
Last synced: 23 Feb 2026
https://github.com/cleanlab/cleanlab-studio
Client interface to Cleanlab Studio and the Trustworthy Language Model
annotations automl computer-vision data-centric-ai data-cleaning data-curation data-labeling data-profiling data-quality data-science data-validation image-classification llm machine-learning model-deployment natural-language-processing noisy-labels outlier-detection structured-data text-classification
Last synced: 13 Apr 2025
https://github.com/iwangjian/TopDial
Code and data for "Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation" (EMNLP 2023)
data-curation dialogue-systems personalization
Last synced: 16 Apr 2025
https://github.com/wolframresearch/data-curation-training
companion data-curation livecoding training training-materials
Last synced: 15 Apr 2025
https://github.com/vida-nyu/openclean-core
Data Cleaning and Data Profiling Library for Python
data-cleaning data-curation hacktoberfest
Last synced: 10 Apr 2025
https://github.com/thehyve/tmtk
tranSMART Arborist ETL toolkit
data-curation data-modeling jupyter-notebook transmart
Last synced: 02 Oct 2025
https://github.com/arup-cas/aiscr-webamcr
Archaeological Map of the Czech Republic (AMCR)
archaeology data-curation digital-archive fair repository
Last synced: 11 Feb 2026
https://github.com/voxel51/fiftyone_mlflow_plugin
Track model training experiments with MLflow and FiftyOne!
computer-vision data-curation experiment-tracking fiftyone fiftyone-datasets mlflow
Last synced: 26 Jun 2025
https://github.com/caumente/multi_task_breast_cancer
Multi-task framework for breast cancer segmentation and classification
breast-cancer classification computer-vision data-curation deep-learning segmentation ultrasound-imaging
Last synced: 18 Jan 2026
https://github.com/cgnorthcutt/reliablity_framework_for_rag
Demo showing how the Trustworthy Language Model adds reliability to LLM outputs and improves RAG, agents, and data enrichment workflows. It can be used to improve fine-tuning of LLMs, accuracy of LLM outputs, and smart routing for RAG and agents.
chatgpt data-cleaning data-curation data-observability data-quality llms observability rag
Last synced: 29 Jul 2025
https://github.com/acdh-oeaw/tokeneditor
TokenEditor is a web application for manual annotation (or manual review of automatic annotations) of text. Although primarily aimed at reviewing PoS tags and lemmas, it is fully customizable and can support any annotation level.
data-curation lemmatization part-of-speech-tagging
Last synced: 16 Mar 2025
https://github.com/laura-budurlean/data-wrangling-exercise-ro4532a
This R script performs data wrangling, cleaning, and transformation tasks for a fictitious study RO4532A. It processes multiple sheets from an Excel file, merges and reshapes the data, and generates a curated dataset.
data-cleaning data-curation data-transformation data-wrangling r-programming
Last synced: 29 Jun 2025
https://github.com/apelullo/yelp_health_data_curation_ops
An AWS-based data pipeline to extract, process, store, and monitor Yelp "health-related" facility data in support of ongoing health system initiatives.
academic-research automation aws data-access data-curation data-infrastructure data-pipelines health-data operations operations-research python yelp-dataset
Last synced: 21 Jul 2025