An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-curation

A curated list of projects in awesome lists tagged with data-curation.

https://github.com/visual-layer/fastdup

fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.

data-augmentation data-curation dataset deep-learning image image-analysis image-classfication image-classification image-duplicate-detection image-processing image-similarity machine-learning novelty-detection object-detection outlier-detection python visual-search visualization visualization-tools

Last synced: 14 May 2025

https://github.com/renumics/sliceguard

A library for detecting problematic data segments in structured and unstructured data with a few lines of code.

data-analysis data-cleaning data-curation data-exploration data-science data-visualization deep-learning eda exploratory-data-analysis machine-learning python visualization

Last synced: 16 Mar 2025

https://github.com/laureberti/learn2clean

Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning

automated data-cleaning data-cleaning-pipeline data-curation data-preprocessing reinforcement-learning

Last synced: 11 Sep 2025

https://github.com/brainlife/ezbids

A web service for semi-automated conversion of raw imaging data to BIDS

bids bids-converter brain-imaging data-curation dicom interoperability mri neuroimaging nifti web

Last synced: 23 Feb 2026

https://github.com/iwangjian/TopDial

Code and data for "Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation" (EMNLP 2023)

data-curation dialogue-systems personalization

Last synced: 16 Apr 2025

https://github.com/vida-nyu/openclean-core

Data Cleaning and Data Profiling Library for Python

data-cleaning data-curation hacktoberfest

Last synced: 10 Apr 2025

https://github.com/thehyve/tmtk

tranSMART Arborist ETL toolkit

data-curation data-modeling jupyter-notebook transmart

Last synced: 02 Oct 2025

https://github.com/arup-cas/aiscr-webamcr

Archaeological Map of the Czech Republic (AMCR)

archaeology data-curation digital-archive fair repository

Last synced: 11 Feb 2026

https://github.com/voxel51/fiftyone_mlflow_plugin

Track model training experiments with MLflow and FiftyOne!

computer-vision data-curation experiment-tracking fiftyone fiftyone-datasets mlflow

Last synced: 26 Jun 2025

https://github.com/caumente/multi_task_breast_cancer

Multi-task framework for breast cancer segmentation and classification

breast-cancer classification computer-vision data-curation deep-learning segmentation ultrasound-imaging

Last synced: 18 Jan 2026

https://github.com/cgnorthcutt/reliablity_framework_for_rag

Demo showing how the Trustworthy Language Model adds reliability to LLM outputs and improves RAG, agents, and data enrichment workflows. It can be used to improve fine-tuning of LLMs, accuracy of LLM outputs, and smart routing for RAG and agents.

chatgpt data-cleaning data-curation data-observability data-quality llms observability rag

Last synced: 29 Jul 2025

https://github.com/acdh-oeaw/tokeneditor

TokenEditor is a web application for manual annotation (or manual review of automatic annotations) of text. Although primarily aimed at reviewing PoS tags and lemmas, it is fully customizable to support any annotation level.

data-curation lemmatization part-of-speech-tagging

Last synced: 16 Mar 2025

https://github.com/laura-budurlean/data-wrangling-exercise-ro4532a

This R script performs data wrangling, cleaning, and transformation tasks for a fictitious study RO4532A. It processes multiple sheets from an Excel file, merges and reshapes the data, and generates a curated dataset.

data-cleaning data-curation data-transformation data-wrangling r-programming

Last synced: 29 Jun 2025

https://github.com/apelullo/yelp_health_data_curation_ops

An AWS-based data pipeline to extract, process, store, and monitor Yelp "health-related" facility data in support of ongoing health system initiatives.

academic-research automation aws data-access data-curation data-infrastructure data-pipelines health-data operations operations-research python yelp-dataset

Last synced: 21 Jul 2025