Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/data-forge/data-forge-ts
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
csv data data-analysis data-cleaning data-cleansing data-forge data-management data-manipulation data-munging data-visualization data-wrangling javascript json linq nodejs pandas visualization
Last synced: 02 Jul 2024
https://github.com/charlesdedampierre/BunkaTopics
🗺️ Data Cleaning and Textual Data Visualization 🗺️
cartography data-cleaning explainability fine-tuning llms machine-learning natural-language-processing nlp summarization topic-modeling
Last synced: 21 Jun 2024
https://github.com/ChrisMuir/refinr
Cluster and merge similar string values: an R implementation of Open Refine clustering algorithms
approximate-string-matching clustering cran data-cleaning data-clustering fuzzy-matching ngram openrefine r rstats
Last synced: 21 Jun 2024
https://github.com/data-cleaning/errorlocate
Find and replace erroneous fields in data using validation rules
data-cleaning errors invalidation r
Last synced: 10 Jun 2024
https://github.com/the-Hull/datacleanr
Interactive and Reproducible Data Cleaning
annotation-tool data-cleaning outlier-detection outlier-removal reproducibility
Last synced: 10 Jun 2024
https://github.com/carlosstack/cleaning-data-in-r
⛏️ 💎 Taking raw data and converting it to tidy data that can be used for later analysis
data-cleaning data-manipulation r-programming regular-expression
Last synced: 10 Jun 2024
https://github.com/data-cleaning/validatetools
data-cleaning r rules validation
Last synced: 04 Jun 2024
https://github.com/Nelson-Gon/mde
mde: Missing Data Explorer
data-analysis data-cleaning data-exploration data-science datacleaner datacleaning exploratory-data-analysis missing missing-data missing-value-treatment missing-values missingness omit r r-package r-stats recode replace rstats statistics
Last synced: 04 Jun 2024
https://github.com/data-cleaning/dcmodifydb
Deterministic, documented correction rules on a database
correction data-cleaning database r
Last synced: 04 Jun 2024
https://github.com/ropensci/taxa
Taxonomic classes for R
data-cleaning r r-package rstats taxon taxonomy
Last synced: 04 Jun 2024
https://github.com/data-cleaning/validatesuggest
Generate validation rules from data
Last synced: 03 Jun 2024
https://github.com/marksweiss/sofine
Lightweight framework for creating data-collecting plugins and chaining calls to them from CLI, REST or Python to return unified data sets.
cross-language data-cleaning data-processing data-retrieval json python
Last synced: 03 Jun 2024
https://github.com/skrub-data/skrub
Prepping tables for machine learning
data data-analysis data-cleaning data-preparation data-preprocessing data-science data-wrangling dirty-data machine-learning
Last synced: 31 May 2024
https://github.com/awesome-mlops/awesome-ml-monitoring
A curated list of awesome open source tools and commercial products for monitoring data quality, monitoring model performance, and profiling data 🚀
aiops concept-drift data-cleaning data-drift data-management data-monitoring data-quality data-science dataops datascience deep-learning machine-learning machine-learning-platform mlops model-drift model-explainability model-management model-monitoring model-performance
Last synced: 31 May 2024
https://github.com/schema-inspector/schema-inspector
Schema-Inspector is a simple JavaScript object sanitization and validation module.
data-cleaning javascript sanitization validation
Last synced: 16 May 2024
https://github.com/johnkerl/miller
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
command-line command-line-tools csv csv-format data-cleaning data-processing data-reduction data-regression devops devops-tools json json-data miller statistical-analysis statistics streaming-algorithms streaming-data tabular-data tsv unix-toolkit
Last synced: 07 May 2024
https://github.com/unionai-oss/pandera
A light-weight, flexible, and expressive statistical data testing library
assertions data-assertions data-check data-cleaning data-processing data-validation data-verification dataframe-schema dataframes hypothesis-testing pandas pandas-dataframe pandas-validation pandas-validator schema testing testing-tools validation
Last synced: 28 Apr 2024
https://github.com/cleanlab/cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
active-learning annotation data-analysis data-centric-ai data-cleaning data-curation data-labeling data-profiling data-quality data-science data-validation dataops dataquality datasets labeling llms noisy-labels out-of-distribution-detection outlier-detection weak-supervision
Last synced: 28 Apr 2024
https://github.com/hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
big-data-cleaning bigdata cudf dask dask-cudf data-analysis data-cleaner data-cleaning data-cleansing data-exploration data-extraction data-preparation data-profiling data-science data-transformation data-wrangling machine-learning pyspark spark
Last synced: 28 Apr 2024
https://github.com/Desbordante/desbordante-core
Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It can also run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.
anomaly-detection correlations data-analytics data-cleaning data-cleansing data-engineering data-exploration data-mining data-mining-algorithms data-preprocessing data-profiling data-science data-wrangling exploratory-data-analysis feature-engineering feature-extraction feature-selection knowledge-discovery spreadsheets tabular-data
Last synced: 21 Apr 2024
https://github.com/msamogh/nonechucks
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
data-cleaning data-pipeline data-preprocessing data-processing machine-learning preprocessing pytorch torch
Last synced: 19 Apr 2024
https://github.com/akanz1/klib
Easy-to-use Python library of customized functions for cleaning and analyzing data.
data-analysis data-cleaning data-preprocessing data-science data-visualization feature-selection klib python
Last synced: 16 Apr 2024
https://github.com/LukasHedegaard/datasetops
Fluent dataset operations, compatible with your favorite libraries
data-cleaning data-munging data-processing data-science data-wrangling dataset dataset-combinations deep-learning multiple-datasets pytorch tensorflow
Last synced: 16 Apr 2024
https://github.com/probcomp/PClean
A domain-specific probabilistic programming language for scalable Bayesian data cleaning
bayesian-inference data-cleaning data-cleansing probabilistic-graphical-models probabilistic-programming
Last synced: 13 Apr 2024
https://github.com/Mahima1729/Data-Professional-Survey-Analysis
Analysis and Power BI dashboarding of data collected from a survey of data professionals.
data-cleaning data-visualization dax excel power-query powerbi
Last synced: 10 Apr 2024
https://github.com/jmcastagnetto/covid-19-data-cleanup
Scripts to clean up data from https://github.com/CSSEGISandData/COVID-19
covid-19 covid-19-data data-cleaning data-visualization datasets r
Last synced: 09 Apr 2024
https://github.com/ECNU-ICALK/EduChat
An open-source Chinese-English educational chat model from ICALK, East China Normal University (general-purpose base model, GPU deployment, data cleaning). With thanks to LLaMA, MOSS, BELLE, Ziya, vLLM.
belle chinese-nlp data-cleaning education llama llm moss open-models
Last synced: 07 Apr 2024
https://github.com/sharad461/nepali-translator
Neural Machine Translation on the Nepali-English language pair
data-cleaning machine-translation nepali-english parallel-corpus
Last synced: 02 Apr 2024
https://github.com/jananiarunachalam/Data-Science-Portfolio
Data Science Projects Repository
analytics api data-cleaning data-science data-visualization databases deep-learning excel machine-learning numpy pandas plotly predictive-modeling python3 r r-programming sql
Last synced: 01 Apr 2024
https://github.com/justmarkham/pandas-videos
Jupyter notebook and datasets from the pandas video series
data-analysis data-cleaning data-science jupyter-notebook pandas python tutorial
Last synced: 27 Mar 2024
https://github.com/voxel51/fiftyone
The open-source tool for building high-quality datasets and computer vision models
active-learning artificial-intelligence computer-vision data-centric-ai data-cleaning data-curation data-quality data-science deep-learning developer-tools image-classification machine-learning object-detection python unstructured-data vector-search visualization
Last synced: 27 Mar 2024
https://github.com/jim-schwoebel/voicebook
🗣️ A book and repo to get you started programming voice computing applications in Python (10 chapters and 200+ scripts).
data data-cleaning encryption-decryption featurization generation machine-learning python3 security server transcription visualization voice voice-activity-detection voice-assistant voice-computing voice-control voice-recognition voice-recording wake-word-detection
Last synced: 27 Mar 2024
https://github.com/sfirke/janitor
Simple tools for data cleaning in R
data-analysis data-cleaning data-science dirty-data excel pivot-tables r spss tabulations tidyverse
Last synced: 26 Mar 2024
https://github.com/data-cleaning/validate
Professional data validation for the R environment
Last synced: 26 Mar 2024
https://github.com/ekstroem/dataMaid
An R package for data screening
data-cleaning data-screening reproducible-research
Last synced: 21 Mar 2024
https://github.com/genomoncology/FuzzTypes
Pydantic extension for annotating autocorrecting fields.
data-cleaning fuzzy-string-matching named-entity-linking pydantic
Last synced: 20 Mar 2024
https://github.com/aai-institute/pyDVL
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
banzhaf-index data-centric-ai data-cleaning data-pruning data-quality data-valuation game-theory influence-functions least-core machine-learning robust-machine-learning shapley-value transferlab
Last synced: 14 Mar 2024