Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/ChrisMuir/refinr

Cluster and merge similar string values: an R implementation of OpenRefine clustering algorithms

approximate-string-matching clustering cran data-cleaning data-clustering fuzzy-matching ngram openrefine r rstats

Last synced: 21 Jun 2024

https://github.com/data-cleaning/errorlocate

Find and replace erroneous fields in data using validation rules

data-cleaning errors invalidation r

Last synced: 10 Jun 2024

https://github.com/carlosstack/cleaning-data-in-r

Taking raw data and converting it to tidy data that can be used for later analysis

data-cleaning data-manipulation r-programming regular-expression

Last synced: 10 Jun 2024

https://github.com/data-cleaning/dcmodifydb

Deterministic, documented correction rules on a database

correction data-cleaning database r

Last synced: 04 Jun 2024

https://github.com/ropensci/taxa

Taxonomic classes for R

data-cleaning r r-package rstats taxon taxonomy

Last synced: 04 Jun 2024

https://github.com/data-cleaning/validatesuggest

Generate validation rules from data

data-cleaning r validation

Last synced: 03 Jun 2024

https://github.com/marksweiss/sofine

Lightweight framework for creating data-collecting plugins and chaining calls to them from CLI, REST or Python to return unified data sets.

cross-language data-cleaning data-processing data-retrieval json python

Last synced: 03 Jun 2024

https://github.com/msberends/clean

Fast and Easy Data Cleaning (in R)

data-cleaning r

Last synced: 20 May 2024

https://github.com/schema-inspector/schema-inspector

Schema-Inspector is a simple JavaScript object sanitization and validation module.

data-cleaning javascript sanitization validation

Last synced: 16 May 2024

https://github.com/Desbordante/desbordante-core

Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also supports running data cleaning scenarios with these algorithms. Desbordante has a console version and an easy-to-use web application.

anomaly-detection correlations data-analytics data-cleaning data-cleansing data-engineering data-exploration data-mining data-mining-algorithms data-preprocessing data-profiling data-science data-wrangling exploratory-data-analysis feature-engineering feature-extraction feature-selection knowledge-discovery spreadsheets tabular-data

Last synced: 21 Apr 2024

https://github.com/msamogh/nonechucks

Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!

data-cleaning data-pipeline data-preprocessing data-processing machine-learning preprocessing pytorch torch

Last synced: 19 Apr 2024

https://github.com/akanz1/klib

Easy to use Python library of customized functions for cleaning and analyzing data.

data-analysis data-cleaning data-preprocessing data-science data-visualization feature-selection klib python

Last synced: 16 Apr 2024

https://github.com/probcomp/PClean

A domain-specific probabilistic programming language for scalable Bayesian data cleaning

bayesian-inference data-cleaning data-cleansing probabilistic-graphical-models probabilistic-programming

Last synced: 13 Apr 2024

https://github.com/Mahima1729/Data-Professional-Survey-Analysis

Analysis and Power BI dashboarding of data collected from a survey of data professionals.

data-cleaning data-visualization dax excel power-query powerbi

Last synced: 10 Apr 2024

https://github.com/jmcastagnetto/covid-19-data-cleanup

Scripts to clean up data from https://github.com/CSSEGISandData/COVID-19

covid-19 covid-19-data data-cleaning data-visualization datasets r

Last synced: 09 Apr 2024

https://github.com/ECNU-ICALK/EduChat

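An open-source educational chat model from ICALK, East China Normal University. An open-source Chinese-English large language model for educational dialogue (general-purpose base model, GPU deployment, data cleaning). Acknowledgements: LLaMA, MOSS, BELLE, Ziya, vLLM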

belle chinese-nlp data-cleaning education llama llm moss open-models

Last synced: 07 Apr 2024

https://github.com/sharad461/nepali-translator

Neural Machine Translation on the Nepali-English language pair

data-cleaning machine-translation nepali-english parallel-corpus

Last synced: 02 Apr 2024

https://github.com/justmarkham/pandas-videos

Jupyter notebook and datasets from the pandas video series

data-analysis data-cleaning data-science jupyter-notebook pandas python tutorial

Last synced: 27 Mar 2024

https://github.com/data-cleaning/validate

Professional data validation for the R environment

data-cleaning r validation

Last synced: 26 Mar 2024

https://github.com/ekstroem/dataMaid

An R package for data screening

data-cleaning data-screening reproducible-research

Last synced: 21 Mar 2024

https://github.com/genomoncology/FuzzTypes

Pydantic extension for annotating autocorrecting fields.

data-cleaning fuzzy-string-matching named-entity-linking pydantic

Last synced: 20 Mar 2024

https://github.com/aai-institute/pyDVL

pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation

banzhaf-index data-centric-ai data-cleaning data-pruning data-quality data-valuation game-theory influence-functions least-core machine-learning robust-machine-learning shapley-value transferlab

Last synced: 14 Mar 2024