Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-cleaning

A curated list of projects in awesome lists tagged with data-cleaning.

https://github.com/justmarkham/pandas-videos

Jupyter notebook and datasets from the pandas video series

data-analysis data-cleaning data-science jupyter-notebook pandas python tutorial

Last synced: 20 Dec 2024

https://github.com/ECNU-ICALK/EduChat

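An open-source educational chat model from ICALK, East China Normal University. An open-source Chinese-English educational dialogue large language model (general-purpose base model, GPU deployment, data cleaning). With thanks to: LLaMA, MOSS, BELLE, Ziya, vLLM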

belle chinese-nlp data-cleaning education llama llm moss open-models

Last synced: 02 Nov 2024

https://github.com/schema-inspector/schema-inspector

Schema-Inspector is a simple JavaScript object sanitization and validation module.

data-cleaning javascript sanitization validation

Last synced: 20 Dec 2024

https://github.com/akanz1/klib

Easy to use Python library of customized functions for cleaning and analyzing data.

data-analysis data-cleaning data-preprocessing data-science data-visualization feature-selection klib python

Last synced: 15 Nov 2024

https://github.com/data-cleaning/validate

Professional data validation for the R environment

data-cleaning r validation

Last synced: 25 Oct 2024

https://github.com/msamogh/nonechucks

Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!

data-cleaning data-pipeline data-preprocessing data-processing machine-learning preprocessing pytorch torch

Last synced: 14 Nov 2024

https://github.com/probcomp/PClean

A domain-specific probabilistic programming language for scalable Bayesian data cleaning

bayesian-inference data-cleaning data-cleansing probabilistic-graphical-models probabilistic-programming

Last synced: 13 Nov 2024

https://github.com/genomoncology/FuzzTypes

Pydantic extension for annotating autocorrecting fields.

data-cleaning fuzzy-string-matching named-entity-linking pydantic

Last synced: 17 Nov 2024

https://github.com/ekstroem/datamaid

An R package for data screening

data-cleaning data-screening reproducible-research

Last synced: 03 Dec 2024

https://github.com/ekstroem/dataMaid

An R package for data screening

data-cleaning data-screening reproducible-research

Last synced: 13 Nov 2024

https://github.com/hi-primus/bumblebee

🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)

bumblebee cudf dask dask-cudf data-cleaning data-preparation data-profiling datasets gpu gui optimus prepare-data python

Last synced: 12 Nov 2024

https://github.com/iam-mhaseeb/skytrax-data-warehouse

A full data warehouse infrastructure with ETL pipelines running inside Docker on Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase for data visualizations such as analytical dashboards.

airflow data-analysis data-analytics data-cleaning data-engineering data-orchestration data-processing data-visualization data-warehouse data-warehousing database docker metabase python python3 redshift s3 s3-bucket sql

Last synced: 14 Dec 2024

https://github.com/chrismuir/refinr

Cluster and merge similar string values: an R implementation of Open Refine clustering algorithms

approximate-string-matching clustering cran data-cleaning data-clustering fuzzy-matching ngram openrefine r rstats

Last synced: 18 Dec 2024

https://github.com/ChrisMuir/refinr

Cluster and merge similar string values: an R implementation of Open Refine clustering algorithms

approximate-string-matching clustering cran data-cleaning data-clustering fuzzy-matching ngram openrefine r rstats

Last synced: 26 Oct 2024

https://github.com/aai-institute/pyDVL

pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation

banzhaf-index data-centric-ai data-cleaning data-pruning data-quality data-valuation game-theory influence-functions least-core machine-learning robust-machine-learning shapley-value transferlab

Last synced: 17 Nov 2024

https://github.com/lolei/redditcleaner

Cleans Reddit text data

data-cleaning hacktoberfest nlp praw psaw pushshift python reddit text-data

Last synced: 15 Dec 2024

https://github.com/sail-sg/sailcraft

Data Toolkit for Sailor Language Models

data-cleaning data-deduplication

Last synced: 07 Nov 2024

https://github.com/renumics/sliceguard

A library for detecting problematic data segments in structured and unstructured data with few lines of code.

data-analysis data-cleaning data-curation data-exploration data-science data-visualization deep-learning eda exploratory-data-analysis machine-learning python visualization

Last synced: 27 Oct 2024

https://github.com/Desbordante/desbordante-core

Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also supports running data cleaning scenarios with these algorithms. Desbordante has a console version and an easy-to-use web application.

anomaly-detection correlations data-analytics data-cleaning data-cleansing data-engineering data-exploration data-mining data-mining-algorithms data-preprocessing data-profiling data-science data-wrangling exploratory-data-analysis feature-engineering feature-extraction feature-selection knowledge-discovery spreadsheets tabular-data

Last synced: 04 Nov 2024

https://github.com/msberends/clean

Fast and Easy Data Cleaning (in R)

data-cleaning r

Last synced: 08 Nov 2024

https://github.com/ropensci/taxa

Taxonomic classes for R

data-cleaning r r-package rstats taxon taxonomy

Last synced: 04 Dec 2024

https://github.com/elysian01/data-purifier

A Python library for automated exploratory data analysis, automated data cleaning, and automated data preprocessing for machine learning and natural language processing applications.

data-analysis data-cleaning data-cleaning-pipeline data-preprocessing data-science data-visualization datapurifier eda exploratory-data-analysis jupyter python-lib python-library python3

Last synced: 07 Nov 2024

https://github.com/ammsa/dtcleaner

DTCleaner: data cleaning using multi-target decision trees.

data-cleaning data-mining data-preprocessing data-quality data-science data-wrangling

Last synced: 28 Oct 2024

https://github.com/theronione/cleaner.jl

A toolbox of simple solutions for common data cleaning problems.

data data-cleaning julia

Last synced: 12 Oct 2024

https://github.com/jmcastagnetto/covid-19-data-cleanup

Scripts to clean up data from https://github.com/CSSEGISandData/COVID-19

covid-19 covid-19-data data-cleaning data-visualization datasets r

Last synced: 08 Nov 2024

https://github.com/facultyai/boltzmannclean

Fill missing values in Pandas DataFrames using Restricted Boltzmann Machines

data-cleaning data-science dataframe pandas restricted-boltzmann-machine

Last synced: 08 Nov 2024

https://github.com/data-cleaning/errorlocate

Find and replace erroneous fields in data using validation rules

data-cleaning errors invalidation r

Last synced: 04 Dec 2024

https://github.com/amine-smahi/r-learning-journey

Some of the projects I made when starting to learn R for data science at university

afc cpa data-cleaning data-integration data-science datascience r r-language

Last synced: 27 Oct 2024

https://github.com/catalyst/moodle-local_datacleaner

Reduce, filter, and anonymize Moodle data for non-prod environments

anonymize data-cleaning datacleaner moodle php plugin

Last synced: 11 Nov 2024

https://github.com/aifred-health/vulcanai

A high-level deep learning framework for quickly prototyping networks, with added tools for data visualisation, model interpretability, and performance metrics

data-analysis data-cleaning data-science data-visualization deep-learning deep-neural-networks feature-engineering mental-health python3 pytorch scikit-learn

Last synced: 05 Dec 2024

https://github.com/facultyai/ipydataclean

Interactive cleaning for Pandas DataFrames

data-cleaning data-science dataframe jupyter-notebook pandas

Last synced: 28 Oct 2024

https://github.com/jkminder/data2neo

Data2Neo is a library that simplifies the conversion of data in relational format to a graph knowledge database.

data-cleaning data-conversion data-engineering data2neo database-migrations graphs neo4j relational-databases remodeling

Last synced: 14 Oct 2024

https://github.com/chinmayrane16/titanic-survival-in-depth-analysis

Uses the Pandas, Matplotlib, and Seaborn libraries to analyze, visualize, and explore data on people travelling on the Titanic, and Scikit-learn modelling algorithms to predict their probability of survival.

classification-model data-cleaning data-visualization feature-engineering matplotlib numpy pandas seaborn

Last synced: 27 Oct 2024

https://github.com/firaskahlaoui/heart-disease-prediction

The Heart Disease Prediction project aims to predict the likelihood of heart disease using machine learning techniques.

data-cleaning data-visualization flask jupyter-notebook kaggle-dataset model-building python3

Last synced: 15 Nov 2024

https://github.com/kemingy/plane

A text processing tool that includes tag (HTML, URL, email) extraction and removal, punctuation normalization, simple segmentation, and so on.

chinese-nlp data-cleaning nlp preprocess regex tokenization tokenizer

Last synced: 27 Oct 2024

https://github.com/jchehe/xcel

[This project has moved to the team's GitHub.] This repository will therefore only sync the latest README.md; to watch, star, or fork, please go to the team's GitHub. Thank you.

data-cleaning electron vue

Last synced: 11 Nov 2024

https://github.com/jay0lee/cmdc

Chrome Managed Data Cleanup - https://chrome.google.com/webstore/detail/chrome-managed-data-clean/anfhmiaflneaeffhlmbcedfjakdlpleg

cache cookies data-cleaning g-suite google-chrome google-chrome-extension javascript

Last synced: 24 Oct 2024

https://github.com/epiverse-trace/cleanepi

R package to clean and standardize epidemiological data

data-cleaning epidemiology epiverse r r-package

Last synced: 02 Dec 2024

https://github.com/sayakpaul/analytics-vidhya-game-of-deep-learning-hackathon

Contains my experiments for the Game of Deep Learning Hackathon conducted by Analytics Vidhya

active-learning analytics-vidhya computer-vision data-cleaning deep-learning fastai label-noise

Last synced: 23 Oct 2024

https://github.com/waynejz/comp9321-19t1

COMP9321 Data Services Engineering 2019T1

backend data-cleaning data-services data-visualization

Last synced: 18 Dec 2024

https://github.com/yaph/james-bond-actors

Script to grab Freebase data about James Bond actors and generate a GEXF data file.

data-cleaning data-processing data-retrieval freebase james-bond-actors network-graph

Last synced: 03 Dec 2024

https://github.com/marksweiss/sofine

Lightweight framework for creating data-collecting plugins and chaining calls to them from CLI, REST or Python to return unified data sets.

cross-language data-cleaning data-processing data-retrieval json python

Last synced: 04 Dec 2024

https://github.com/incubated-geek-cc/text-manipulation

A browser-based text-manipulation toolkit. No server required. Re-designed version of https://textmechanic.com/

css data-cleaning html javascript productivity text-editor

Last synced: 15 Nov 2024

https://github.com/vida-nyu/openclean-core

Data Cleaning and Data Profiling Library for Python

data-cleaning data-curation hacktoberfest

Last synced: 24 Nov 2024

https://github.com/siddeshsambasivam/ntuoss-datascraping-and-datacleaning-workshop

This repository contains the reference scripts and the content presented in the NTU OSS Data scraping and Data cleaning workshop.

data-cleaning data-crawling data-scraping

Last synced: 24 Oct 2024

https://github.com/hrolive/from-data-to-insights-with-google-cloud-platform

A four-course accelerated online specialization that teaches participants how to derive insights through data analysis and visualization using the Google Cloud Platform

data-analysis data-cleaning data-preparation data-visualization sql

Last synced: 09 Nov 2024

https://github.com/data-cleaning/dcmodifydb

Deterministic, documented correction rules on a database

correction data-cleaning database r

Last synced: 04 Dec 2024

https://github.com/memgonzales/pisa-2018-analysis

Jupyter notebook presenting the process of data preparation, research question formulation, data analysis, and data modelling with the goal of extracting insights from the 2018 PISA Dataset

data-cleaning data-modeling data-science data-visualization exploratory-data-analysis jupyter-notebook matplotlib numpy oecd-data pandas pisa scipy statistical-inference

Last synced: 19 Nov 2024

https://github.com/data-cleaning/validatesuggest

Generate validation rules from data

data-cleaning r validation

Last synced: 04 Dec 2024

https://github.com/cbozan/graduation-project

A graduation project that categorizes popular search phrases using Python and Spark and presents them on a website to inspire creators.

crisp-dm data-cleaning data-science machine-learning nlp nlp-machine-learning spark spark-mllib

Last synced: 23 Nov 2024

https://github.com/depressioncenter/mden

Mobile technologies code from the University of Michigan's Mobile Data Experts Network (MDEN), featuring data cleaning automations, REDCap project templates, and links to useful external modules. [DOI: 10.6084/m9.figshare.25438714]

automation data-analysis data-cleaning fitness-tracker heart-rate-data mobile-data mobile-development mquery powerautomate powerbi powerquery python r sleep-data smartwatch-data tableau

Last synced: 25 Nov 2024

https://github.com/fbraza/python-tocase

A library to help recasing your strings

case-converter data-cleaning pandas python python3 strings-manipulation

Last synced: 07 Dec 2024

https://github.com/srinivasrm/mutual-funds-analysis-and-prediction

In this project I analyzed and predicted the 1-, 3-, and 5-year returns of 1064 mutual funds in India. I scraped the data from the most visited website for mutual fund investments. I tested several regression models: linear model, SGD Regressor, Random Forest Regressor, Decision Tree Regressor, Ridge, MLP Regressor, and linear model (Lasso). I then selected the best-performing model, performed hyperparameter tuning, and deployed an interactive application that generates the visualization and sends it by email to the user's email address.

beautifulsoup data-analysis data-base data-cleaning data-science deployment etl finanace frontend funds machine-learning mutual mutual-funds pgsql python scikit-learn sql streamlit web webapplication

Last synced: 11 Oct 2024

https://github.com/yaph/world-aid-transparency

World aid transparency data scripts for creating a visualization with D3

competition-project data-cleaning data-processing data-retrieval data-visualization worldbank

Last synced: 03 Dec 2024

https://github.com/yaph/gh-commit-locations

Scripts used for analyzing GitHub commit locations to create a map visualization

big-query data-challenge data-cleaning data-mining data-processing data-visualization github information-retrieval user-location world-map

Last synced: 03 Dec 2024

https://github.com/kwokhing/exploratory-data-analysis-on-smrt-tweets

Demo of exploratory data analysis (EDA) on train service disruptions, based on scraped tweets (user-generated content) from the Twitter account of the train operator (SMRT)

data-analysis data-cleaning data-collection data-preparation exploratory-data-analysis exploratory-data-visualizations folium geospatial-data leaflet-map python python3 regex scraping selenium selenium-python social-media text-processing user-generated-content web-scraping webscraping

Last synced: 02 Dec 2024

https://github.com/kwokhing/network-analysis-on-mrt-station

Demo of applying network analysis to a network of connected railway stations, attempting to identify the important stations (nodes) in this network. Web scraping with the rvest package is also briefly discussed.

betweenness-centrality closeness-centrality data-cleaning degree-centrality eigenvector-centrality gephi graph-analysis igraph r rvest social-network-analysis social-networks web-scraping xpath

Last synced: 02 Dec 2024

https://github.com/nragland37/event-optimization-tool

R-based Shiny application that maps availability and identifies optimal engagement times to enhance participation within an organization

data-analysis data-cleaning data-preparation heatmap r shiny shiny-app tidyverse

Last synced: 16 Nov 2024

https://github.com/baimamboukar/python_data_cleaning

Data cleaning automation for emails in CSV and Excel files

automation csv data-cleaning excel oop-principles python3

Last synced: 12 Nov 2024

https://github.com/bgreenwell/bpa

Basic pattern analysis in R

basic-pattern-analysis data-cleaning r standardization

Last synced: 16 Oct 2024

https://github.com/easonlai/pii-data-scrubber

A demo repo showing how to use Azure Text Analytics to scrub personally identifiable information (PII) from data with Python (Jupyter Notebook). PII scrubbing is an important part of data wrangling and data cleaning.

azure azure-cognitive-services azure-text-analysis azure-text-analytics data-cleaning data-wrangling jupyter-notebook jypyter microsoft-azure microsoft-cognitive-services pandas pandas-dataframe pii-data pii-data-scrub pii-data-scrubber pii-data-scrubbing piidata python python3

Last synced: 10 Nov 2024

https://github.com/kaustubhgupta/google-fit-data-analysis

Notebook code and dataset for the Google Fit analysis I did for the Analytics Vidhya blog.

data-cleaning data-visualization demo google plotly voila

Last synced: 29 Nov 2024

https://github.com/cgnorthcutt/reliablity_framework_for_rag

Demo showing how the Trustworthy Language Model adds reliability to LLM outputs and improves RAG, agents, and data enrichment workflows. It can be used to improve fine-tuning of LLMs, accuracy of LLM outputs, and smart routing for RAG and agents.

chatgpt data-cleaning data-curation data-observability data-quality llms observability rag

Last synced: 03 Dec 2024