Projects in Awesome Lists tagged with data-cleaning
A curated list of projects in awesome lists tagged with data-cleaning.
https://github.com/cleanlab/cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
active-learning annotation data-centric-ai data-cleaning data-curation data-labeling data-profiling data-quality data-science data-validation dataops dataquality datasets exploratory-data-analysis labeling llms noisy-labels out-of-distribution-detection outlier-detection weak-supervision
Last synced: 12 May 2025
https://github.com/voxel51/fiftyone
Refine high-quality datasets and visual AI models
active-learning artificial-intelligence computer-vision data-centric-ai data-cleaning data-curation data-quality data-science deep-learning developer-tools image-classification machine-learning object-detection python unstructured-data vector-search visualization
Last synced: 12 May 2025
https://github.com/johnkerl/miller
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
command-line command-line-tools csv csv-format data-cleaning data-processing data-reduction data-regression devops devops-tools json json-data miller statistical-analysis statistics streaming-algorithms streaming-data tabular-data tsv unix-toolkit
Last synced: 14 May 2025
https://github.com/unionai-oss/pandera
A light-weight, flexible, and expressive statistical data testing library
assertions data-assertions data-check data-cleaning data-processing data-validation data-verification dataframe-schema dataframes hypothesis-testing pandas pandas-dataframe pandas-validation pandas-validator schema testing testing-tools validation
Last synced: 12 Dec 2025
https://github.com/justmarkham/pandas-videos
Jupyter notebook and datasets from the pandas video series
data-analysis data-cleaning data-science jupyter-notebook pandas python tutorial
Last synced: 15 May 2025
https://github.com/justmarkham/dat8
General Assembly's 2015 Data Science course in Washington, DC
clustering course data-analysis data-cleaning data-science data-visualization decision-trees ensemble-learning jupyter-notebook linear-regression logistic-regression machine-learning model-evaluation naive-bayes natural-language-processing pandas python regular-expressions scikit-learn web-scraping
Last synced: 15 May 2025
https://github.com/hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
big-data-cleaning bigdata cudf dask dask-cudf data-analysis data-cleaner data-cleaning data-cleansing data-exploration data-extraction data-preparation data-profiling data-science data-transformation data-wrangling machine-learning pyspark spark
Last synced: 14 May 2025
https://github.com/sfirke/janitor
Simple tools for data cleaning in R
data-analysis data-cleaning data-science dirty-data excel pivot-tables r spss tabulations tidyverse
Last synced: 13 May 2025
https://github.com/skrub-data/skrub
Machine learning with dataframes
data data-analysis data-cleaning data-preparation data-preprocessing data-science data-wrangling dataframe dataframes dirty-data machine-learning
Last synced: 06 Jan 2026
https://github.com/data-forge/data-forge-ts
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
csv data data-analysis data-cleaning data-cleansing data-forge data-management data-manipulation data-munging data-visualization data-wrangling javascript json linq nodejs pandas visualization
Last synced: 13 May 2025
https://github.com/ECNU-ICALK/EduChat
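An open-source educational chat model from ICALK, East China Normal University. An open-source Chinese-English educational dialogue large language model (general-purpose base model, GPU deployment, data cleaning). With thanks to: LLaMA, MOSS, BELLE, Ziya, vLLM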
belle chinese-nlp data-cleaning education llama llm moss open-models
Last synced: 01 Apr 2025
https://github.com/schema-inspector/schema-inspector
Schema-Inspector is a simple JavaScript object sanitization and validation module.
data-cleaning javascript sanitization validation
Last synced: 14 May 2025
https://github.com/akanz1/klib
Easy to use Python library of customized functions for cleaning and analyzing data.
data-analysis data-cleaning data-preprocessing data-science data-visualization feature-selection klib python
Last synced: 21 Oct 2025
https://github.com/desbordante/desbordante-core
Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It can also run data cleaning scenarios built on these algorithms. Desbordante has a console version and an easy-to-use web application.
anomaly-detection correlations data-analytics data-cleaning data-cleansing data-engineering data-exploration data-mining data-mining-algorithms data-preprocessing data-profiling data-science data-wrangling exploratory-data-analysis feature-engineering feature-extraction feature-selection knowledge-discovery spreadsheets tabular-data
Last synced: 22 Nov 2025
https://github.com/data-cleaning/validate
Professional data validation for the R environment
Last synced: 21 Oct 2025
https://github.com/jim-schwoebel/voicebook
🗣️ A book and repo to get you started programming voice computing applications in Python (10 chapters and 200+ scripts).
data data-cleaning encryption-decryption featurization generation machine-learning python3 security server transcription visualization voice voice-activity-detection voice-assistant voice-computing voice-control voice-recognition voice-recording wake-word-detection
Last synced: 06 Apr 2025
https://github.com/msamogh/nonechucks
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
data-cleaning data-pipeline data-preprocessing data-processing machine-learning preprocessing pytorch torch
Last synced: 07 May 2025
https://github.com/rasgointelligence/feature-engineering-tutorials
Data Science Feature Engineering and Selection Tutorials
data-cleaning data-science exploratory-data-analysis feature-engineering feature-selection features jupyter machine-learning notebook pandas pandas-profiling pyrasgo python scikit-learn sweetviz tutorial tutorials xgboost
Last synced: 14 Jun 2025
https://github.com/cambioml/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering
LLM-based text extraction from unstructured data such as PDFs, Word documents, and HTML pages. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster R&D!
data-cleaning generative-ai llm
Last synced: 11 Oct 2025
https://github.com/probcomp/pclean
A domain-specific probabilistic programming language for scalable Bayesian data cleaning
bayesian-inference data-cleaning data-cleansing probabilistic-graphical-models probabilistic-programming
Last synced: 08 May 2025
https://github.com/genomoncology/FuzzTypes
Pydantic extension for annotating autocorrecting fields.
data-cleaning fuzzy-string-matching named-entity-linking pydantic
Last synced: 11 May 2025
https://github.com/charlesdedampierre/BunkaTopics
🗺️ Data Cleaning and Textual Data Visualization 🗺️
cartography data-cleaning explainability fine-tuning llms machine-learning natural-language-processing nlp summarization topic-modeling
Last synced: 30 Aug 2025
https://github.com/ekstroem/datamaid
An R package for data screening
data-cleaning data-screening reproducible-research
Last synced: 09 Apr 2025
https://github.com/hi-primus/bumblebee
🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)
bumblebee cudf dask dask-cudf data-cleaning data-preparation data-profiling datasets gpu gui optimus prepare-data python
Last synced: 02 May 2025
https://github.com/jim-schwoebel/allie
🤖 An automated machine learning framework for audio, text, image, video, or .CSV files (50+ featurizers and 15+ model trainers). Python 3.6 required.
autokeras automl autopytorch data-augmentation data-cleaning data-cleaning-pipeline data-transformation data-visualization datasets deep-learning ludwig machine-learning machine-learning-api machine-learning-library machine-learning-models model-compression model-deployment tpot voice-computing
Last synced: 21 Aug 2025
https://github.com/iam-mhaseeb/skytrax-data-warehouse
A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for cloud data warehouse and Metabase to serve the needs of data visualizations such as analytical dashboards.
airflow data-analysis data-analytics data-cleaning data-engineering data-orchestration data-processing data-visualization data-warehouse data-warehousing database docker metabase python python3 redshift s3 s3-bucket sql
Last synced: 12 Aug 2025
https://github.com/datawithbaraa/sql-data-warehouse-project
A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.
data-analysis data-analytics data-cleaning data-engineering data-lakehouse data-science data-warehouse data-warehousing datalake datascience datawarehouse datawarehousing etl etl-job etl-pipeline medallion-architecture sql sql-query sql-server sqlserver
Last synced: 06 Apr 2025
https://github.com/ChrisMuir/refinr
Cluster and merge similar string values: an R implementation of Open Refine clustering algorithms
approximate-string-matching clustering cran data-cleaning data-clustering fuzzy-matching ngram openrefine r rstats
Last synced: 15 Mar 2025
https://github.com/aai-institute/pyDVL
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
banzhaf-index data-centric-ai data-cleaning data-pruning data-quality data-valuation game-theory influence-functions least-core machine-learning robust-machine-learning shapley-value transferlab
Last synced: 11 May 2025
https://github.com/sail-sg/sailcraft
🚢 Data Toolkit for Sailor Language Models
data-cleaning data-deduplication
Last synced: 05 Oct 2025
https://github.com/lolei/redditcleaner
Cleans Reddit Text Data 📜 🧹
data-cleaning hacktoberfest nlp praw psaw pushshift python reddit text-data
Last synced: 22 Jul 2025
https://github.com/cosbidev/pytrack
A Map-Matching-based Python Toolbox for Vehicle Trajectory Reconstruction
computer-vision data-cleaning gps-tracker graph intelligent-transportation-systems map-match map-matching maps network-graph networkx openstreetmap python snapping street-view topology tracking trajectory-analysis visualization
Last synced: 13 Oct 2025
https://github.com/renumics/sliceguard
A library for detecting problematic data segments in structured and unstructured data with few lines of code.
data-analysis data-cleaning data-curation data-exploration data-science data-visualization deep-learning eda exploratory-data-analysis machine-learning python visualization
Last synced: 16 Mar 2025
https://github.com/akvo/akvo-lumen
Make sense of your data
agplv3 akvo akvo-lumen clojure d3 data-cleaning data-visualization react
Last synced: 29 Aug 2025
https://github.com/rvanasa/pandas-gpt
Power up your data science workflow with ChatGPT.
chatgpt claude-ai data-cleaning data-engineering data-science data-visualization gemini generative-ai gpt4 jupyter-notebook litellm low-code matplotlib numpy o1 openai pandas productivity scipy seaborn
Last synced: 09 May 2025
https://github.com/laureberti/learn2clean
Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning
automated data-cleaning data-cleaning-pipeline data-curation data-preprocessing reinforcement-learning
Last synced: 11 Sep 2025
https://github.com/ropensci/taxa
taxonomic classes for R
data-cleaning r r-package rstats taxon taxonomy
Last synced: 21 Oct 2025
https://github.com/elysian01/data-purifier
A Python library for Automated Exploratory Data Analysis, Automated Data Cleaning, and Automated Data Preprocessing For Machine Learning and Natural Language Processing Applications in Python.
data-analysis data-cleaning data-cleaning-pipeline data-preprocessing data-science data-visualization datapurifier eda exploratory-data-analysis jupyter python-lib python-library python3
Last synced: 04 Oct 2025
https://github.com/mramshaw/data-cleaning
Data Cleaning with Python
data-cleaning data-munging data-wrangling numpy pandas python python3
Last synced: 21 Aug 2025
https://github.com/ammsa/dtcleaner
DTCleaner: data cleaning using multi-target decision trees.
data-cleaning data-mining data-preprocessing data-quality data-science data-wrangling
Last synced: 21 Mar 2025
https://github.com/theronione/cleaner.jl
A toolbox of simple solutions for common data cleaning problems.
Last synced: 24 Oct 2025
https://github.com/mrankitgupta/sales-insights-data-analysis-using-tableau-and-sql
Sales insights for an India-based hardware company: a data analysis project performed with Tableau and SQL
66daysofdata analysis analytics ankitgupta data-analysis data-cleaning data-science data-visualization excel mrankitgupta mysql powerbi rdbms sql sql-server statistics tableau tableau-dashboards tableau-desktop tableau-public
Last synced: 13 Aug 2025
https://github.com/cleanlab/cleanlab-studio
Client interface to Cleanlab Studio and the Trustworthy Language Model
annotations automl computer-vision data-centric-ai data-cleaning data-curation data-labeling data-profiling data-quality data-science data-validation image-classification llm machine-learning model-deployment natural-language-processing noisy-labels outlier-detection structured-data text-classification
Last synced: 13 Apr 2025
https://github.com/irsol/udacity-bertelsmann-data-science-challenge-scholarship-2018
This is a repo for my Bertelsmann Data Science Scholarship Challenge: notes, exercises, quizzes.
aggregation bertelsmann challenge control-flow data-cleaning data-science data-visualization python scholarship sql statistics udacity udacity-course udacity-scholarship-course udacity2018 variability
Last synced: 23 Mar 2025
https://github.com/jmcastagnetto/covid-19-data-cleanup
Scripts to clean up data from https://github.com/CSSEGISandData/COVID-19
covid-19 covid-19-data data-cleaning data-visualization datasets r
Last synced: 17 Apr 2025
https://github.com/facultyai/boltzmannclean
Fill missing values in Pandas DataFrames using Restricted Boltzmann Machines
data-cleaning data-science dataframe pandas restricted-boltzmann-machine
Last synced: 27 Jun 2025
https://github.com/the-hull/datacleanr
Interactive and Reproducible Data Cleaning
annotation-tool data-cleaning outlier-detection outlier-removal reproducibility
Last synced: 22 Oct 2025
https://github.com/data-cleaning/errorlocate
Find and replace erroneous fields in data using validation rules
data-cleaning errors invalidation r
Last synced: 22 Oct 2025
https://github.com/jkminder/data2neo
Data2Neo is a library that simplifies the conversion of data in relational format to a graph knowledge database.
data-cleaning data-conversion data-engineering data2neo database-migrations graphs neo4j relational-databases remodeling
Last synced: 12 Apr 2025
https://github.com/rubydamodar/the-ultimate-pandas-bootcamp
Welcome to the Pandas for Data Science repository! This course is designed to take you from beginner to proficient in using Pandas, the powerful data manipulation library in Python. Whether you're just starting your data science journey or looking to sharpen your skills, this repository contains all the resources
beginner-friendly csv-data data-analysis data-cleaning data-manipulation data-science data-visualization dataframe exploratory-data-analysis jupyter-notebook machine-learning matplotlib numpy pandas python python-pandas series statistical-analysis time-series titanic-dataset
Last synced: 19 Apr 2025
https://github.com/amine-smahi/r-learning-journey
Some of the projects I made when starting to learn R for Data Science at university
afc cpa data-cleaning data-integration data-science datascience r r-language
Last synced: 18 Mar 2025
https://github.com/bakdata/dedupe
Java DSL for (online) deduplication
data-cleaning data-cleansing deduplication duplicate-detection duplicate-removal
Last synced: 10 Apr 2025
https://github.com/catalyst/moodle-local_datacleaner
Reduce, filter, and anonymize Moodle data for non-prod environments
anonymize data-cleaning datacleaner moodle php plugin
Last synced: 25 Jul 2025
https://github.com/santoshlite/quantclean
🧹 Quantclean is a program that reformats financial datasets to the US Equity TradeBar format (QuantConnect format)
algo-trading algorithmic-trading data-cleaning finance financial-data futures lean-engine ohlcv options quandl quant quantconnect quantitative-finance quantitative-trading stock-data stock-market stocks trading-algorithms trading-bot trading-strategies
Last synced: 14 Dec 2025
https://github.com/aifred-health/vulcanai
A high level deep learning framework for quickly prototyping networks with added tools in data visualisation, model interpretability and performance metrics
data-analysis data-cleaning data-science data-visualization deep-learning deep-neural-networks feature-engineering mental-health python3 pytorch scikit-learn
Last synced: 01 Aug 2025
https://github.com/bbva/mercury-dataschema
Utility package that, given a Pandas DataFrame, uses the DataSchema class to auto-infer feature types and automatically calculate different statistics depending on those types.
analytics data data-cleaning data-processing data-science feature-engineering
Last synced: 21 Jun 2025
https://github.com/data-cleaning/validatetools
data-cleaning r rules validation
Last synced: 22 Oct 2025
https://github.com/facultyai/ipydataclean
Interactive cleaning for Pandas DataFrames
data-cleaning data-science dataframe jupyter-notebook pandas
Last synced: 26 Aug 2025
https://github.com/firaskahlaoui/heart-disease-prediction
The Heart Disease Prediction project aims to predict the likelihood of heart disease using machine learning techniques.
data-cleaning data-visualization flask jupyter-notebook kaggle-dataset model-building python3
Last synced: 14 Apr 2025
https://github.com/chinmayrane16/titanic-survival-in-depth-analysis
Uses the Pandas, Matplotlib, and Seaborn libraries to analyze, visualize, and explore data on people travelling on the Titanic, and Scikit-learn modelling algorithms to predict their probability of survival.
classification-model data-cleaning data-visualization feature-engineering matplotlib numpy pandas seaborn
Last synced: 11 Oct 2025
https://github.com/lukashedegaard/datasetops
Fluent dataset operations, compatible with your favorite libraries
data-cleaning data-munging data-processing data-science data-wrangling dataset dataset-combinations deep-learning multiple-datasets pytorch tensorflow
Last synced: 23 Apr 2025
https://github.com/aicorsair/dataquest-data-science-analysis-projects
A repository dedicated to storing guided projects completed while learning data science concepts with Dataquest.
classification-models cluster-analysis data-analysis data-analytics data-cleaning data-preparation data-preprocessing data-science data-visualization deep-learning excel feature-engineering machine-learning pandas-dataframe power-bi python-3 regression-models scikit-learn sql web-scraping
Last synced: 27 Oct 2025
https://github.com/kemingy/plane
A text processing tool that includes tag (HTML, URL, email) extraction and removal, punctuation normalization, simple segmentation, and so on.
chinese-nlp data-cleaning nlp preprocess regex tokenization tokenizer
Last synced: 17 Mar 2025
https://github.com/incubated-geek-cc/text-manipulation
A browser-based text-manipulation toolkit. No server required. Re-designed version of https://textmechanic.com/
css data-cleaning html javascript productivity text-editor
Last synced: 14 Apr 2025
https://github.com/vida-nyu/openclean-core
Data Cleaning and Data Profiling Library for Python
data-cleaning data-curation hacktoberfest
Last synced: 10 Apr 2025
https://github.com/waynejz/comp9321-19t1
COMP9321 Data Services Engineering 2019T1
backend data-cleaning data-services data-visualization
Last synced: 18 Aug 2025
https://github.com/data-forge/data-forge-fs
This library contains the file system extensions to Data-Forge that allow it to directly read and write CSV and JSON files in Node.js
csv data data-analysis data-cleaning data-cleansing data-forge data-management data-manipulation data-munging data-visualization data-wrangling javascript json linq nodejs pandas visualization
Last synced: 04 Sep 2025
https://github.com/jchehe/xcel
[Project has moved to the team's GitHub] This repository will therefore only sync the latest README.md. To watch, star, or fork the project, please go to the team's GitHub. Thank you.
Last synced: 17 Jul 2025
https://github.com/ddayto21/nba-time-series-forecasts
This repo contains machine learning applications that use time-series forecasts to predict the probability of certain players winning the MVP award in the National Basketball Association
beautifulsoup4 data-cleaning machine-learning nba nba-mvp-prediction python requests-library-python
Last synced: 30 Apr 2025
https://github.com/jay0lee/cmdc
Chrome Managed Data Cleanup - https://chrome.google.com/webstore/detail/chrome-managed-data-clean/anfhmiaflneaeffhlmbcedfjakdlpleg
cache cookies data-cleaning g-suite google-chrome google-chrome-extension javascript
Last synced: 12 May 2025
https://github.com/saisurajmatta/bike-sales-excel-dashboard-project
Bike Sales Excel Dashboard Project: Analyzed and visualized sales data, cleaned datasets, and created interactive dashboards in Excel.
data-analysis data-analytics data-cleaning data-visualization excel excel-dashboard excel-data-analytics pivot-tables
Last synced: 11 Aug 2025
https://github.com/hypertextassassin0273/excel_data_organizer_and_cleaner-ds_project
Data Structures project in C++11 language, uses custom Vector & String structures with Move Semantics (Rule of Five)
cpp11 data-cleaning data-cleansing data-structure-projects data-structures data-structures-project data-wrangling ds-projects easy-project excel-operations move-semantics object-oriented-programming oop open-source open-source-code open-source-project rule-of-five string university-project vector
Last synced: 30 Jun 2025
https://github.com/benedekrozemberczki/av_ultimate_student_hunt
Solution for the Ultimate Student Hunt Challenge (1st place).
analytics-vidhya-competition competition data-cleaning data-engineering data-engineering-pipeline distributed-machine-learning driven-data extreme-gradient-boosting forecasting gradient-boosting kaggle machine-learning r student-hunt supervised-learning weather-forecast winning-entry xgboost
Last synced: 20 Jun 2025
https://github.com/jacobmarks/image-deduplication-plugin
Remove exact and approximate duplicates from your dataset in FiftyOne!
computer-vision data-cleaning deduplication fiftyone image-processing plugin python similarity
Last synced: 31 Oct 2025
https://github.com/abhifuturetech/eda-rollercoaster
This repository contains an exploratory data analysis (EDA) project focused on roller coasters. The project involved organizing, cleaning, and visualizing the data to gain insights into roller coasters' characteristics and performance.
data-cleaning data-visualization mathplotlib mysql-database numpy python seaborn
Last synced: 10 Aug 2025
https://github.com/datapreprocessing/datacleaning
Data Cleaning is a python package for data preprocessing. This cleans the CSV file and returns the cleaned data frame. It does the work of imputation, removing duplicates, replacing special characters, and many more.
data data-cleaning data-cleansing data-preprocessing data-wrangling imputation python threshold
Last synced: 14 Dec 2025
https://github.com/epiverse-trace/cleanepi
R package to clean and standardize epidemiological data
data-cleaning epidemiology epiverse r r-package
Last synced: 26 Jul 2025
https://github.com/brunocampos01/allstate-claims-severity
Udacity Machine Learning Engineer Nanodegree capstone proposal.
allstate capstone-proposal challenge data-analyst-nanodegree data-cleaning data-engineering data-science data-visualization dataset deep-learning kaggle machine-learning pca-analysis pt-br python udacity-machine-learning-nanodegree
Last synced: 15 Apr 2025
https://github.com/amey-thakur/kaggle
Kaggle Courses - All Exercises of the respective courses.
amey ameythakur courses data-cleaning data-manipulation data-science data-visualization deep-learning feature-engineering intro-to-ml kaggle machine-learning machine-learning-explainability python
Last synced: 25 Aug 2025
https://github.com/yaph/james-bond-actors
Script to grab Freebase data about James Bond actors and generate gexf data file.
data-cleaning data-processing data-retrieval freebase james-bond-actors network-graph
Last synced: 08 Sep 2025
https://github.com/marksweiss/sofine
Lightweight framework for creating data-collecting plugins and chaining calls to them from CLI, REST or Python to return unified data sets.
cross-language data-cleaning data-processing data-retrieval json python
Last synced: 29 Jul 2025
https://github.com/sayakpaul/analytics-vidhya-game-of-deep-learning-hackathon
Contains my experiments for the Game of Deep Learning Hackathon conducted by Analytics Vidhya
active-learning analytics-vidhya computer-vision data-cleaning deep-learning fastai label-noise
Last synced: 28 Jul 2025
https://github.com/spider-rs/readability
The readability library for LLMs
clean-data data-cleaning llm-training readability rust safari-reader
Last synced: 05 Apr 2025
https://github.com/chaitanyac22/car-price-prediction-model-for-an-automobile-consulting-company
The goal of this project is to build multiple linear regression models for the prediction of car prices.
business-analytics data-analytics data-cleaning data-manipulation data-visualization exploratory-data-analysis feature-engineering machine-learning model-building model-evaluation prediction-model python3 residual-analysis statistics
Last synced: 13 Apr 2025
https://github.com/siddeshsambasivam/ntuoss-datascraping-and-datacleaning-workshop
This repository contains the reference scripts and the content presented in the NTU OSS Data scraping and Data cleaning workshop.
data-cleaning data-crawling data-scraping
Last synced: 12 May 2025
https://github.com/memgonzales/pisa-2018-analysis
Jupyter notebook presenting the process of data preparation, research question formulation, data analysis, and data modeling with the goal of extracting insights from the 2018 PISA Dataset
data-cleaning data-modeling data-science data-visualization exploratory-data-analysis jupyter-notebook matplotlib numpy oecd-data pandas pisa scipy statistical-inference
Last synced: 13 Jun 2025
https://github.com/erictleung/tutorial-tidyverse
🌌 Presentation on the tidyverse in R to clean and manipulate data
data-cleaning data-manipulation data-science manipulate-data presentation programming r tidyverse tutorial
Last synced: 25 Mar 2025
https://github.com/erictleung/data-science
💻 Repository for teaching materials and notes on machine learning and data science for freeCodeCamp
data-cleaning data-engineering data-science data-visualization freecodecamp learning machine-learning mathematics notes python statistics
Last synced: 25 Mar 2025
https://github.com/chaitanyac22/lending-club-project---data-analysis-for-a-consumer-finance-company
Lending Club is a consumer finance company that specializes in lending various types of loans to urban customers. When the company receives a loan application, the company has to make a decision for loan approval based on the applicant’s profile. The project work aims to help the company in understanding the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilize this knowledge for its portfolio and risk assessment.
banking business-intelligence data-analysis data-cleaning data-manipulation data-visualization exploratory-data-analysis feature-engineering finance portfolio-management python3 risk-assessment statistics
Last synced: 23 Aug 2025
https://github.com/hrolive/from-data-to-insights-with-google-cloud-platform
A four-course accelerated online specialization that teaches participants how to derive insights through data analysis and visualization using the Google Cloud Platform
data-analysis data-cleaning data-preparation data-visualization sql
Last synced: 12 May 2025
https://github.com/vishnu-t-r/data-analytics-portfolio-projects
This repository contains data analyst portfolio projects developed using various data analytics tools, including SQL, Python, Tableau, and Looker.
data data-analysis data-cleaning data-modeling data-visualization looker looker-studio python sql ssms tableau
Last synced: 23 Apr 2025
https://github.com/data-cleaning/validatesuggest
Generate validation rules from data
Last synced: 22 Oct 2025
https://github.com/kwokhing/network-analysis-on-mrt-station
Demo applying network analysis to a network of connected railway stations, attempting to identify the important stations (nodes) in this network. Web scraping techniques using the rvest package are also briefly discussed.
betweenness-centrality closeness-centrality data-cleaning degree-centrality eigenvector-centrality gephi graph-analysis igraph r rvest social-network-analysis social-networks web-scraping xpath
Last synced: 13 Oct 2025