https://github.com/jaypanchal9/spotless-data
Spotless Data: A Python-based workflow using Jupyter Notebooks for efficient data cleaning, preprocessing, handling missing values, correcting outliers, and integrating external datasets ideal for quick, reliable, and clean data preparation.
data-cleaning data-preprocessing data-wrangling matplotlib numpy pandas python3
- Host: GitHub
- URL: https://github.com/jaypanchal9/spotless-data
- Owner: jaypanchal9
- License: gpl-3.0
- Created: 2025-03-07T18:18:31.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-03-07T18:28:11.000Z (11 months ago)
- Last Synced: 2025-03-07T19:25:38.083Z (11 months ago)
- Topics: data-cleaning, data-preprocessing, data-wrangling, matplotlib, numpy, pandas, python3
- Language: Jupyter Notebook
- Homepage:
- Size: 1020 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Spotless Data
Spotless Data is a structured repository for efficient data cleaning and preprocessing with Python and Jupyter Notebooks. Its tasks simplify preparing datasets for analysis by identifying and correcting issues such as missing values, inconsistencies, and outliers.
## Project Overview
This repository contains two Jupyter Notebooks, each targeting specific aspects of data cleaning and preprocessing:
### Task 1: **Data Cleaning and Preprocessing**
- **Purpose:** Advanced data cleaning and outlier detection and correction.
- **Key Components:**
  - Mounting Google Drive to access datasets.
  - Identification, analysis, and treatment of outliers.
  - Libraries utilized: `pandas`, `numpy`, and additional supporting libraries.
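
The notebook's exact outlier method is not described in this README, so the following is only a minimal sketch of one common approach, the 1.5 × IQR rule, written with `pandas`. The column name `price`, the sample values, and the helper names `iqr_bounds`, `flag_outliers`, and `clip_outliers` are hypothetical. Clipping values to the fences is one possible treatment and may differ from what the notebook does.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the lower and upper fences of the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask that is True where a value lies outside the fences."""
    lower, upper = iqr_bounds(series, k)
    return (series < lower) | (series > upper)


def clip_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Pull values outside the fences back to the nearest fence."""
    lower, upper = iqr_bounds(series, k)
    return series.clip(lower=lower, upper=upper)


# Hypothetical sample data, not taken from the repository's datasets.
df = pd.DataFrame({"price": [10.0, 12.0, 11.0, 13.0, 12.0, 95.0]})

outlier_mask = flag_outliers(df["price"])          # identification
df["price_clipped"] = clip_outliers(df["price"])   # treatment

print(df[outlier_mask])
print(df)
```

Keeping detection (`flag_outliers`) separate from treatment (`clip_outliers`) makes it possible to inspect the flagged rows before deciding whether to clip, drop, or correct them.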
### Task 2: **Data Loading and Cleaning Workflow**
- **Purpose:** Fundamental data loading procedures and initial data cleaning.
- **Key Components:**
  - Mounting Google Drive for dataset loading.
  - Basic operations for cleaning datasets, including handling missing data.
  - Libraries utilized: `pandas`, `numpy`, and additional supporting libraries.
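
As an illustration of the kind of missing-data handling this task covers, here is a minimal sketch using `pandas` and `numpy`. The column names, sample values, and the chosen strategies (dropping rows without an identifier, median fill for numbers, a placeholder for text) are assumptions made for the example and are not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the repository's datasets.
df = pd.DataFrame(
    {
        "order_id": ["A1", "A2", None, "A4"],
        "quantity": [2.0, np.nan, 5.0, 3.0],
        "region": ["north", None, "south", "north"],
    }
)

# 1. Inspect: count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# 2. Drop rows that lack the identifier, since they cannot be matched later.
cleaned = df.dropna(subset=["order_id"]).copy()

# 3. Fill numeric gaps with the column median (less sensitive to outliers
#    than the mean).
cleaned["quantity"] = cleaned["quantity"].fillna(cleaned["quantity"].median())

# 4. Fill categorical gaps with an explicit placeholder.
cleaned["region"] = cleaned["region"].fillna("unknown")

print(cleaned)
```

Which strategy is appropriate depends on the dataset; dropping, imputing, and flagging each trade completeness against the risk of introducing bias.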
## Getting Started
### Prerequisites
Ensure you have the following installed:
- Python 3.x
- Jupyter Notebook
- Essential Python libraries:
  - `pandas`
  - `numpy`
  - `matplotlib` (optional, for visualizations)
### Installation
Clone the repository and set up the environment:
```bash
git clone https://github.com/jaypanchal9/spotless-data
cd spotless-data
pip install -r requirements.txt
```
### Usage
- Open Jupyter Notebook or a compatible environment such as Google Colab.
- Execute notebooks sequentially by following the provided instructions within each notebook.
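
Both notebooks mount Google Drive, which only works inside Google Colab. The sketch below shows one way to make the same setup cell run in Colab and in a local Jupyter session. The Drive folder `/content/drive/MyDrive/spotless-data/data` and the names `IN_COLAB` and `DATA_DIR` are hypothetical; adjust them to wherever the datasets actually live.

```python
from pathlib import Path

try:
    # The google.colab package is only importable inside Google Colab.
    from google.colab import drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    # Prompts for authorization, then exposes Drive under /content/drive.
    drive.mount("/content/drive")
    # Hypothetical Drive location for the datasets.
    DATA_DIR = Path("/content/drive/MyDrive/spotless-data/data")
else:
    # Local run: use the data/ folder of the cloned repository.
    DATA_DIR = Path("data")

print(f"Running in Colab: {IN_COLAB}; reading data from {DATA_DIR}")
```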
## Repository Structure
```
.
├── notebooks/
│   ├── Data_Cleaning_and_Preprocessing.ipynb
│   └── Data_Loading_and_Cleaning_Workflow.ipynb
├── data/
│   ├── Group010_dirty_data_solution.csv
│   ├── Group010_missing_data_solution.csv
│   ├── Group010_outlier_data_solution.csv
│   ├── suburb_info.xlsx
│   └── warehouses.xlsx
├── requirements.txt
└── README.md
```
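
The `data/` folder mixes CSV and Excel files, so a small loader that dispatches on the file extension can be convenient. This is a minimal sketch, not code from the notebooks: the helper name `load_table` is hypothetical, and reading `.xlsx` files with `pandas.read_excel` additionally requires the `openpyxl` package, which may or may not be listed in `requirements.txt`.

```python
from pathlib import Path

import pandas as pd


def load_table(path) -> pd.DataFrame:
    """Load a CSV or Excel file into a DataFrame based on its extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xlsx", ".xls"}:
        # Reads the first sheet by default; .xlsx needs openpyxl installed.
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {path.name}")


# Load every supported file in data/ if the folder exists
# (assumes the script runs from the repository root).
data_dir = Path("data")
tables = {}
if data_dir.is_dir():
    for file in sorted(data_dir.iterdir()):
        if file.suffix.lower() in {".csv", ".xlsx", ".xls"}:
            tables[file.name] = load_table(file)

for name, table in tables.items():
    print(name, table.shape)
```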
## Authors
- Jay Panchal
- Abhishek Adhikary
## License
This project is licensed under the GNU General Public License v3.0. See the [LICENSE](LICENSE) file for details.
## Acknowledgments
- Python Official Documentation
- Contributors and maintainers of utilized open-source libraries