https://github.com/jaypanchal9/spotless-data

Spotless Data: A Python-based workflow using Jupyter Notebooks for efficient data cleaning, preprocessing, handling missing values, correcting outliers, and integrating external datasets. Ideal for quick, reliable, and clean data preparation.

data-cleaning data-preprocessing data-wrangling matplotlib numpy pandas python3


# Spotless Data

Spotless Data is a structured repository for efficient data cleaning and preprocessing with Python and Jupyter Notebooks. Its tasks simplify preparing datasets for analysis by identifying and correcting issues such as missing values, inconsistencies, and outliers.

## Project Overview
This repository contains two Jupyter Notebooks, each targeting specific aspects of data cleaning and preprocessing:

### Task 1: **Data Cleaning and Preprocessing**
- **Purpose:** Advanced data cleaning, including outlier detection and correction.
- **Key Components:**
  - Mounting Google Drive to access datasets.
  - Identification, analysis, and treatment of outliers.
  - Libraries utilized: `pandas`, `numpy`, and additional supporting libraries.
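
The outlier identification and treatment step can be sketched with `pandas` as follows. This is a minimal illustration that assumes the interquartile-range (Tukey fence) rule; the notebook may use a different method, and the `delivery_fee` column and its values are hypothetical.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the lower and upper Tukey fences for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask that is True where a value lies outside the fences."""
    lower, upper = iqr_bounds(series, k)
    return (series < lower) | (series > upper)


def clip_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Pull values outside the fences back to the nearest fence."""
    lower, upper = iqr_bounds(series, k)
    return series.clip(lower=lower, upper=upper)


# Hypothetical sample data, not taken from the repository's datasets.
fees = pd.Series([10.0, 12.0, 11.0, 13.0, 12.5, 95.0], name="delivery_fee")

mask = flag_outliers(fees)      # identification
cleaned = clip_outliers(fees)   # treatment by clipping

print(fees[mask])
print(cleaned)
```

Clipping keeps every row, which suits small datasets; dropping the flagged rows with `fees[~mask]` is the alternative when the extreme values are believed to be errors.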

### Task 2: **Data Loading and Cleaning Workflow**
- **Purpose:** Fundamental data loading procedures and initial data cleaning.
- **Key Components:**
  - Mounting Google Drive for dataset loading.
  - Basic operations for cleaning datasets, including handling missing data.
  - Libraries utilized: `pandas`, `numpy`, and additional supporting libraries.
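
A rough sketch of the loading and missing-data steps is shown below. The helper names, the column names, and the imputation strategy (median for numeric columns, most frequent value otherwise) are assumptions made for illustration, not the notebook's actual code.

```python
import numpy as np
import pandas as pd


def mount_drive_if_colab(mount_point: str = "/content/drive") -> bool:
    """Mount Google Drive when running in Colab; do nothing elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


def summarize_missing(df: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only affected columns."""
    counts = df.isna().sum()
    return counts[counts > 0]


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric gaps set to the median and others to the mode."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode(dropna=True)
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])
    return out


# Hypothetical sample data standing in for a dataset loaded from Drive.
orders = pd.DataFrame({
    "order_price": [20.0, np.nan, 30.0, 40.0],
    "season": ["Summer", "Winter", None, "Summer"],
})

print(summarize_missing(orders))
filled = fill_missing(orders)
print(filled)
```

In Colab, calling `mount_drive_if_colab()` before `pd.read_csv(...)` makes files under the mount point readable; on a local machine the helper returns `False` and local paths can be used instead.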

## Getting Started

### Prerequisites
Ensure you have the following installed:
- Python 3.x
- Jupyter Notebook
- Essential Python libraries:
  - `pandas`
  - `numpy`
  - `matplotlib` (optional for visualizations)
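
One way to confirm these prerequisites before opening the notebooks is a short check like the one below. It is only a convenience sketch; the `missing_packages` helper is hypothetical and not part of the repository.

```python
import importlib.util
import sys

REQUIRED = ("pandas", "numpy")
OPTIONAL = ("matplotlib",)


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Running Python 3:", sys.version_info.major == 3)
print("Missing required packages:", missing_packages(REQUIRED))
print("Missing optional packages:", missing_packages(OPTIONAL))
```

An empty list for the required packages means the notebooks' core imports should succeed.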

### Installation
Clone the repository and set up the environment:

```bash
git clone https://github.com/jaypanchal9/spotless-data
cd spotless-data
pip install -r requirements.txt
```

### Usage
- Open Jupyter Notebook or a compatible environment such as Google Colab.
- Execute notebooks sequentially by following the provided instructions within each notebook.

## Repository Structure
```
.
├── notebooks/
│ ├── Data_Cleaning_and_Preprocessing.ipynb
│ └── Data_Loading_and_Cleaning_Workflow.ipynb
├── data/
│ ├── Group010_dirty_data_solution.csv
│ ├── Group010_missing_data_solution.csv
│ ├── Group010_outlier_data_solution.csv
│ ├── suburb_info.xlsx
│ └── warehouses.xlsx
├── requirements.txt
└── README.md
```
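
Given this layout, the files under `data/` could be read with a small helper such as the one below. The `load_table` function is a hypothetical convenience, not code from the notebooks, and reading `.xlsx` files with `pandas.read_excel` additionally requires an Excel engine such as `openpyxl` to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

DATA_DIR = Path("data")  # matches the layout shown above


def load_table(path) -> pd.DataFrame:
    """Load a CSV or Excel file into a DataFrame based on its extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xlsx", ".xls"}:
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")


# Round-trip demo on a temporary CSV so the sketch runs without the repo data.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.csv"
    pd.DataFrame({"a": [1, 2]}).to_csv(sample_path, index=False)
    sample = load_table(sample_path)

print(sample)
```

From the repository root, a call such as `load_table(DATA_DIR / "Group010_missing_data_solution.csv")` or `load_table(DATA_DIR / "warehouses.xlsx")` would follow the same pattern.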

## Authors
- Jay Panchal
- Abhishek Adhikary

## License
This project is licensed under the GNU General Public License v3.0. See the [LICENSE](LICENSE) file for details.

## Acknowledgments
- Python Official Documentation
- Contributors and maintainers of utilized open-source libraries