https://github.com/engjurado/exploratory-data-analysis-eda-for-data-science-and-ml

This repository contains materials for performing Exploratory Data Analysis (EDA) in data science and machine learning projects. Learn how to summarize, visualize, and clean data for regression and classification tasks, improving model performance through insights gained from EDA techniques.
# Exploratory Data Analysis (EDA) for Data Science and ML

This project is part of a hands-on, beginner-friendly course created by IBM, available at [CognitiveClass.ai](https://cognitiveclass.ai/), aimed at teaching effective EDA techniques to enhance data science and machine learning projects.

## Overview

Exploratory Data Analysis (EDA) is one of the most crucial steps in any data science project. It involves inspecting, cleaning, transforming, and visualizing data to extract meaningful insights, which will, in turn, guide your data modeling and machine learning workflows. This repository offers a comprehensive guide to performing EDA with a focus on:

- **Data Summarization**: Understanding the main features of your dataset.
- **Visualization**: Plotting and visualizing trends, patterns, and outliers.
- **Correlation Analysis**: Identifying relationships between variables.
- **Handling Missing Data**: Techniques for managing incomplete datasets.
- **Outlier Detection**: Recognizing anomalies in the data.
- **Insights for Model Building**: Using EDA to inform decisions on feature selection, transformations, and model selection.
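The summarization and correlation items above can be sketched with pandas. This is a minimal illustration, not the course notebook's code: the dataset is synthetic, and the column names (`area`, `rooms`, `price`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data, invented purely for illustration (not from the notebook).
rng = np.random.default_rng(seed=0)
n_rows = 200
area = rng.normal(loc=120.0, scale=30.0, size=n_rows)
rooms = rng.integers(low=1, high=6, size=n_rows)  # integers 1 to 5
price = 1500.0 * area + 10000.0 * rooms + rng.normal(0.0, 20000.0, n_rows)

df = pd.DataFrame({"area": area, "rooms": rooms, "price": price})

# Data summarization: count, mean, std, min, quartiles, and max per column.
summary = df.describe()
print(summary)

# Correlation analysis: pairwise Pearson correlation between numeric columns.
corr = df.corr()
print(corr)

# Rank features by absolute correlation with the hypothetical target column.
target_corr = corr["price"].drop("price").abs().sort_values(ascending=False)
print(target_corr)
```

Ranking features by absolute correlation with the target is only a first-pass screen; it captures linear relationships and can miss non-linear ones.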

The repository focuses on EDA for both **regression** and **classification** problems, which are common in machine learning.
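As a rough sketch of how a first look at the target can differ between the two problem types, the snippet below checks class proportions for a classification target and skewness for a regression target. Both series are made-up values for illustration, and these two checks are assumptions about what is useful here, not steps taken from the notebook.

```python
import pandas as pd

# Made-up targets for illustration only.
class_target = pd.Series(["spam", "ham", "ham", "ham", "spam", "ham", "ham", "ham"])
reg_target = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0])

# Classification: class proportions reveal how imbalanced the labels are.
class_share = class_target.value_counts(normalize=True)
print(class_share)

# Regression: skewness far from 0 suggests a long tail in the target,
# which may motivate a transformation such as a logarithm.
skewness = reg_target.skew()
print(skewness)
```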

## What You'll Learn

- How to perform basic EDA to uncover key insights from your dataset.
- The use of various visualization techniques (e.g., histograms, scatter plots, heatmaps) to explore data patterns.
- Methods for handling missing data and identifying outliers.
- Insights to improve feature selection and model performance for regression and classification tasks.
- How to set up a solid foundation for machine learning models with EDA.
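One possible way to handle missing values and flag outliers is shown below, assuming median imputation and the 1.5 × IQR rule. Both choices, the tiny dataset, and the column names (`age`, `income`) are illustrative assumptions; the notebook may use other techniques.

```python
import numpy as np
import pandas as pd

# Small made-up dataset with one missing value and one obvious outlier.
df = pd.DataFrame(
    {
        "age": [25.0, 31.0, np.nan, 40.0, 29.0, 35.0],
        "income": [42000.0, 48000.0, 51000.0, 45000.0, 400000.0, 47000.0],
    }
)

# Count missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# One simple option: fill numeric gaps with the column median.
df_filled = df.fillna(df.median(numeric_only=True))

# IQR rule: flag values more than 1.5 * IQR below Q1 or above Q3.
q1 = df_filled["income"].quantile(0.25)
q3 = df_filled["income"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df_filled["income"] < q1 - 1.5 * iqr) | (
    df_filled["income"] > q3 + 1.5 * iqr
)
print(df_filled[is_outlier])
```

Flagged rows are candidates for inspection rather than automatic removal; whether an extreme value is an error or a genuine observation depends on the data.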

## Requirements

- Python 3.x
- Jupyter Notebook
- Required libraries:
  - pandas
  - NumPy
  - SciPy
  - matplotlib
  - seaborn
  - scikit-learn
  - MissingNo
  - fasteda

You can install the required libraries by running:

```bash
pip install pandas numpy matplotlib seaborn scikit-learn fasteda missingno scipy
```
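After installing, a quick check like the following can confirm that the libraries are importable in the current environment. It is a small sketch using only the standard library; the helper name `find_missing` is hypothetical, and the list uses import names, which differ from the package name for scikit-learn (`sklearn`).

```python
from importlib.util import find_spec

# Import names for the libraries listed above; scikit-learn is imported
# as "sklearn".
REQUIRED_MODULES = [
    "pandas",
    "numpy",
    "scipy",
    "matplotlib",
    "seaborn",
    "sklearn",
    "missingno",
    "fasteda",
]


def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```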

## Files in This Repository

- `EDA lab.ipynb`: Jupyter Notebook containing the EDA process.
- `README.md`: Project overview and instructions (this file).

## How to Use

1. Clone the repository:
```bash
git clone https://github.com/EngJurado/Exploratory-Data-Analysis-EDA-for-Data-Science-and-ML.git
```
2. Navigate to the project directory:
```bash
cd Exploratory-Data-Analysis-EDA-for-Data-Science-and-ML
```
3. Open the Jupyter Notebook to explore and run the EDA processes step by step:
```bash
jupyter notebook "EDA lab.ipynb"
```

## Resources

- **Course Website**: [Exploratory Data Analysis (EDA) for Data Science and ML](https://cognitiveclass.ai/)
- **Python Documentation**: [https://docs.python.org/3/](https://docs.python.org/3/)
- **Pandas Documentation**: [https://pandas.pydata.org/](https://pandas.pydata.org/)
- **Seaborn Documentation**: [https://seaborn.pydata.org/](https://seaborn.pydata.org/)

## License

This project is licensed under the MIT License - see the LICENSE file for details.