https://github.com/ashithapallath/feature-engineering
This repository contains a range of examples and techniques for feature engineering, aimed at improving dataset quality and boosting model performance. It covers essential methods such as Exploratory Data Analysis (EDA) and Interquartile Range (IQR) analysis for detecting and handling outliers.
exploratory-data-analysis feature-engineering iqr-method matplotlib numpy outlier-detection pandas python seaborn
- Host: GitHub
- URL: https://github.com/ashithapallath/feature-engineering
- Owner: ashithapallath
- Created: 2024-01-31T10:08:33.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-01-14T05:16:21.000Z (about 1 year ago)
- Last Synced: 2025-01-27T10:12:41.683Z (about 1 year ago)
- Topics: exploratory-data-analysis, feature-engineering, iqr-method, matplotlib, numpy, outlier-detection, pandas, python, seaborn
- Language: Jupyter Notebook
- Homepage:
- Size: 1 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Feature Engineering
This repository contains examples and techniques for feature engineering, focusing on improving dataset quality and enhancing model performance. It covers critical aspects such as **Exploratory Data Analysis (EDA)** and **Interquartile Range (IQR) analysis** for outlier detection and handling.
## Features
This repository includes:
- **Exploratory Data Analysis (EDA)**:
  - Understanding data distribution.
  - Summary statistics and visualizations.
  - Insights into data trends and anomalies.
- **Outlier Detection using IQR**:
  - Identification of outliers based on the interquartile range.
  - Strategies for outlier handling (e.g., capping, removal).
- **Feature Engineering Techniques**:
  - Handling missing values.
  - Data normalization and scaling.
  - Feature transformation and encoding.
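
As a rough illustration of the feature engineering techniques listed above, the sketch below applies each of them to a small made-up DataFrame. It is not taken from the notebooks: the column names (`age`, `income`, `city`), the values, and the choice of median/mode imputation, min-max and z-score scaling, a log transform, and one-hot encoding are all assumptions picked for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values (not from the repository).
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 58.0],
    "income": [30000.0, 45000.0, 52000.0, np.nan, 250000.0],
    "city": ["Kochi", "Delhi", None, "Delhi", "Mumbai"],
})

# Handling missing values: median for numeric columns,
# most frequent value (mode) for the categorical column.
for col in ["age", "income"]:
    df[col] = df[col].fillna(df[col].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Normalization and scaling.
age = df["age"]
df["age_minmax"] = (age - age.min()) / (age.max() - age.min())  # range [0, 1]
df["age_zscore"] = (age - age.mean()) / age.std()  # pandas uses the sample std (ddof=1)

# Feature transformation: log(1 + x) compresses the long right tail of income.
df["income_log"] = np.log1p(df["income"])

# Encoding: one-hot encode the categorical column.
df = pd.get_dummies(df, columns=["city"], prefix="city")

print(df.head())
```

Median imputation is used here because it is less sensitive to extreme values than the mean; other strategies may suit a given dataset better.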
## Prerequisites
Ensure you have the following installed:
- Python 3.8+
- Required libraries:
  - NumPy
  - Pandas
  - Matplotlib
  - Seaborn
Install dependencies using:
```bash
pip install numpy pandas matplotlib seaborn
```
## How to Use
1. Clone the repository:
   ```bash
   git clone https://github.com/ashithapallath/Feature-Engineering.git
   cd Feature-Engineering
   ```
2. Explore the Jupyter Notebooks (`*.ipynb`), which include step-by-step explanations and implementations.
3. Launch Jupyter from the repository folder (Jupyter is not part of the dependency list above; install it with `pip install notebook` if needed):
   ```bash
   jupyter notebook
   ```
4. Follow the instructions in each notebook to reproduce the analyses and techniques.
## Techniques Overview
### **Exploratory Data Analysis (EDA)**
- Summarizing data using:
  - Descriptive statistics (mean, median, standard deviation, etc.).
  - Data visualizations (histograms, box plots, scatter plots).
- Identifying patterns, trends, and anomalies in the data.
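
The EDA steps above can be sketched roughly as follows. This is a minimal example, not the notebooks' own code: the DataFrame, its `age` and `income` columns, and the output file name `eda_overview.png` are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; replace with your own DataFrame.
df = pd.DataFrame({
    "age": [22, 25, 27, 30, 31, 35, 38, 41, 45, 95],
    "income": [28, 32, 35, 40, 42, 50, 55, 61, 70, 72],
})

# Descriptive statistics: count, mean, std, min, quartiles, and max per column.
summary = df.describe()
print(summary)

# The median is robust to extreme values, so a large gap between mean and
# median hints at skew or outliers.
medians = df.median()
print(medians)

# Histogram, box plot, and scatter plot side by side.
fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].hist(df["age"], bins=10)
axes[0].set_title("Histogram of age")
axes[1].boxplot(df["age"])
axes[1].set_title("Box plot of age")
axes[2].scatter(df["age"], df["income"])
axes[2].set_xlabel("age")
axes[2].set_ylabel("income")
axes[2].set_title("age vs. income")
fig.tight_layout()
fig.savefig("eda_overview.png")  # in a notebook the figure also renders inline
```

In this toy data the mean of `age` sits well above its median, which is the kind of anomaly the plots make visible at a glance.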
### **IQR-Based Outlier Detection**
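- Calculation of the interquartile range and the outlier bounds:
  ```python
  # `data` is assumed to be a pandas DataFrame with a numeric column named 'column'.
  Q1 = data['column'].quantile(0.25)  # first quartile (25th percentile)
  Q3 = data['column'].quantile(0.75)  # third quartile (75th percentile)
  IQR = Q3 - Q1                       # interquartile range
  lower_bound = Q1 - 1.5 * IQR
  upper_bound = Q3 + 1.5 * IQR
  # Rows whose value falls outside [lower_bound, upper_bound] are flagged as outliers.
  outliers = data[(data['column'] < lower_bound) | (data['column'] > upper_bound)]
  ```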
- Options for handling outliers:
  - Removing rows with outliers.
  - Capping values at lower and upper bounds.
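
Both handling options can be sketched with pandas as shown below. This is one possible approach under stated assumptions, not necessarily what the notebooks do: the helper `iqr_bounds` and the sample values are hypothetical.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) outlier bounds using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sample data with one obvious outlier (100).
data = pd.DataFrame({"column": [10, 12, 12, 13, 14, 15, 16, 18, 100]})
lower_bound, upper_bound = iqr_bounds(data["column"])

# Option 1: remove rows outside the bounds.
# `between` is inclusive on both ends; rows with NaN are dropped as well.
removed = data[data["column"].between(lower_bound, upper_bound)]

# Option 2: cap values at the bounds, keeping every row.
capped = data.assign(column=data["column"].clip(lower=lower_bound, upper=upper_bound))

print(f"bounds: [{lower_bound}, {upper_bound}]")
print(f"rows kept after removal: {len(removed)} of {len(data)}")
print(f"max after capping: {capped['column'].max()}")
```

Removal discards information from the affected rows' other columns, while capping keeps the row count unchanged at the cost of altering the extreme values; which trade-off is acceptable depends on the dataset.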
---
## Contribution
Contributions are welcome!
1. Fork the repository.
2. Create a branch for your feature or fix.
3. Submit a pull request with a description of your changes.
## License
This project is licensed under the MIT License.
## Acknowledgments
Special thanks to the open-source community for providing the tools and libraries that made this repository possible.