# 💧 Water Potability Analysis

## 📌 Project Overview
Water quality is a pressing global issue, and access to safe drinking water is essential for health and sustainability. This project aims to **predict the potability of water samples** from their chemical and physical attributes using **machine learning models**.

Three classification models were implemented and evaluated:
- **Logistic Regression**
- **Random Forest Classifier**
- **XGBoost Classifier** (🏆 Best Performing)

---

## 📂 Project Resources
- 🔹 **Dataset (CSV File)**: [Download water_potability.csv](https://github.com/SunnyRao07/Water-Quality-Analysis/blob/main/water_potability.csv)
- 🔹 **Project Code (.ipynb)**: [View Jupyter Notebook](https://github.com/SunnyRao07/Water-Quality-Analysis/blob/main/Water_Quality_Analysis.ipynb)
- 🔹 **Project Report (DOCX File)**: [Download Report](https://github.com/SunnyRao07/Water-Quality-Analysis/blob/main/Water%20Quality%20Analysis.docx)

---

## 🧾 Dataset Overview
- **Total Records**: 3,276
- **Target Variable**: `Potability` (1 = Safe to drink, 0 = Unsafe)
- **Features**:
  - `ph`
  - `Hardness`
  - `Solids`
  - `Chloramines`
  - `Sulfate`
  - `Conductivity`
  - `Organic_carbon`
  - `Trihalomethanes`
  - `Turbidity`

**Class Imbalance Notice**: Only ~39.8% of the samples are potable, so the classes are imbalanced and a resampling strategy such as **SMOTE** is used to compensate.
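
The class shares quoted above can be checked with a few lines of pandas. This is a minimal sketch: the `class_balance` helper and the five-row demo frame are made up for illustration, and the real data would be loaded from `water_potability.csv` instead.

```python
import pandas as pd

def class_balance(df: pd.DataFrame, target: str = "Potability") -> pd.Series:
    """Return the share of each class in the target column, as fractions."""
    return df[target].value_counts(normalize=True).sort_index()

# Made-up frame for illustration only; with the real file this would be
# df = pd.read_csv("water_potability.csv")
demo = pd.DataFrame({"Potability": [0, 0, 0, 1, 1]})
shares = class_balance(demo)
print(shares)  # index 0 is the non-potable share, index 1 the potable share
```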

---

## 🛠 Data Preprocessing
✔️ **Handling Missing Values**
- Imputed missing values in `ph`, `Sulfate`, and `Trihalomethanes` using **median imputation**
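
A minimal sketch of median imputation for the three columns named above. The helper name `impute_median` and the demo values are assumptions for illustration; the notebook may do this differently (for example with scikit-learn's `SimpleImputer`).

```python
import numpy as np
import pandas as pd

COLS_WITH_GAPS = ["ph", "Sulfate", "Trihalomethanes"]

def impute_median(df: pd.DataFrame, cols=COLS_WITH_GAPS) -> pd.DataFrame:
    """Fill NaNs in each listed column with that column's median."""
    out = df.copy()  # leave the caller's frame untouched
    for col in cols:
        out[col] = out[col].fillna(out[col].median())
    return out

# Made-up values, one gap per column
demo = pd.DataFrame({
    "ph": [6.5, np.nan, 7.5],
    "Sulfate": [300.0, 320.0, np.nan],
    "Trihalomethanes": [np.nan, 60.0, 80.0],
})
filled = impute_median(demo)
print(filled)
```

To avoid leaking test-set information, the medians would ideally be computed on the training split only and then reused for the test split.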

โœ”๏ธ **Feature Scaling**
- Standardized numerical columns using **StandardScaler**
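
A short sketch of standardization with scikit-learn's `StandardScaler`. The two synthetic columns stand in for features such as `ph` and `Hardness`, and the split parameters are assumptions; whether the notebook scales before or after splitting is not stated here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two numeric features (not the real data)
rng = np.random.default_rng(0)
X = rng.normal(loc=[7.0, 200.0], scale=[1.5, 30.0], size=(200, 2))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```

Fitting the scaler on the training split alone keeps test-set statistics out of the model.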

โœ”๏ธ **Class Imbalance Handling**
- Applied **SMOTE (Synthetic Minority Over-sampling Technique)** to balance potable and non-potable classes
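
To illustrate what SMOTE does, here is a simplified NumPy sketch of its core idea: each synthetic row is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The function `smote_sketch` is hypothetical and is not the project's code; in practice a library implementation such as imbalanced-learn's `SMOTE` would normally be used, applied to the training split only.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority
    samples and their k nearest minority neighbours."""
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)         # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                  # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Made-up minority samples with two features
minority = np.array([[7.0, 3.9], [7.2, 4.1], [6.8, 4.0], [7.1, 3.8]])
synthetic = smote_sketch(minority, n_new=6, k=2)
print(synthetic.shape)
```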

โœ”๏ธ **Outlier Detection**
- Identified and handled outliers using visualizations and statistical techniques
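
The exact statistical technique is not named above. One common choice is the 1.5 × IQR rule (Tukey's fences), sketched below as an assumption rather than as the notebook's method. The helper name and the sample values are made up.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]. NaNs are never flagged."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up readings with one implausible value
demo = pd.Series([6.8, 7.0, 7.1, 7.2, 7.3, 14.0], name="ph")
mask = iqr_outlier_mask(demo)
print(demo[mask])  # rows flagged as outliers
```

Flagged rows can then be dropped, capped at the fences, or inspected by hand, depending on how much data one can afford to lose.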

---

## 📊 Exploratory Data Analysis
- **Visualization Tools**: Histograms, box plots, scatter plots, heatmaps
- **Key Insights**:
  - `Turbidity`, `ph`, and `Conductivity` have the most influence on potability
  - `Solids` and `Conductivity` are highly correlated
  - Certain features have skewed distributions and outliers
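
Correlation claims like the ones above are usually read off a Pearson correlation matrix, which is also the input to a heatmap. The sketch below uses synthetic columns that are deliberately constructed to be correlated or independent, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
solids = rng.normal(20000, 8000, size=500)
demo = pd.DataFrame({
    "Solids": solids,
    # Built to track Solids closely (illustration only)
    "Conductivity": 0.01 * solids + rng.normal(0, 20, size=500),
    # Generated independently of the other two columns
    "Turbidity": rng.normal(4, 0.8, size=500),
})

corr = demo.corr()  # Pearson correlation by default
print(corr.round(2))
# With seaborn installed, the same matrix can be drawn as a heatmap:
# seaborn.heatmap(corr, annot=True)
```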

---

## 🤖 Models Used

### 1๏ธโƒฃ Logistic Regression
โœ”๏ธ Simple and interpretable baseline
โŒ Less accurate on complex patterns

### 2๏ธโƒฃ Random Forest Classifier
โœ… Robust and interpretable
โœ… Handles non-linear relationships well
โŒ Slower on large datasets

### 3๏ธโƒฃ XGBoost Classifier (๐Ÿ† Best Model)
โœ… High accuracy and performance
โœ… Efficient with imbalanced datasets
โœ… Strong generalization and feature importance ranking
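
A minimal sketch of how the three model families can be trained and compared side by side. The data is synthetic (`make_classification`), the hyperparameters are placeholder assumptions rather than the tuned values from the notebook, and XGBoost is only included if the `xgboost` package is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 9 features, roughly 60/40 class split
X, y = make_classification(
    n_samples=600, n_features=9, n_informative=5,
    weights=[0.6, 0.4], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
try:
    from xgboost import XGBClassifier  # optional dependency
    models["XGBoost"] = XGBClassifier(n_estimators=200, random_state=0)
except ImportError:
    pass  # skip XGBoost when the package is not available

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```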

---

## ๐Ÿ“ˆ Model Performance

| Model | Accuracy | Precision | Recall | F1 Score |
|--------------------|----------|-----------|--------|----------|
| Logistic Regression| ~63.5% | ~59.3% | ~49.7% | ~53.9% |
| Random Forest | ~74.2% | ~70.1% | ~67.5% | ~68.8% |
| **XGBoost** | **78.6%** | **75.3%** | **73.8%** | **74.5%** |

📌 **XGBoost** performed best across all evaluation metrics.
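
For reference, the four metrics in the table are all derived from the confusion-matrix counts, with the potable class as the positive class. The sketch below uses made-up counts, and the helper names are illustrative. The last lines check that the reported XGBoost F1 score is consistent with its precision and recall.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }

# Made-up counts, not taken from the project
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)

# Consistency check against the XGBoost row of the table
xgb_f1 = f1_from(0.753, 0.738)
print(round(xgb_f1, 3))  # about 0.745, in line with the 74.5% reported
```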

---

## 🔍 Feature Importance (XGBoost)
Top contributors to potability prediction:
- `Turbidity` 🥇
- `ph`
- `Conductivity`
- `Chloramines`
- `Trihalomethanes`
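
A ranking like the one above can be read from a fitted model's `feature_importances_` attribute. The sketch below uses a scikit-learn random forest on synthetic data as a stand-in, so the printed order is meaningless for the real dataset. `XGBClassifier` exposes the same attribute, so the hypothetical `rank_features` helper would work unchanged on it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

FEATURES = [
    "ph", "Hardness", "Solids", "Chloramines", "Sulfate",
    "Conductivity", "Organic_carbon", "Trihalomethanes", "Turbidity",
]

def rank_features(model, names):
    """Return (name, importance) pairs sorted from most to least important."""
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1]
    return [(names[i], float(importances[i])) for i in order]

# Synthetic data: the column names are attached for illustration only
X, y = make_classification(n_samples=400, n_features=9, n_informative=4, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

ranking = rank_features(model, FEATURES)
for name, score in ranking[:5]:
    print(f"{name}: {score:.3f}")
```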

---

## 🚀 Future Scope
- 🔹 **Integrate IoT Sensors** for real-time data collection
- 🔹 **Add microbial and temperature data** for improved predictions
- 🔹 **Explore deep learning models** for more complex patterns
- 🔹 **Deploy as a web app** for public and government use

---

## ๐Ÿ— Project Contributions
This project includes the following key components:
โœ”๏ธ Exploratory Data Analysis
โœ”๏ธ Data Cleaning & Preprocessing
โœ”๏ธ Model Training, Tuning, and Evaluation
โœ”๏ธ Report and Code Documentation

---