https://github.com/sunnyrao07/water-quality-analysis
A machine learning project that predicts water potability based on chemical and physical attributes, using models like Logistic Regression, Random Forest, and XGBoost.
data-cleaning label-encoding logistic-regression matplotlib model-evaluation numpy pandas pyhton random-forest sckiit-learn seaborn smote standard-scaler xgboost
- Host: GitHub
- URL: https://github.com/sunnyrao07/water-quality-analysis
- Owner: SunnyRao07
- Created: 2025-04-13T13:55:53.000Z (22 days ago)
- Default Branch: main
- Last Pushed: 2025-04-13T14:05:34.000Z (22 days ago)
- Last Synced: 2025-04-13T14:41:04.595Z (22 days ago)
- Topics: data-cleaning, label-encoding, logistic-regression, matplotlib, model-evaluation, numpy, pandas, pyhton, random-forest, sckiit-learn, seaborn, smote, standard-scaler, xgboost
- Language: Jupyter Notebook
- Homepage:
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Water Potability Analysis
## Project Overview
Water quality is a pressing global issue, and access to safe drinking water is essential for health and sustainability. This project aims to **predict the potability of water** samples using **machine learning models** based on chemical and physical attributes.

Three classification models were implemented and evaluated:
- **Logistic Regression**
- **Random Forest Classifier**
- **XGBoost Classifier** (best performing)

---
## Project Resources
- **Dataset (CSV File)**: [Download water_potability.csv](https://github.com/SunnyRao07/Water-Quality-Analysis/blob/main/water_potability.csv)
- **Project Code (.ipynb)**: [View Jupyter Notebook](https://github.com/SunnyRao07/Water-Quality-Analysis/blob/main/Water_Quality_Analysis.ipynb)
- **Project Report (DOCX File)**: [Download Report](https://github.com/SunnyRao07/Water-Quality-Analysis/blob/main/Water%20Quality%20Analysis.docx)

---
## Dataset Overview
- **Total Records**: 3,276
- **Target Variable**: `Potability` (1 = Safe to drink, 0 = Unsafe)
- **Features**:
  - `ph`
  - `Hardness`
  - `Solids`
  - `Chloramines`
  - `Sulfate`
  - `Conductivity`
  - `Organic_carbon`
  - `Trihalomethanes`
  - `Turbidity`

**Class Imbalance Notice**: Only ~39.8% of the samples are potable, requiring resampling strategies like **SMOTE**.
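
A first look at the data usually checks the shape, the missing values, and the class balance described above. The sketch below does that with pandas. Because the CSV is not bundled here, it builds a random stand-in with the same column names; `make_synthetic_water_data` and every value it produces are hypothetical, not the project's data.

```python
import numpy as np
import pandas as pd

FEATURES = [
    "ph", "Hardness", "Solids", "Chloramines", "Sulfate",
    "Conductivity", "Organic_carbon", "Trihalomethanes", "Turbidity",
]


def make_synthetic_water_data(n_rows=500, seed=0):
    """Hypothetical stand-in for water_potability.csv: random values, same columns."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(rng.normal(size=(n_rows, len(FEATURES))), columns=FEATURES)
    # Blank out some cells in the three columns the README lists as incomplete.
    for col in ["ph", "Sulfate", "Trihalomethanes"]:
        df.loc[rng.random(n_rows) < 0.1, col] = np.nan
    # Draw roughly 40% positives to mimic the imbalance noted above.
    df["Potability"] = (rng.random(n_rows) < 0.4).astype(int)
    return df


# With the real file this line would be: df = pd.read_csv("water_potability.csv")
df = make_synthetic_water_data()

missing_per_column = df.isna().sum()                             # NaN count per column
class_share = df["Potability"].value_counts(normalize=True)      # fraction per class

print(df.shape)
print(missing_per_column[missing_per_column > 0])
print(class_share)
```

Running the same three inspections on the real CSV is how the record count, the incomplete columns, and the potable share quoted in this section can be checked.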
---
## Data Preprocessing
**Handling Missing Values**
- Imputed missing values in `ph`, `Sulfate`, and `Trihalomethanes` using **median imputation**

**Feature Scaling**
- Standardized numerical columns using **StandardScaler**

**Class Imbalance Handling**
- Applied **SMOTE (Synthetic Minority Over-sampling Technique)** to balance potable and non-potable classes

**Outlier Detection**
- Identified and handled outliers using visualizations and statistical techniques

---
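
The imputation, scaling, and oversampling steps can be sketched as follows. This is a minimal illustration on random placeholder data, not the notebook's code. Two things are assumptions: the ordering (split first, fit the imputer and scaler on the training split only, oversample only the training split), which the README does not state, and `smote_like_oversample`, a hypothetical, simplified stand-in for SMOTE that interpolates between minority-class neighbours. The library implementation is `SMOTE` from `imblearn.over_sampling`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def smote_like_oversample(X, y, minority=1, k=5, seed=0):
    """Simplified SMOTE-style sketch (hypothetical helper, binary labels only)."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_new = int((y != minority).sum() - len(X_min))
    if n_new <= 0:
        return X, y
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    partner = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    # Each synthetic point lies on the segment between a sample and a neighbour.
    synthetic = X_min[base] + gap * (X_min[partner] - X_min[base])
    return np.vstack([X, synthetic]), np.concatenate([y, np.full(n_new, minority)])


# Placeholder data: 9 random features with some NaNs, 40% positive labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 9))
X[rng.random(X.shape) < 0.05] = np.nan
y = rng.permutation(np.repeat([0, 1], [240, 160]))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Median imputation followed by standardization, fitted on the training split.
preprocess = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_train_scaled = preprocess.fit_transform(X_train)
X_test_scaled = preprocess.transform(X_test)

X_train_bal, y_train_bal = smote_like_oversample(X_train_scaled, y_train)
print(np.bincount(y_train), "->", np.bincount(y_train_bal))
```

Leaving the test split untouched by resampling keeps the evaluation closer to the class ratio a deployed model would see.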
## Exploratory Data Analysis
- **Visualization Tools**: Histograms, box plots, scatter plots, heatmaps
- **Key Insights**:
  - `Turbidity`, `ph`, and `Conductivity` have the most influence on potability
  - `Solids` and `Conductivity` are highly correlated
  - Certain features have skewed distributions and outliers

---
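
The correlation, skewness, and outlier checks behind insights like these can be computed directly in pandas before plotting anything. The sketch below uses made-up columns in which `Solids` is deliberately right-skewed and `Conductivity` is deliberately built from `Solids`, so the numbers only demonstrate the mechanics and say nothing about the real dataset. `iqr_outlier_counts` is a hypothetical helper using the common 1.5 × IQR rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Synthetic placeholder columns, not the project's data.
df = pd.DataFrame({
    "ph": rng.normal(7.0, 1.5, n),
    "Solids": rng.lognormal(mean=10.0, sigma=0.4, size=n),  # right-skewed on purpose
    "Turbidity": rng.normal(4.0, 0.8, n),
})
# Correlated with Solids by construction, purely for illustration.
df["Conductivity"] = 0.01 * df["Solids"] + rng.normal(0, 20, n)

corr = df.corr()        # pairwise Pearson correlations (the input to a heatmap)
skewness = df.skew()    # values far from 0 indicate a skewed distribution


def iqr_outlier_counts(frame, factor=1.5):
    """Count values outside [Q1 - factor*IQR, Q3 + factor*IQR] in each column."""
    q1, q3 = frame.quantile(0.25), frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - factor * iqr) | (frame > q3 + factor * iqr)
    return outside.sum()


outliers = iqr_outlier_counts(df)

print(corr.round(2))
print(skewness.round(2))
print(outliers)
```

Passing `corr` to `seaborn.heatmap` is one way to produce the kind of heatmap listed under the visualization tools.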
## Models Used
### 1. Logistic Regression
- ✅ Simple and interpretable baseline
- ❌ Less accurate on complex patterns

### 2. Random Forest Classifier
- ✅ Robust and interpretable
- ✅ Handles non-linear relationships well
- ❌ Slower on large datasets

### 3. XGBoost Classifier (Best Model)
- ✅ High accuracy and performance
- ✅ Efficient with imbalanced datasets
- ✅ Strong generalization and feature importance ranking

---
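
Training the three model families side by side might look like the sketch below. It is an illustration on data from scikit-learn's `make_classification`, not the notebook's code, and the hyperparameters are arbitrary assumptions. If the `xgboost` package is not installed, the sketch falls back to scikit-learn's `GradientBoostingClassifier`, which is a different boosted-tree implementation and is only a stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBClassifier
    boosted = XGBClassifier(n_estimators=100, random_state=0)
except ImportError:
    # Stand-in when xgboost is unavailable; not the model used in the project.
    boosted = GradientBoostingClassifier(n_estimators=100, random_state=0)

# Synthetic 9-feature problem with a 60/40 class split (placeholder data).
X, y = make_classification(
    n_samples=400, n_features=9, n_informative=5,
    weights=[0.6, 0.4], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Boosted trees": boosted,
}

test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)  # mean accuracy on held-out data
    print(f"{name}: {test_accuracy[name]:.3f}")
```

Keeping the models in a dictionary means every candidate is fitted and scored on exactly the same split, which makes the comparison fair.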
## Model Performance
| Model               | Accuracy  | Precision | Recall    | F1 Score  |
|---------------------|-----------|-----------|-----------|-----------|
| Logistic Regression | ~63.5%    | ~59.3%    | ~49.7%    | ~53.9%    |
| Random Forest       | ~74.2%    | ~70.1%    | ~67.5%    | ~68.8%    |
| **XGBoost**         | **78.6%** | **75.3%** | **73.8%** | **74.5%** |

**XGBoost** performed best across all evaluation metrics.
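
For reference, the four metrics in the table are all derived from the confusion-matrix counts of a binary classifier. The sketch below computes them by hand on a tiny made-up label vector; `binary_metrics` is a hypothetical helper and the toy labels are not project results.

```python
import numpy as np


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1} (1 = potable)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # potable, predicted potable
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # unsafe, predicted unsafe
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # unsafe, predicted potable
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # potable, predicted unsafe

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In practice `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` from `sklearn.metrics` compute the same quantities. For a potability classifier, a false positive (unsafe water labelled safe) is the costlier mistake, so precision on the potable class deserves particular attention.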
---
## Feature Importance (XGBoost)
Top contributors to potability prediction:
- `Turbidity`
- `ph`
- `Conductivity`
- `Chloramines`
- `Trihalomethanes`

---
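
A ranking like the one above is typically read from a fitted tree ensemble's `feature_importances_` attribute. The sketch below shows the mechanics with a scikit-learn `RandomForestClassifier` as a stand-in (XGBoost's `XGBClassifier` exposes an attribute of the same name). The data is synthetic and the column names are attached only for readability, so the printed order is meaningless and does not reproduce the README's ranking.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

FEATURES = [
    "ph", "Hardness", "Solids", "Chloramines", "Sulfate",
    "Conductivity", "Organic_carbon", "Trihalomethanes", "Turbidity",
]

# Synthetic data: the real feature names are only labels here.
X, y = make_classification(
    n_samples=400, n_features=len(FEATURES), n_informative=4, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances are non-negative and sum to 1; sort to obtain a ranking.
ranking = (
    pd.Series(model.feature_importances_, index=FEATURES)
    .sort_values(ascending=False)
)
print(ranking.head(5))
```

Impurity-based importances can favour features with many distinct values, so a check such as `sklearn.inspection.permutation_importance` on held-out data is a reasonable complement.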
## Future Scope
- **Integrate IoT Sensors** for real-time data collection
- **Add microbial and temperature data** for improved predictions
- **Explore deep learning models** for more complex patterns
- **Deploy as a web app** for public and government use

---
## Project Contributions
This project includes the following key components:
- Exploratory Data Analysis
- Data Cleaning & Preprocessing
- Model Training, Tuning, and Evaluation
- Report and Code Documentation

---