https://github.com/sunnyrao07/water-quality-analysis

A machine learning project that predicts water potability based on chemical and physical attributes, using models like Logistic Regression, Random Forest, and XGBoost.

data-cleaning label-encoding logistic-regression matplotlib model-evaluation numpy pandas pyhton random-forest sckiit-learn seaborn smote standard-scaler xgboost

Host: GitHub
URL: https://github.com/sunnyrao07/water-quality-analysis
Owner: SunnyRao07
Created: 2025-04-13T13:55:53.000Z (22 days ago)
Default Branch: main
Last Pushed: 2025-04-13T14:05:34.000Z (22 days ago)
Last Synced: 2025-04-13T14:41:04.595Z (22 days ago)
Topics: data-cleaning, label-encoding, logistic-regression, matplotlib, model-evaluation, numpy, pandas, pyhton, random-forest, sckiit-learn, seaborn, smote, standard-scaler, xgboost
Language: Jupyter Notebook
Homepage:
Size: 0 Bytes
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# 💧 Water Potability Analysis

## 📌 Project Overview
Water quality is a pressing global issue, and access to safe drinking water is essential for health and sustainability. This project aims to **predict the potability of water** samples using **machine learning models** based on chemical and physical attributes.

Three classification models were implemented and evaluated:
- **Logistic Regression**
- **Random Forest Classifier**
- **XGBoost Classifier** (🏆 Best Performing)

---

## 📂 Project Resources
- 🔹 **Dataset (CSV File)**: [Download water_potability.csv](https://github.com/SunnyRao07/Water-Quality-Analysis/blob/main/water_potability.csv)
- 🔹 **Project Code (.ipynb)**: [View Jupyter Notebook](https://github.com/SunnyRao07/Water-Quality-Analysis/blob/main/Water_Quality_Analysis.ipynb)
- 🔹 **Project Report (DOCX File)**: [Download Report](https://github.com/SunnyRao07/Water-Quality-Analysis/blob/main/Water%20Quality%20Analysis.docx)

---

## 🧾 Dataset Overview
- **Total Records**: 3,276
- **Target Variable**: `Potability` (1 = Safe to drink, 0 = Unsafe)
- **Features**:
  - `ph`
  - `Hardness`
  - `Solids`
  - `Chloramines`
  - `Sulfate`
  - `Conductivity`
  - `Organic_carbon`
  - `Trihalomethanes`
  - `Turbidity`

📝 **Class Imbalance Notice**: Only ~39% of the samples are potable (roughly 61% are not), so the classes are rebalanced with a resampling strategy, **SMOTE**, before training.
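
The class share can be checked in a few lines of pandas. This is a minimal sketch: the inline frame is made-up stand-in data, not the real file. With the actual dataset one would presumably start from `pd.read_csv("water_potability.csv")` instead.

```python
import pandas as pd

# Made-up stand-in for the `Potability` column (not the real data).
df = pd.DataFrame({"Potability": [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]})

# normalize=True returns fractions per class instead of raw counts.
class_share = df["Potability"].value_counts(normalize=True).sort_index()
potable_share = class_share.loc[1]

print(class_share)
```

A potable share well below 0.5 is the signal that plain accuracy can be misleading and that resampling or class weighting is worth considering.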

---

## 🛠 Data Preprocessing
✔️ **Handling Missing Values**
- Imputed missing values in `ph`, `Sulfate`, and `Trihalomethanes` using **median imputation**
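
Median imputation could look like the sketch below. The rows are invented for illustration; only the three column names come from this README.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in the three columns named above.
df = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.1],
    "Sulfate": [330.0, 350.0, np.nan, 310.0],
    "Trihalomethanes": [60.0, 70.0, 80.0, np.nan],
})

cols = ["ph", "Sulfate", "Trihalomethanes"]
# median() skips NaN by default; fillna() then fills each column
# with its own median.
medians = df[cols].median()
df[cols] = df[cols].fillna(medians)
```

When a train/test split is used, computing the medians on the training portion only avoids leaking information from the test set.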

✔️ **Feature Scaling**
- Standardized numerical columns using **StandardScaler**
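
A small sketch of standardization with scikit-learn's `StandardScaler`, on invented numbers. Fitting on the training split and reusing those statistics on the test split is an assumption about good practice here, not a description of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented values: two features on very different scales.
X_train = np.array([[100.0, 1.0], [200.0, 2.0], [300.0, 3.0]])
X_test = np.array([[200.0, 4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from train
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
```

After scaling, each training column has mean 0 and unit variance, which matters mostly for Logistic Regression; tree-based models are largely insensitive to feature scale.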

✔️ **Class Imbalance Handling**
- Applied **SMOTE (Synthetic Minority Over-sampling Technique)** to balance potable and non-potable classes
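
The idea behind SMOTE can be illustrated with NumPy alone: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority-class neighbours. The function `smote_sketch` below is a hypothetical, simplified illustration, not the implementation used in the notebook; in practice the `SMOTE` class from the imbalanced-learn package (`imblearn.over_sampling.SMOTE`) is the usual choice.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                 # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Four invented minority samples at the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=6)
```

Oversampling is normally applied to the training split only, so that the test set keeps the original class distribution.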

✔️ **Outlier Detection**
- Identified and handled outliers using visualizations and statistical techniques
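
The README does not say which statistical technique was used. As one common possibility, the sketch below applies the 1.5 × IQR rule to an invented `Solids` series and clips values outside the fences; both the rule and the clipping step are assumptions made for illustration.

```python
import pandas as pd

# Invented values with one obvious extreme.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0], name="Solids")

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outlier_mask = (s < lower) | (s > upper)  # True where a value is outside the fences
clipped = s.clip(lower, upper)            # cap extremes instead of dropping rows
```

Clipping keeps every row, which helps on a dataset of only a few thousand samples; dropping flagged rows is the alternative.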

---

## 📊 Exploratory Data Analysis
- **Visualization Tools**: Histograms, box plots, scatter plots, heatmaps
- **Key Insights**:
  - `Turbidity`, `ph`, and `Conductivity` have the most influence on potability
  - `Solids` and `Conductivity` are highly correlated
  - Certain features have skewed distributions and outliers
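
Pairwise correlations of the kind listed above are typically read off a correlation matrix. The sketch below builds one with pandas on synthetic data in which `Conductivity` is tied to `Solids` by construction, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
solids = rng.normal(20000, 8000, size=200)

# Synthetic columns; the Solids/Conductivity link is built in on purpose.
df = pd.DataFrame({
    "Solids": solids,
    "Conductivity": 0.01 * solids + rng.normal(0, 20, size=200),
    "Turbidity": rng.normal(4, 0.8, size=200),
})

corr = df.corr()  # Pearson correlation by default
print(corr.round(2))
```

Passing `corr` to `seaborn.heatmap(corr, annot=True)` gives the heatmap view mentioned under Visualization Tools.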

---

## 🤖 Models Used

### 1️⃣ Logistic Regression
- ✅ Simple and interpretable baseline
- ❌ Less accurate on complex, non-linear patterns

### 2️⃣ Random Forest Classifier
- ✅ Robust, and reasonably interpretable through feature importances
- ✅ Handles non-linear relationships well
- ❌ Slower to train and predict on large datasets

### 3️⃣ XGBoost Classifier (🏆 Best Model)
- ✅ High accuracy and performance
- ✅ Efficient with imbalanced datasets
- ✅ Strong generalization and feature importance ranking
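
A rough sketch of how such a comparison can be set up with scikit-learn. The data is synthetic (`make_classification` with nine features and a roughly 61/39 class split), and the hyperparameters are placeholder assumptions rather than the notebook's settings. XGBoost is left out so the sketch needs only scikit-learn; `xgboost.XGBClassifier` exposes the same `fit`/`predict` interface and could be added to the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 9 features, roughly 61/39 class split.
X, y = make_classification(n_samples=600, n_features=9, n_informative=5,
                           weights=[0.61, 0.39], random_state=42)

# Stratify so both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data

print(scores)
```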

---

## 📈 Model Performance

| Model | Accuracy | Precision | Recall | F1 Score |
|--------------------|----------|-----------|--------|----------|
| Logistic Regression| ~63.5% | ~59.3% | ~49.7% | ~53.9% |
| Random Forest | ~74.2% | ~70.1% | ~67.5% | ~68.8% |
| **XGBoost** | **78.6%** | **75.3%** | **73.8%** | **74.5%** |

📌 **XGBoost** performed best across all evaluation metrics.
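
As a quick consistency check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The snippet uses the precision and recall figures from the table above; the helper name `f1_from` is made up for this example.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) in percent, copied from the table.
reported = {
    "Logistic Regression": (59.3, 49.7),
    "Random Forest": (70.1, 67.5),
    "XGBoost": (75.3, 73.8),
}

f1 = {name: round(f1_from(p, r), 1) for name, (p, r) in reported.items()}
print(f1)
```

Random Forest and XGBoost reproduce the table (68.8 and 74.5). Logistic Regression comes out at 54.1 rather than ~53.9, a small gap that may simply reflect the approximate (~) figures in the table.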

---

## 🔍 Feature Importance (XGBoost)
Top contributors to potability prediction:
- `Turbidity` 🥇
- `ph`
- `Conductivity`
- `Chloramines`
- `Trihalomethanes`
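
A ranking like this is usually read from the fitted model's `feature_importances_` attribute. The sketch below swaps in scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost and trains on synthetic data, so the resulting order is meaningless for real water samples; only the feature names are taken from this README.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

features = ["ph", "Hardness", "Solids", "Chloramines", "Sulfate",
            "Conductivity", "Organic_carbon", "Trihalomethanes", "Turbidity"]

# Synthetic data; the columns are only labelled with the real feature names.
X, y = make_classification(n_samples=400, n_features=9, n_informative=4,
                           random_state=0)

# Stand-in for XGBoost: any fitted tree ensemble exposes feature_importances_.
model = GradientBoostingClassifier(random_state=0).fit(X, y)

importances = pd.Series(model.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)
top5 = importances.head(5)
print(top5)
```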

---

## 🚀 Future Scope
- 🔹 **Integrate IoT Sensors** for real-time data collection
- 🔹 **Add microbial and temperature data** for improved predictions
- 🔹 **Explore deep learning models** for more complex patterns
- 🔹 **Deploy as a web app** for public and government use

---

## 🏗 Project Contributions
This project includes the following key components:
- ✔️ Exploratory Data Analysis
- ✔️ Data Cleaning & Preprocessing
- ✔️ Model Training, Tuning, and Evaluation
- ✔️ Report and Code Documentation

---
