{"id":25065673,"url":"https://github.com/programming-sai/water-quality-analysis-ml","last_synced_at":"2026-05-06T10:32:34.099Z","repository":{"id":270287731,"uuid":"909899119","full_name":"Programming-Sai/Water-Quality-Analysis-ML","owner":"Programming-Sai","description":"A ML-powered system to analyze and predict water quality from a dataset of water attributes.","archived":false,"fork":false,"pushed_at":"2025-01-27T21:19:41.000Z","size":4563,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-06T19:49:00.321Z","etag":null,"topics":["group","jupyter-notebook","ml","project-work","streamlit"],"latest_commit_sha":null,"homepage":"https://water-quality-analysis-ml.streamlit.app/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Programming-Sai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-30T02:31:15.000Z","updated_at":"2025-01-27T21:19:44.000Z","dependencies_parsed_at":"2024-12-30T03:23:59.724Z","dependency_job_id":"a298dfb6-cd0c-44ea-a425-53ce81d29e38","html_url":"https://github.com/Programming-Sai/Water-Quality-Analysis-ML","commit_stats":null,"previous_names":["programming-sai/water-quality-analysis-ml"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Programming-Sai%2FWater-Quality-Analysis-ML","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Programming-Sai%2FWater-Quality-Analysis-ML/tags","releases_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Programming-Sai%2FWater-Quality-Analysis-ML/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Programming-Sai%2FWater-Quality-Analysis-ML/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Programming-Sai","download_url":"https://codeload.github.com/Programming-Sai/Water-Quality-Analysis-ML/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246473299,"owners_count":20783244,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["group","jupyter-notebook","ml","project-work","streamlit"],"created_at":"2025-02-06T19:44:43.960Z","updated_at":"2026-05-06T10:32:29.061Z","avatar_url":"https://github.com/Programming-Sai.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Water-Quality-Analysis-ML\n\n---\n\n## Project Overview\n\nThis project aims to predict water quality using a dataset of various physicochemical, socio-economic, and environmental factors. By leveraging machine learning models, we classify water samples as **Clean** or **Dirty** based on their attributes. 
The final model is deployed using a **Streamlit-based web app**, providing an interactive UI for predictions.\n\n![Deployed Application](./assets/home.png)\n\n**Deployed Application**: [Water Quality Analysis App](https://water-quality-analysis-ml.streamlit.app/)\n\n## Dataset\n\n- **Source:** [Kaggle - Water Quality Dataset](https://www.kaggle.com/datasets/ozgurdogan646/water-quality-dataset)\n- **Description:** The dataset contains measurements such as population density, waste management indices, development index, and other features used to assess water quality.\n\n## Goals\n\n1. Explore the dataset and perform data visualization.\n2. Preprocess the data by handling missing values, scaling features, and engineering new features.\n3. Train machine learning models to predict water quality.\n4. Evaluate and compare model performance.\n5. Deploy the final model using a **Streamlit-based web app**.\n\n### Features\n\n- **Population Density**: Estimated as the number of people per unit area in a given region.\n- **Waste Index**: A derived feature that measures waste composition against recycling rates.\n- **Development Index**: Calculated using GDP and literacy rate, reflecting socio-economic factors affecting water quality.\n\n#### Data Preprocessing\n\nThe preprocessing steps include:\n\n1. Handling missing values and normalizing data where necessary.\n2. Creating derived features such as the **Waste Index** and **Development Index** using the following formulas:\n\n\u003cbr\u003e\n\n$$\nWaste\\ Index = \\frac{(Max\\ Waste\\ Composition + Other\\ Composition)}{Recycling\\ Percentage}\n$$\n\n\u003cbr\u003e\n\n$$\nDevelopment\\ Index = \\text{GDP} \\times \\text{Literacy Rate}\n$$\n\n\u003cbr\u003e\n\n3. Dropping unnecessary columns such as 'Country', 'GDP', and 'TouristMean' to streamline the data.\n4. 
Feature engineering to ensure all relevant attributes are used for training.\n\n---\n\n## Model Documentation\n\n\u003cp align='center'\u003e\u003cb\u003eModel Diagrams\u003c/b\u003e\u003c/p\u003e\n\n\u003cbr\u003e\n\u003cp align='center'\u003e\n\u003cimg src='./assets/corr-map-for-entire-dataset.png' width='400' hspace=10 vspace=10\u003e\n\u003cimg src='./assets/correlation-per-preprocessed.png' width='400' hspace=10 vspace=10\u003e\n\u003cimg src='./assets/confusion-matrix-per-model.png' width='400' hspace=10 vspace=10\u003e\n\u003cimg src='./assets/feature-importance.png' width='400' hspace=10 vspace=10\u003e\n\u003cimg src='./assets/model-comparison.png' width='400' hspace=10 vspace=10\u003e\n\u003cimg src='./assets/roc-curve-per-model.png' width='400' hspace=10 vspace=10\u003e\n\n\u003cbr\u003e\n\u003c/p\u003e\n\u003cbr\u003e\n\n### Overview\n\nThe final model used in this project is an **XGBoost Classifier**, selected for its superior performance in handling tabular data with complex relationships. Below are the details about the model and its performance:\n\n---\n\n### **Model Selection Process**\n\n1. **Candidate Models**:\n\n   - AdaBoostClassifier\n   - RandomForestClassifier\n   - SVC (Support Vector Classifier)\n   - KNeighborsClassifier\n   - Naive Bayes Classifier (GaussianNB)\n   - **XGBClassifier** (eXtreme Gradient Boost Classifier)\n\n2. **Evaluation Criteria**:\n\n   - Accuracy\n   - Precision, Recall, and F1-Score\n   - Area Under the ROC Curve (AUC-ROC)\n   - Model interpretability and feature importance\n\n3. **Reason for Choosing XGBoost**:\n   - High predictive accuracy on tabular data.\n   - Handles missing values effectively.\n   - Built-in feature importance metrics.\n   - Optimized for speed and performance.\n\n---\n\n### **Hyperparameter Tuning**\n\nThe model's hyperparameters were tuned using **Grid Search** and **Cross-Validation** to optimize performance. 
Below are the final hyperparameter settings, each followed by the values that were searched:\n\n- `learning_rate`: 0.1 --\u003e [0.01, 0.05, 0.1]\n- `max_depth`: 5 --\u003e [3, 5, 7]\n- `n_estimators`: 200 --\u003e [100, 200, 300]\n- `subsample`: 0.8 --\u003e [0.8, 0.9, 1.0]\n\n---\n\n### **Training and Evaluation**\n\n1. **Training Dataset**:\n\n   - The dataset was split into **80% training** and **20% testing** sets.\n   - Cross-validation (5-fold) was used to validate model performance during training.\n\n2. **Performance Metrics**:\n\n   - **Precision**: 99.11%\n   - **Recall**: 49.01%\n   - **F1-Score**: 65.58%\n   - **Cross-Validation Score**: 67.05%\n\n3. **Confusion Matrix**:\n\n$$\\begin{bmatrix} 53 \u0026 26 \\\\\\ 3003 \u0026 2886 \\end{bmatrix}$$\n\n\u003cbr\u003e\n\u003cbr\u003e\n\nRows are actual classes and columns are predicted classes, both in the order Clean, Dirty. The reported precision and recall correspond to treating **Dirty** as the positive class:\n\n- **True Positive (TP)**: 2886 (Predicted Dirty and actually Dirty)\n- **False Negative (FN)**: 3003 (Predicted Clean but actually Dirty)\n- **False Positive (FP)**: 26 (Predicted Dirty but actually Clean)\n- **True Negative (TN)**: 53 (Predicted Clean and actually Clean)\n\nThis gives precision = 2886 / (2886 + 26) ≈ 99.11% and recall = 2886 / (2886 + 3003) ≈ 49.01%.\n\n---\n\n### **Feature Importance**\n\nThe XGBoost model provides insights into feature importance based on the number of splits a feature contributes to the decision tree ensemble. Below are the top features influencing predictions:\n\n| Feature            | Importance (%) |\n| ------------------ | -------------- |\n| Development Index  | 59.78          |\n| Waste Index        | 32.65          |\n| Population Density | 7.55           |\n\n---\n\n### **Model Limitations**\n\nWhile XGBoost performed best overall, the following limitations were observed:\n\n- Requires significant computational resources for training on large datasets.\n- May overfit on smaller datasets without proper regularization.\n\n---\n\n### **Future Improvements**\n\n1. Experiment with ensemble models combining XGBoost with other classifiers for improved performance.\n2. Explore more advanced hyperparameter tuning methods such as Bayesian Optimization.\n3. 
Investigate additional derived features to enhance predictive accuracy.\n\n---\n\n## Installation\n\n1. Clone the repository:\n\n   ```bash\n   git clone https://github.com/Programming-Sai/Water-Quality-Analysis-ML.git\n   cd Water-Quality-Analysis-ML\n   ```\n\n2. Set up a virtual environment:\n   ```bash\n   python -m venv venv\n   source venv/bin/activate  # On Windows: venv\\Scripts\\activate\n   ```\n3. Install dependencies:\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n## Folder Structure\n\n```\nWater-Quality-Analysis-ML/\n│\n├── data/\n│   ├── water_quality.csv        # The dataset\n│   └── processed_data.csv       # The processed dataset\n│\n├── notebooks/\n│   ├── data_exploration.ipynb   # For data exploration and visualization\n│   └── model_building.ipynb     # For training the model\n│\n├── deployment-ui/\n│   ├── data/\n│   │   ├── fav.ico\n│   │   └── XGBoost_model.joblib\n│   │\n│   ├── app.py                   # Streamlit app for deployment\n│   └── assets/                  # Custom styles and images\n│\n├── requirements.txt             # Python dependencies\n├── .gitignore                   # Python gitignore\n└── README.md                    # Project overview\n```\n\n## Usage\n\n1. Explore the dataset:\n   - Open and run `notebooks/data_exploration.ipynb` to understand the data and visualize distributions.\n2. Train the model:\n   - Use `notebooks/model_building.ipynb` or `src/train.py` to train and evaluate the machine learning models.\n3. Predict water quality:\n   - Use `src/predict.py` to make predictions on new data.\n4. 
Deploy the model:\n   - Run the Streamlit app locally:\n     ```bash\n     streamlit run deployment-ui/app.py\n     ```\n   - Alternatively, use the deployed version: [Water Quality Analysis App](https://water-quality-analysis-ml.streamlit.app/).\n\n## Key Libraries\n\n- **pandas**: Data manipulation\n- **numpy**: Numerical computations\n- **scikit-learn**: Machine learning models\n- **xgboost**: Gradient boosting for tabular data\n- **matplotlib \u0026 seaborn**: Data visualization\n- **Streamlit**: Model deployment and interactive UI\n\n---\n\n## Model Performance\n\nBelow are the performance metrics and comparison of models tested during development.\n\n**Accuracy Comparison**  \n(Insert image of accuracy comparison between different models here)\n\n**Confusion Matrix for XGBoost**  \n(Insert confusion matrix image for the final XGBoost model here)\n\n---\n\n## Contributors\n\n- [Programming-Sai](https://github.com/Programming-Sai)\n- [Pope-Addotey2004](https://github.com/Pope-Addotey2004)\n- [BrytSnow](https://github.com/BrytSnow)\n\n---\n\n### **Steps to Set It Up**\n\n1. Create a new repository on GitHub with the name `Water-Quality-Analysis-ML`.\n2. Initialize your project folder locally and link it to the GitHub repo:\n   ```bash\n   git init\n   git remote add origin https://github.com/Programming-Sai/Water-Quality-Analysis-ML.git\n   git branch -M main\n   git add .\n   git commit -m \"Initial commit\"\n   git push -u origin main\n   ```\n\n---\n\n## Application Features\n\n1. **Interactive UI**: Allows users to input values for Population Density, Waste Index, and Development Index.\n2. **Tooltips for Guidance**: Each input field includes descriptions and formulas to assist users in understanding the required values.\n3. 
**Real-Time Predictions**: Displays the prediction result (Clean or Dirty) with color-coded feedback.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprogramming-sai%2Fwater-quality-analysis-ml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprogramming-sai%2Fwater-quality-analysis-ml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprogramming-sai%2Fwater-quality-analysis-ml/lists"}