https://github.com/chanmeng666/water-quality-testing-data-analysis
Statistical analysis and predictive modeling of water quality parameters using Python, pandas, scikit-learn, and statsmodels
https://github.com/chanmeng666/water-quality-testing-data-analysis
data-analysis data-science data-visualization environmental-monitoring jupyter-notebook machine-learning pandas python scikit-learn seaborn statistics water-quality
Last synced: about 2 months ago
JSON representation
Statistical analysis and predictive modeling of water quality parameters using Python, pandas, scikit-learn, and statsmodels
- Host: GitHub
- URL: https://github.com/chanmeng666/water-quality-testing-data-analysis
- Owner: ChanMeng666
- License: mit
- Created: 2024-05-29T21:56:03.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2026-04-07T12:17:16.000Z (about 2 months ago)
- Last Synced: 2026-04-07T12:23:31.946Z (about 2 months ago)
- Topics: data-analysis, data-science, data-visualization, environmental-monitoring, jupyter-notebook, machine-learning, pandas, python, scikit-learn, seaborn, statistics, water-quality
- Language: Jupyter Notebook
- Homepage: https://github.com/ChanMeng666/water-quality-testing-data-analysis/blob/main/notebooks/water_quality_analysis.ipynb
- Size: 1.3 MB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
README
# Water Quality Testing Data Analysis
Statistical analysis and predictive modeling of water quality parameters using Python.
## Overview
This project analyzes a dataset of 500 water samples across five quality parameters to explore relationships between water quality indicators and build predictive models for conductivity.
**Key findings:**
- Strong positive correlation (r = 0.705) between pH and dissolved oxygen levels
- Multi-parameter linear regression model predicts conductivity from pH, temperature, turbidity, and dissolved oxygen
- OLS regression confirms statistically significant relationships between several parameter pairs (p < 0.05)
## Dataset
The dataset (`data/water_quality_testing.csv`) contains 500 samples with the following parameters:
| Parameter | Unit | Range |
|---|---|---|
| pH | pH units | 6.83 - 7.48 |
| Temperature | °C | 20.3 - 23.6 |
| Turbidity | NTU | 3.1 - 5.1 |
| Dissolved Oxygen | mg/L | 6.0 - 9.9 |
| Conductivity | µS/cm | 316 - 370 |
## Project Structure
```
water-quality-testing-data-analysis/
├── data/
│ └── water_quality_testing.csv # Water quality dataset (500 samples)
├── notebooks/
│ └── water_quality_analysis.ipynb # Main analysis notebook
├── .gitignore
├── CODE_OF_CONDUCT.md
├── LICENSE
├── README.md
└── requirements.txt
```
## Getting Started
### Prerequisites
- Python 3.8+
- pip
### Installation
```bash
git clone https://github.com/ChanMeng666/water-quality-testing-data-analysis.git
cd water-quality-testing-data-analysis
pip install -r requirements.txt
```
### Usage
```bash
jupyter notebook notebooks/water_quality_analysis.ipynb
```
Run all cells (Kernel > Restart & Run All) to reproduce the full analysis.
## Analysis Contents
The notebook covers the following topics:
1. **Data Loading and Inspection** - Load dataset, examine structure and summary statistics
2. **Distribution Analysis** - Histograms with KDE for all parameters
3. **Correlation Analysis** - Correlation matrix heatmap and pair plots
4. **pH vs Dissolved Oxygen** - Deep dive into the strongest correlation
5. **Parameter Relationships** - Regression plots for multiple parameter pairs
6. **Predictive Modeling** - Linear regression for conductivity prediction (two-feature and multi-parameter models)
7. **Statistical Modeling (OLS)** - Ordinary least squares regression with statsmodels for statistical inference
8. **Conclusions** - Summary of key findings
## Built With
- [pandas](https://pandas.pydata.org/) - Data manipulation and analysis
- [NumPy](https://numpy.org/) - Numerical computing
- [Matplotlib](https://matplotlib.org/) - Static plotting and visualization
- [seaborn](https://seaborn.pydata.org/) - Statistical data visualization
- [scikit-learn](https://scikit-learn.org/) - Linear regression modeling
- [statsmodels](https://www.statsmodels.org/) - OLS regression and statistical testing
## License
This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.
## Author
**Chan Meng** - [GitHub](https://github.com/ChanMeng666) · [LinkedIn](https://www.linkedin.com/in/chanmeng666/) · [Website](https://chanmeng.org/)