https://github.com/leabrodyheine/water-pump-status-prediction

This project implements machine learning models to predict the status of water pumps in Tanzania using data from DrivenData's competition. The project includes preprocessing steps, model evaluation using cross-validation, and hyperparameter optimization with Optuna.


Host: GitHub
URL: https://github.com/leabrodyheine/water-pump-status-prediction
Owner: leabrodyheine
Created: 2024-05-19T03:02:49.000Z (12 months ago)
Default Branch: main
Last Pushed: 2024-05-19T03:18:41.000Z (12 months ago)
Last Synced: 2025-01-11T23:13:32.500Z (4 months ago)
Topics: argparse, cross-validation, gradient-boosting-classifier, logistic-regression, machine-learning, multilayer-perceptron, numpy, optuna, pandas, random-forest-classifier, scikit-learn
Language: Python
Homepage:
Size: 9.77 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Water-Pump-Status-Prediction

## Libraries/Imports Used
- optuna
- scikit-learn
- pandas
- NumPy
- subprocess
- argparse
- os

## Project Structure
- **part1.py**: This script trains and evaluates various machine learning models with different preprocessing techniques using cross-validation.
- **part2.py**: This script performs hyperparameter optimization using Optuna.
- **TestAll.py**: This script automates the process of running `part1.py` with different combinations of preprocessing and model types.

## Usage

### Running Part 1
To evaluate the models with different preprocessing techniques and model types, you can manually run `part1.py` with the desired arguments. However, it's more efficient to use `TestAll.py` to automate this process.

**Example Command:**
python part1.py ` `
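
The actual arguments accepted by `part1.py` are not shown here. As a rough sketch only, a command-line interface built with `argparse` for choosing the preprocessing and model type could look like the following. The flag names (`--encoder`, `--scaler`, `--model`) and their choices are hypothetical and may not match the real script.

```python
import argparse

# Hypothetical option values; the real part1.py may use different names.
ENCODERS = ["onehot", "ordinal", "target"]
SCALERS = ["none", "standard"]
MODELS = ["logistic", "random_forest", "gradient_boosting",
          "hist_gradient_boosting", "mlp"]


def build_parser():
    """Build a parser for the (assumed) preprocessing and model options."""
    parser = argparse.ArgumentParser(
        description="Train and cross-validate one preprocessing/model combination."
    )
    parser.add_argument("--encoder", choices=ENCODERS, default="onehot",
                        help="categorical encoding to apply")
    parser.add_argument("--scaler", choices=SCALERS, default="none",
                        help="numerical scaling to apply")
    parser.add_argument("--model", choices=MODELS, default="logistic",
                        help="model family to evaluate")
    return parser


def parse_options(argv=None):
    """Parse argv (or sys.argv[1:] when argv is None) into a namespace."""
    return build_parser().parse_args(argv)
```

With an interface like this, a call such as `parse_options(["--encoder", "target", "--model", "mlp"])` yields a namespace whose `scaler` field falls back to its default, `"none"`.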

**Automate Part 1 with `TestAll.py`:**
The `TestAll.py` script loops through every combination of the command-line options for `part1.py`, so you do not have to type a separate command for each preprocessing and model variation. Run this file in your IDE.
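
To illustrate the idea, here is a minimal sketch of such a driver using `itertools.product` and `subprocess`. It is not the contents of `TestAll.py`: the flag names and option values are assumptions made for the example.

```python
import itertools
import subprocess
import sys

# Hypothetical option values; the real scripts may name these differently.
ENCODERS = ["onehot", "ordinal", "target"]
SCALERS = ["none", "standard"]
MODELS = ["logistic", "random_forest", "gradient_boosting",
          "hist_gradient_boosting", "mlp"]


def build_commands(script="part1.py"):
    """Return one command (as an argument list) per option combination."""
    commands = []
    for encoder, scaler, model in itertools.product(ENCODERS, SCALERS, MODELS):
        commands.append([
            sys.executable, script,
            "--encoder", encoder,   # assumed flag name
            "--scaler", scaler,     # assumed flag name
            "--model", model,       # assumed flag name
        ])
    return commands


def run_all(script="part1.py"):
    """Run every combination in sequence, stopping at the first failure."""
    for command in build_commands(script):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)
```

Calling `run_all()` would launch the script once per combination (3 encoders × 2 scalers × 5 models = 30 runs under these assumptions). Building the commands separately from running them makes it easy to inspect the list first.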

### Running Part 2
For hyperparameter optimization, run the `part2.py` script. It uses Optuna to find the best hyperparameters for different classifiers.

## Data
The dataset for this project is sourced from the DrivenData competition Pump it Up: Data Mining the Water Table. It consists of training data (input features and labels) and test data (input features only).

## Preprocessing
Preprocessing covers encoding categorical features, handling missing values, scaling numerical values, and processing datetime features. For categorical features, I use three types of encoding:
- OneHotEncoder
- OrdinalEncoder
- TargetEncoder

For numerical features, I consider two options: no scaling and StandardScaler.

## Machine Learning Models
I evaluate the performance of five families of machine learning models:
- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier
- Histogram-based Gradient Boosting Classifier
- Multi-layer Perceptron Classifier

## Hyperparameter Optimization
The project uses Optuna for hyperparameter optimization. The optimization process includes defining the configuration space and evaluating the performance using cross-validation.
