https://github.com/nickssilver/irismlpipeline
An example of a machine learning pipeline using the Iris dataset
- Host: GitHub
- URL: https://github.com/nickssilver/irismlpipeline
- Owner: nickssilver
- License: mit
- Created: 2023-04-19T20:52:27.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-04-19T21:14:42.000Z (over 2 years ago)
- Last Synced: 2025-01-22T03:17:43.401Z (12 months ago)
- Language: Python
- Size: 9.77 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Machine Learning Pipeline Example
This repository demonstrates a machine learning pipeline built with scikit-learn around a random forest classifier. The pipeline covers data preprocessing, model training, and evaluation. The example draws on the following sources:
- [turing.com: Building an ML Pipeline in Python with scikit-learn](https://www.turing.com/kb/building-ml-pipeline-in-python-with-scikit-learn)
- [freecodecamp.org: Machine Learning Pipeline](https://www.freecodecamp.org/news/machine-learning-pipeline/)
- [towardsdatascience.com: Building a Machine Learning Pipeline](https://towardsdatascience.com/building-a-machine-learning-pipeline-3bba20c2352b)
- [analyticsvidhya.com: Build your first Machine Learning Pipeline using scikit-learn](https://www.analyticsvidhya.com/blog/2020/01/build-your-first-machine-learning-pipeline-using-scikit-learn/)
## Getting Started
1. Clone this repository to your local machine.
2. Install the required packages: `pip install scikit-learn pandas numpy`
3. Run the pipeline script: `python src/pipeline.py`
## Pipeline Design
The pipeline is designed in three stages:
1. **Data preprocessing:** The dataset is cleaned by dropping unnecessary columns, filling missing values, and encoding categorical features.
2. **Model training:** The preprocessed data is split into training and testing sets, and a random forest classifier is trained on the training set.
3. **Model evaluation:** The trained model is evaluated on the testing set to measure its performance.
## Data Preprocessing
The data preprocessing stage includes the following steps:
- Dropping unused columns: `df.drop(['record_id', 'casual', 'registered', 'datetime', 'temp'], axis=1, inplace=True)`
- Creating pipelines for numerical and categorical features using `Pipeline(steps=[('step name', transform function), …])`
- Filling missing values with `SimpleImputer`
- Scaling numerical features with `MinMaxScaler`
- Encoding categorical features with `OneHotEncoder(handle_unknown='ignore')`
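The steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the toy DataFrame, the column names `humidity`, `windspeed`, and `season`, the imputation strategies, and the use of `ColumnTransformer` to combine the two pipelines are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "humidity": [0.81, np.nan, 0.75, 0.60],
    "windspeed": [0.0, 6.0, np.nan, 12.0],
    "season": ["spring", "summer", np.nan, "spring"],
})

# Drop columns the model should not see.
df = df.drop(columns=["record_id"])

numeric_features = ["humidity", "windspeed"]
categorical_features = ["season"]

# Numerical features: fill gaps with the column mean, then scale to [0, 1].
numeric_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
])

# Categorical features: fill gaps with the most frequent value, then one-hot encode.
categorical_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns through its own pipeline.
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])

features = preprocessor.fit_transform(df)
print(features.shape)
```

Setting `handle_unknown='ignore'` means a category that appears only at prediction time is encoded as all zeros instead of raising an error.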
## Model Training
In this stage, the preprocessed data is split into training and testing sets, and a random forest classifier is trained on the training set. The pipeline is built using `Pipeline(steps=[('scaler', StandardScaler()), ('classifier', RandomForestClassifier())])`, and the model is trained with the `fit()` method.
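A minimal sketch of this stage might look like the following. Loading the data with `load_iris`, the 80/20 split, the stratification, and the fixed `random_state` are assumptions for the example; the repository's `src/pipeline.py` may load and split the data differently.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the Iris data is loaded from scikit-learn's bundled copy.
X, y = load_iris(return_X_y=True)

# Assumption: an 80/20 split, stratified so each class keeps its share.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling and the classifier are chained so both are fitted on training data only.
model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])
model.fit(X_train, y_train)
```

Keeping the scaler inside the `Pipeline` means its statistics come from the training set alone, which avoids leaking information from the testing set.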
## Model Evaluation
The trained model is evaluated on the testing set using `accuracy_score` and `balanced_accuracy_score` from scikit-learn's `metrics` module. The results are printed to the console.
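As a rough, self-contained illustration of the evaluation step, the sketch below trains the same kind of pipeline and scores it. The data loading, split parameters, and output format are assumptions for the example and are not taken from the repository.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: same data loading and split as a typical Iris example.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])
model.fit(X_train, y_train)

# Predict on the held-out data and compare against the true labels.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
balanced = balanced_accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.3f}")
print(f"Balanced accuracy: {balanced:.3f}")
```

Balanced accuracy is the mean of the per-class recalls. Because Iris has equal class counts and the split in this sketch is stratified, the two scores coincide here; they diverge on imbalanced data, which is where balanced accuracy is most informative.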
## Authors
[Nicks M. Gitobu, Software Engineer](https://www.linkedin.com/in/nicholas-gitobu-973b081b9/)
## License
This project is licensed under the **MIT License**.