# `Kedro` Machine Learning Pipeline 🏯




"This DALL-E generated image, within Japan, Kedro orchestrates the rhythm of renewable insights amidst the choreography of data and predictions."

## 📘 Introduction

In this project, I challenged myself to **transform notebook-based code for model training into a Kedro pipeline**. The goal is to create modular, easy-to-train pipelines that follow the **best MLOps practices**, simplifying the deployment of ML models. With Kedro, you can execute just one command to train your models and obtain your pickle files, performance figures, etc. (Yes, just **ONE** command✌️). Parameters can be easily adjusted in a YAML file, allowing for the addition of different steps and the testing of various models with ease. Additionally, Kedro provides **visualization** and **logging features** to keep you informed about everything. **You can create all types of pipelines, not only for machine learning but for any data-driven workflow.**
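To give a feel for what this looks like in code, here is a minimal, hypothetical sketch of a Kedro node and pipeline; the names are illustrative only, and the real definitions live under `src/pipelines/` in this repo.

```python
# Minimal sketch of a Kedro node + pipeline (illustrative names only;
# the actual pipelines live under src/pipelines/ in this repository).
import pandas as pd
from kedro.pipeline import Pipeline, node, pipeline
from xgboost import XGBRegressor


def train_xgboost(x_train: pd.DataFrame, y_train: pd.Series, params: dict) -> XGBRegressor:
    """Fit an XGBoost regressor with hyperparameters taken from the YAML config."""
    model = XGBRegressor(**params)
    model.fit(x_train, y_train)
    return model


def create_pipeline(**kwargs) -> Pipeline:
    # Dataset names are resolved against conf/base/catalog.yml, and
    # "params:xgboost" is injected from the parameters YAML files.
    return pipeline(
        [
            node(
                func=train_xgboost,
                inputs=["X_train", "y_train", "params:xgboost"],
                outputs="xgboost_model",
                name="train_xgboost_node",
            ),
        ]
    )
```

Running `kedro run` then executes every registered node in dependency order and persists the outputs wherever the data catalog says they should go.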

For an in-depth understanding of **Kedro**, consider exploring the official documentation at [Kedro's Documentation](https://docs.kedro.org/en/stable/introduction/index.html).

Additionally, I integrated a **CI pipeline** on `GitHub Actions` for **code quality checks** and **code functionality assurance**, enhancing reliability and maintainability ✅

## 🎯 Project Goals

The objectives were:
- **Transition to Production**: Convert code from `Jupyter Notebooks` to a `production-ready` and easily deployable format.
- **Model Integration**: Facilitate the straightforward addition of models, along with their performance metrics, into the pipeline.
- **Workflow Optimization**: Utilize the `Kedro framework` to establish reproducible, modular, and scalable data workflows.
- **CI/CD Automation**: Implement an **automated CI/CD pipeline** using `GitHub Actions` to ensure continuous testing and code quality management.
- **Dockerization**: Develop a **Dockerized pipeline** for ease of use, incorporating `Docker volumes` for persistent data management.

## 🛠️ Preparation & Prototyping in Notebooks

Before I started making `Kedro pipelines`, I tried out my ideas in Jupyter notebooks. Check the `notebooks` folder to see how I did it:

- **[EDA & Data Preparation - Energy_Forecasting.ipynb](./notebooks/EDA%20&%20Data%20Preparation%20-%20Energy_Forecasting.ipynb)**: Offers insights into how I analyzed the data and prepared it for modeling, including data preparation and cleaning processes.

- **[Machine Learning - Energy_Forecasting.ipynb](./notebooks/Machine%20Learning%20-%20Energy_Forecasting.ipynb)**: Documents how I evaluated and trained various Machine Learning models, testing the **Random Forest**, **XGBoost**, and **LightGBM** models.

## 🧩 Project Workflow

Within the `src` directory lies the essence, with each component neatly arranged in a Kedro pipeline:

- **Data Processing**: Standardizes and cleans data in ZIP and CSV formats, preparing it for analysis. 🔍
- **Feature Engineering**: Creates new features from the processed data. 🛠️
- **Train-Test Split Pipeline**: A dedicated pipeline to split the data into training and test sets. 📊
- **Model Training + Model Evaluation**: Constructs separate pipelines for **XGBoost**, **LightGBM**, and **Random Forest**, each modular, independent, and capable of training asynchronously (see the registry sketch below). 🤖
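The snippet below is a rough, hypothetical sketch of how such modular pipelines are typically stitched together in a Kedro `pipeline_registry.py`; the import paths are assumptions, and the real registry lives under `src/energy_forecasting_model/`.

```python
# Hypothetical pipeline_registry.py sketch: each modular pipeline is built
# separately, then composed or exposed on its own (import paths are assumed).
from kedro.pipeline import Pipeline

from .pipelines import (
    data_processing_pipeline,
    feature_engineering_pipeline,
    train_test_split_pipeline,
    xgboost_training_pipeline,
)


def register_pipelines() -> dict[str, Pipeline]:
    data_processing = data_processing_pipeline.create_pipeline()
    feature_engineering = feature_engineering_pipeline.create_pipeline()
    split = train_test_split_pipeline.create_pipeline()
    xgboost = xgboost_training_pipeline.create_pipeline()

    return {
        # `kedro run` with no arguments executes the full chain ...
        "__default__": data_processing + feature_engineering + split + xgboost,
        # ... while `kedro run --pipeline=xgboost_training` trains only that model.
        "xgboost_training": xgboost,
    }
```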

### Kedro Visualization

The `Kedro Viz tool` provides an interactive canvas to visualize and **understand the pipeline structure**. It illustrates data flow, dependencies, and the orchestration of nodes and pipelines. Here is the visualization of this project:

![kedro-pipeline](https://github.com/labrijisaad/Kedro-Energy-Forecasting-Machine-Learning-Pipeline/assets/74627083/c4b87f0c-d08d-4daf-896a-8e40dcf720a9)

With this tool, the understanding of data progression, outputs, and interactivity is greatly simplified. Kedro Viz allows users to inspect samples of data, view parameters, analyze figures, and much more, enriching the user experience with enhanced transparency and interactivity.

## 📜 Logging and Monitoring

Logging is integral to understanding and troubleshooting pipelines. This project leverages Kedro's logging capabilities to provide real-time insights into pipeline execution, highlighting progress, warnings, and errors. This GIF demonstrates the use of the `kedro run` or `make run` command, showcasing the logging output in action:



Notice how the nodes are executed sequentially, and observe the **RMSE outputs during validation** for the **XGBoost model**. Logging in Kedro is highly customizable, allowing for tailored monitoring that meets the user's specific needs.
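As an illustration of the kind of message behind that output, a node can simply use the standard `logging` module and Kedro routes the records through its configured handlers; the function below is a hypothetical example, not this repo's actual code.

```python
# Hypothetical node showing how a validation RMSE could be logged;
# Kedro picks up the standard logging module and formats/routes the records.
import logging

import numpy as np

logger = logging.getLogger(__name__)


def log_validation_rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Compute the validation RMSE and surface it in the pipeline logs."""
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    logger.info("Validation RMSE: %.4f", rmse)
    return rmse
```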

## 📁 Project Structure

A _simplified_ overview of the Kedro project's structure:

```
Kedro-Energy-Forecasting/
│
├── conf/                                                # Configuration files for Kedro project
│   ├── base/
│   │   ├── catalog.yml                                  # Data catalog with dataset definitions
│   │   ├── parameters_data_processing_pipeline.yml      # Parameters for data processing
│   │   ├── parameters_feature_engineering_pipeline.yml  # Parameters for feature engineering
│   │   ├── parameters_random_forest_pipeline.yml        # Parameters for Random Forest pipeline
│   │   ├── parameters_lightgbm_training_pipeline.yml    # Parameters for LightGBM pipeline
│   │   ├── parameters_train_test_split_pipeline.yml     # Parameters for train-test split
│   │   └── parameters_xgboost_training_pipeline.yml     # Parameters for XGBoost training
│   └── local/
│
├── data/
│   ├── 01_raw/              # Raw, unprocessed datasets
│   ├── 02_processed/        # Cleaned and processed data ready for analysis
│   ├── 03_training_data/    # Train/Test datasets used for model training
│   ├── 04_reporting/        # Figures and results after running the pipelines
│   └── 05_model_output/     # Trained pickle models
│
├── src/
│   ├── pipelines/
│   │   ├── data_processing_pipeline/      # Data processing pipeline
│   │   ├── feature_engineering_pipeline/  # Feature engineering pipeline
│   │   ├── random_forest_pipeline/        # Random Forest pipeline
│   │   ├── lightgbm_training_pipeline/    # LightGBM pipeline
│   │   ├── train_test_split_pipeline/     # Train-test split pipeline
│   │   └── xgboost_training_pipeline/     # XGBoost training pipeline
│   └── energy_forecasting_model/          # Main module for the forecasting model
│
├── .gitignore               # Untracked files to ignore
├── Makefile                 # Set of tasks to be executed
├── Dockerfile               # Instructions for building a Docker image
├── .dockerignore            # Files and directories to ignore in Docker builds
├── README.md                # Project documentation and setup guide
└── requirements.txt         # Project dependencies
```

## 🚀 Getting Started

First, **Clone the Repository** to download a copy of the code onto your local machine. Before diving into transforming **raw data** into a **trained pickle Machine Learning model**, please note:

#### 🔴 Important Preparation Steps

Before you begin, please follow these preliminary steps to ensure a smooth setup:

- **Clear Existing Data Directories**: If you're planning to run the pipeline, I recommend removing these directories if they exist: `data/02_processed`, `data/03_training_data`, `data/04_reporting`, and `data/05_model_output` (leave only `data/01_raw` in the `data` folder). They will be recreated or updated once the pipeline runs. These directories are tracked in version control to provide you with a glimpse of the **expected outputs**.

- **Makefile Usage**: To utilize the Makefile for running commands, you must have `make` installed on your system. Follow the instructions in the [installation guide](https://sp21.datastructur.es/materials/guides/make-install.html) to set it up.

Here is an example of the available targets (displayed when you run `make` from the command line):



- **Running the Kedro Pipeline**:
  - For **production** environments, initialize your setup by executing `make prep-doc` or running `pip install -r docker-requirements.txt` to install the production dependencies.
  - For a **development** environment, where you may want to use **Kedro Viz**, work with **Jupyter notebooks**, or test everything thoroughly, run `make prep-dev` or `pip install -r dev-requirements.txt` to install all the development dependencies.

### 🌿 Standard Method (Conda / venv)

Adopt this method if you prefer a traditional Python development environment setup using Conda or venv.

1. **Set Up the Environment**: Initialize a virtual environment with Conda or venv to isolate and manage your project's dependencies.

2. **Install Dependencies**: Inside your virtual environment, execute `pip install -r dev-requirements.txt` to install the necessary Python libraries.

3. **Run the Kedro Pipeline**: Trigger the pipeline processing by running `make run` or directly with `kedro run`. This step orchestrates your data transformation and modeling.

4. **Review the Results**: Inspect the `04_reporting` and `05_model_output` directories to assess the performance and outcomes of your models (a quick model-loading sketch follows this list).

5. **(Optional) Explore with Kedro Viz**: To visually explore your pipeline's structure and data flows, initiate Kedro Viz using `make viz` or `kedro viz run`.
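If you want a quick sanity check of a trained model, something along the following lines works; the filenames here are assumptions, so check `conf/base/catalog.yml` for the paths the pipeline actually writes.

```python
# Quick sanity check of a trained model (filenames are assumed; see the
# data catalog for the actual paths produced by the pipeline).
import pickle

import pandas as pd

with open("data/05_model_output/xgboost_model.pkl", "rb") as f:
    model = pickle.load(f)

x_test = pd.read_csv("data/03_training_data/X_test.csv")  # assumed filename
print(model.predict(x_test.head()))
```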

### 🐳 Docker Method

Prefer this method for a containerized approach, ensuring a consistent development environment across different machines. Ensure Docker is operational on your system before you begin.

1. **Build the Docker Image**: Construct your Docker image with `make build` or `kedro docker build`. This command leverages `dev-requirements.txt` for environment setup. For advanced configurations, see the [Kedro Docker Plugin Documentation](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker).

2. **Run the Pipeline Inside a Container**: Execute the pipeline within Docker using `make dockerun` or `kedro docker run`. Kedro-Docker meticulously handles volume mappings to ensure seamless data integration between your local setup and the Docker environment.

3. **Access the Results**: Upon completion, the `04_reporting` and `05_model_output` directories will contain your model's reports and trained files, ready for review.

For additional assistance or to explore more command options, refer to the **Makefile** or consult `kedro --help`.

## 🌌 Next Steps?
With our **Kedro Pipeline** 🏗 now capable of efficiently **transforming raw data** 🔄 into **trained models** 🤖, and the introduction of a Dockerized environment 🐳 for our code, the next phase involves _advancing beyond the current repository scope_ 🚀 to `orchestrate data updates automatically` using tools like **Databricks**, **Airflow**, or **Azure Data Factory**. This progression allows for the seamless integration of fresh data into our models.

Moreover, implementing `experiment tracking and versioning` with **MLflow** 📊 (sketched below) or leveraging **Kedro Viz**'s versioning capabilities 📈 will significantly enhance our project's management and reproducibility. These steps are pivotal for maintaining a clean machine learning workflow that not only achieves our goal of simplifying model training processes 🛠 but also ensures our system remains dynamic and scalable with **minimal effort**.
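For reference, experiment tracking with MLflow could look roughly like this inside a training node; MLflow is not yet part of this repo, so everything below is a hypothetical sketch with example values.

```python
# Hypothetical MLflow tracking sketch (not yet wired into this project).
import mlflow

with mlflow.start_run(run_name="xgboost_training"):
    mlflow.log_params({"n_estimators": 500, "learning_rate": 0.05})  # example values
    mlflow.log_metric("validation_rmse", 123.4)                      # example value
    mlflow.log_artifact("data/05_model_output/xgboost_model.pkl")    # assumed path
```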

## 🌐 Let's Connect!

You can connect with me on **LinkedIn** or check out my **GitHub repositories**.


