This repo showcases a project that transforms ML model training into a simplified, production-ready Kedro Dockerized Pipeline. It emphasizes best MLOps practices, enabling easy training, evaluation, and deployment of models, including XGBoost, LightGBM and Random Forest, with built-in visualization and logging features for effective monitoring.
- Host: GitHub
- URL: https://github.com/labrijisaad/kedro-energy-forecasting-machine-learning-pipeline
- Owner: labrijisaad
- Created: 2024-03-27T12:05:30.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-04-05T21:04:38.000Z (9 months ago)
- Last Synced: 2024-04-05T22:45:56.189Z (9 months ago)
- Topics: cicd, docker, docker-volume, github-actions, jupyter-notebook, kedro, kedro-docker, lightgbm, machine-learning-pipeline, mlops, mlops-workflow, random-forest, xgboost
- Language: Jupyter Notebook
- Homepage:
- Size: 27.2 MB
- Stars: 6
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Metadata Files:
- Readme: README.md
README
# `Kedro` Machine Learning Pipeline
"This DALL-E generated image, within Japan, Kedro orchestrates the rhythm of renewable insights amidst the choreography of data and predictions."## π Introduction
In this project, I challenged myself to **transform notebook-based code for model training into a Kedro pipeline**. The goal is to create modular, easy-to-train pipelines that follow the **best MLOps practices**, simplifying the deployment of ML models. With Kedro, you can execute just one command to train your models and obtain your pickle files, performance figures, etc. (Yes, just **ONE** command!) Parameters can be easily adjusted in a YAML file, allowing you to add steps and test different models with ease. Additionally, Kedro provides **visualization** and **logging features** to keep you informed about everything. **You can create all types of pipelines, not only for machine learning but for any data-driven workflow.**
For an in-depth understanding of **Kedro**, consider exploring the official documentation at [Kedro's Documentation](https://docs.kedro.org/en/stable/introduction/index.html).
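To make this concrete, here is a minimal, hypothetical Kedro node and pipeline; the function and dataset names are invented for illustration and are not taken from this repo:

```python
from kedro.pipeline import Pipeline, node, pipeline


def clean_energy_data(raw_data):
    """Illustrative node: drop missing readings and sort chronologically."""
    return raw_data.dropna().sort_index()


def create_pipeline(**kwargs) -> Pipeline:
    # "raw_energy_data" and "clean_energy_data" stand in for datasets
    # that would be declared in conf/base/catalog.yml.
    return pipeline(
        [
            node(
                func=clean_energy_data,
                inputs="raw_energy_data",
                outputs="clean_energy_data",
                name="clean_energy_data_node",
            ),
        ]
    )
```

Once pipelines are registered, a single `kedro run` resolves every input and output through the data catalog and executes the nodes in dependency order.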
Additionally, I integrated a **CI pipeline** on `GitHub Actions` for **code quality checks** and **code functionality assurance**, enhancing reliability and maintainability.
## Project Goals
The objectives were:
- **Transition to Production**: Convert code from `Jupyter Notebooks` to a `production-ready` and easily deployable format.
- **Model Integration**: Facilitate the straightforward addition of models, along with their performance metrics, into the pipeline.
- **Workflow Optimization**: Utilize the `Kedro framework` to establish reproducible, modular, and scalable data workflows.
- **CI/CD Automation**: Implement an **automated CI/CD pipeline** using `GitHub Actions` to ensure continuous testing and code quality management.
- **Dockerization**: Develop a **Dockerized pipeline** for ease of use, incorporating `Docker volumes` for persistent data management.

## Preparation & Prototyping in Notebooks
Before building the `Kedro pipelines`, I tried out my ideas in Jupyter notebooks. Check the `notebooks` folder to see how I did it:
- **[EDA & Data Preparation - Energy_Forecasting.ipynb](./notebooks/EDA%20&%20Data%20Preparation%20-%20Energy_Forecasting.ipynb)**: Shows how I analyzed the data and prepared it for modeling, including the data cleaning process.
- **[Machine Learning - Energy_Forecasting.ipynb](./notebooks/Machine%20Learning%20-%20Energy_Forecasting.ipynb)**: Documents how I trained and evaluated various Machine Learning models, testing the **Random Forest**, **XGBoost**, and **LightGBM** models.
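The kind of model comparison prototyped in these notebooks might look like the following self-contained sketch; the synthetic data and hyperparameters are placeholders, not values from the actual notebooks:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Synthetic stand-in for the energy dataset, split chronologically
# (train on the past, test on the future).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=500)
split = 400
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

models = {
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "xgboost": XGBRegressor(n_estimators=200),
    "lightgbm": LGBMRegressor(n_estimators=200),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{name}: RMSE = {rmse:.3f}")
```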
## Project Workflow

Within the `src` directory lies the essence, with each component neatly arranged in a Kedro pipeline:
- **Data Processing**: Standardizes and cleans data in ZIP and CSV formats, preparing it for analysis.
- **Feature Engineering**: Creates new features.
- **Train-Test Split Pipeline**: A dedicated pipeline to split the data into training and test sets.
- **Model Training + Model Evaluation**: Constructs separate pipelines for **XGBoost**, **LightGBM**, and **Random Forest**; each is modular and independent and can be trained asynchronously.
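As a rough sketch of how one of these training pipelines can stay modular, here is a hypothetical XGBoost training node; the dataset and parameter names are invented, and the real pipeline's wiring may differ:

```python
from kedro.pipeline import node
from xgboost import XGBRegressor


def train_xgboost(X_train, y_train, model_params: dict) -> XGBRegressor:
    """Hypothetical node: fit an XGBoost regressor with YAML-driven params."""
    model = XGBRegressor(**model_params)
    model.fit(X_train, y_train)
    return model


# "params:xgboost_model_params" asks Kedro to inject the matching entry
# from the pipeline's parameters YAML file (the key name is invented).
train_node = node(
    func=train_xgboost,
    inputs=["X_train", "y_train", "params:xgboost_model_params"],
    outputs="xgboost_model",
    name="train_xgboost_node",
)
```

Because each model gets its own pipeline of such nodes, swapping hyperparameters means editing a YAML file rather than touching the training code.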
### Kedro Visualization

The `Kedro Viz` tool provides an interactive canvas to visualize and **understand the pipeline structure**. It illustrates data flow, dependencies, and the orchestration of nodes and pipelines. Here is the visualization of this project:
![kedro-pipeline](https://github.com/labrijisaad/Kedro-Energy-Forecasting-Machine-Learning-Pipeline/assets/74627083/c4b87f0c-d08d-4daf-896a-8e40dcf720a9)
With this tool, understanding data progression, outputs, and interactions is greatly simplified. Kedro Viz also allows users to inspect samples of data, view parameters, analyze figures, and much more, enriching the user experience with enhanced transparency and interactivity.
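Kedro Viz draws whatever the project's pipeline registry exposes. A minimal, hypothetical `pipeline_registry.py` might look like this (module paths and pipeline names are illustrative, not copied from this repo):

```python
from kedro.pipeline import Pipeline

# Illustrative imports; the actual module layout in this repo may differ.
from .pipelines import data_processing_pipeline, xgboost_training_pipeline


def register_pipelines() -> dict:
    """Map names to Pipeline objects; `__default__` is what `kedro run` executes."""
    data_processing = data_processing_pipeline.create_pipeline()
    xgboost_training = xgboost_training_pipeline.create_pipeline()
    return {
        "data_processing": data_processing,
        "xgboost_training": xgboost_training,
        "__default__": data_processing + xgboost_training,
    }
```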
## Logging and Monitoring
Logging is integral to understanding and troubleshooting pipelines. This project leverages Kedro's logging capabilities to provide real-time insights into pipeline execution, highlighting progress, warnings, and errors. This GIF demonstrates the use of the `kedro run` or `make run` command, showcasing the logging output in action:
Notice how the nodes are executed sequentially, and observe the **RMSE outputs during validation** for the **XGBoost model**. Logging in Kedro is highly customizable, allowing for tailored monitoring that meets the user's specific needs.
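Under the hood, nodes can emit these messages through the standard Python logger, which Kedro's logging configuration routes to the console. A minimal, hypothetical evaluation node might log its RMSE like this:

```python
import logging

import numpy as np
from sklearn.metrics import mean_squared_error

logger = logging.getLogger(__name__)


def evaluate_model(model, X_val, y_val) -> float:
    """Hypothetical node: compute and log the validation RMSE."""
    rmse = float(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))
    # This line shows up in the `kedro run` console output.
    logger.info("Validation RMSE: %.4f", rmse)
    return rmse
```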
## Project Structure
A _simplified_ overview of the Kedro project's structure:
```
Kedro-Energy-Forecasting/
│
├── conf/                          # Configuration files for Kedro project
│   ├── base/
│   │   ├── catalog.yml                                  # Data catalog with dataset definitions
│   │   ├── parameters_data_processing_pipeline.yml      # Parameters for data processing
│   │   ├── parameters_feature_engineering_pipeline.yml  # Parameters for feature engineering
│   │   ├── parameters_random_forest_pipeline.yml        # Parameters for Random Forest pipeline
│   │   ├── parameters_lightgbm_training_pipeline.yml    # Parameters for LightGBM pipeline
│   │   ├── parameters_train_test_split_pipeline.yml     # Parameters for train-test split
│   │   └── parameters_xgboost_training_pipeline.yml     # Parameters for XGBoost training
│   └── local/
│
├── data/
│   ├── 01_raw/               # Raw, unprocessed datasets
│   ├── 02_processed/         # Cleaned and processed data ready for analysis
│   ├── 03_training_data/     # Train/Test datasets used for model training
│   ├── 04_reporting/         # Figures and results after running the pipelines
│   └── 05_model_output/      # Trained pickle models
│
├── src/
│   ├── pipelines/
│   │   ├── data_processing_pipeline/      # Data processing pipeline
│   │   ├── feature_engineering_pipeline/  # Feature engineering pipeline
│   │   ├── random_forest_pipeline/        # Random Forest pipeline
│   │   ├── lightgbm_training_pipeline/    # LightGBM pipeline
│   │   ├── train_test_split_pipeline/     # Train-test split pipeline
│   │   └── xgboost_training_pipeline/     # XGBoost training pipeline
│   └── energy_forecasting_model/          # Main module for the forecasting model
│
├── .gitignore                # Untracked files to ignore
├── Makefile                  # Set of tasks to be executed
├── Dockerfile                # Instructions for building a Docker image
├── .dockerignore             # Files and directories to ignore in Docker builds
├── README.md                 # Project documentation and setup guide
└── requirements.txt          # Project dependencies
```

## Getting Started
First, **clone the repository** to download a copy of the code onto your local machine. Before diving into transforming **raw data** into a **trained pickle Machine Learning model**, please note:
#### Important Preparation Steps
Before you begin, please follow these preliminary steps to ensure a smooth setup:
- **Clear Existing Data Directories**: If you're planning to run the pipeline, I recommend removing these directories if they exist: `data/02_processed`, `data/03_training_data`, `data/04_reporting`, and `data/05_model_output` (leave only `data/01_raw` in the `data` folder). They will be recreated or updated once the pipeline runs. These directories are tracked in version control to give you a glimpse of the **expected outputs**.
- **Makefile Usage**: To utilize the Makefile for running commands, you must have `make` installed on your system. Follow the instructions in the [installation guide](https://sp21.datastructur.es/materials/guides/make-install.html) to set it up.
Typing `make` on its own in the command line lists the available targets.
- **Running the Kedro Pipeline**:
- For **production** environments, initialize your setup by executing `make prep-doc` or using `pip install -r docker-requirements.txt` to install the production dependencies.
- For a **development** environment, where you may want to use **Kedro Viz**, work with **Jupyter notebooks**, or test everything thoroughly, run `make prep-dev` or `pip install -r dev-requirements.txt` to install all the development dependencies.

### Standard Method (Conda / venv)
Adopt this method if you prefer a traditional Python development environment setup using Conda or venv.
1. **Set Up the Environment**: Initialize a virtual environment with Conda or venv to isolate and manage your project's dependencies.
2. **Install Dependencies**: Inside your virtual environment, execute `pip install -r dev-requirements.txt` to install the necessary Python libraries.
3. **Run the Kedro Pipeline**: Trigger the pipeline processing by running `make run` or directly with `kedro run`. This step orchestrates your data transformation and modeling.
4. **Review the Results**: Inspect the `04_reporting` and `05_model_output` directories to assess the performance and outcomes of your models.
5. **(Optional) Explore with Kedro Viz**: To visually explore your pipeline's structure and data flows, initiate Kedro Viz using `make viz` or `kedro viz run`.

### Docker Method
Prefer this method for a containerized approach, ensuring a consistent development environment across different machines. Ensure Docker is operational on your system before you begin.
1. **Build the Docker Image**: Construct your Docker image with `make build` or `kedro docker build`. This command leverages `dev-requirements.txt` for environment setup. For advanced configurations, see the [Kedro Docker Plugin Documentation](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker).
2. **Run the Pipeline Inside a Container**: Execute the pipeline within Docker using `make dockerun` or `kedro docker run`. Kedro-Docker meticulously handles volume mappings to ensure seamless data integration between your local setup and the Docker environment.
3. **Access the Results**: Upon completion, the `04_reporting` and `05_model_output` directories will contain your model's reports and trained files, ready for review.

For additional assistance or to explore more command options, refer to the **Makefile** or consult `kedro --help`.
## Next Steps?
With our **Kedro Pipeline** now capable of efficiently **transforming raw data** into **trained models**, and with a Dockerized environment for the code, the next phase involves _advancing beyond the current repository scope_ to orchestrate data updates automatically using tools like **Databricks**, **Airflow**, or **Azure Data Factory**. This progression allows for the seamless integration of fresh data into our models.

Moreover, implementing `experiment tracking and versioning` with **MLflow**, or leveraging **Kedro Viz**'s versioning capabilities, will significantly enhance the project's management and reproducibility. These steps are pivotal for maintaining a clean machine learning workflow that not only simplifies model training but also keeps the system dynamic and scalable with **minimal effort**.
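As a taste of what that tracking could look like, here is a minimal MLflow sketch; it is not part of this repo, and the parameter and metric values are placeholders:

```python
import mlflow

# Placeholder values; in a Kedro node these would come from the
# parameters file and the evaluation step.
params = {"n_estimators": 200, "learning_rate": 0.05}
validation_rmse = 42.0

with mlflow.start_run(run_name="xgboost_training"):
    mlflow.log_params(params)
    mlflow.log_metric("validation_rmse", validation_rmse)
```

Each run then appears in the MLflow UI, making it easy to compare models and reproduce past results.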
## Let's Connect!
You can connect with me on **LinkedIn** or check out my **GitHub repositories**: