Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/neemiasbsilva/kedro_orchestrate_dl_transformer_arch
A pipeline using Kedro to orchestrate the deployment of a deep learning transformer model for classifying toxic comments. This project integrates data preprocessing, model training, and deployment into a streamlined and reproducible workflow, enabling efficient handling of the toxic comment classification problem in NLP.
https://github.com/neemiasbsilva/kedro_orchestrate_dl_transformer_arch
deep-learning fastapi grafana kedro mlflow mlops-project mlops-workflow nlp prometheus
Last synced: 11 days ago
JSON representation
A pipeline using Kedro to orchestrate the deployment of a deep learning transformer model for classifying toxic comments. This project integrates data preprocessing, model training, and deployment into a streamlined and reproducible workflow, enabling efficient handling of the toxic comment classification problem in NLP.
- Host: GitHub
- URL: https://github.com/neemiasbsilva/kedro_orchestrate_dl_transformer_arch
- Owner: neemiasbsilva
- Created: 2024-12-15T22:38:36.000Z (24 days ago)
- Default Branch: main
- Last Pushed: 2024-12-28T16:23:22.000Z (11 days ago)
- Last Synced: 2024-12-28T16:30:56.890Z (11 days ago)
- Topics: deep-learning, fastapi, grafana, kedro, mlflow, mlops-project, mlops-workflow, nlp, prometheus
- Language: Python
- Homepage:
- Size: 925 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Applying Kedro Orchestration Pipelines to Deploy a Deep Learning Transformer Architecture for the Toxic Comment Classification Problem
[![Powered by Kedro](https://img.shields.io/badge/powered_by-kedro-ffc900?logo=kedro)](https://kedro.org)
![TensorFlow](https://img.shields.io/badge/TensorFlow-%23FF6F00.svg?style=for-the-badge&logo=TensorFlow&logoColor=white)
![FastAPI](https://img.shields.io/badge/FastAPI-005571?style=for-the-badge&logo=fastapi)
![mlflow](https://img.shields.io/badge/mlflow-%23d9ead3.svg?style=for-the-badge&logo=numpy&logoColor=blue)## Project Overview
This project leverages Kedro, a data and machine learning pipeline orchestration tool, to classify toxic comments using deep learning transformer architecture. It integrates Kedro-Viz for pipeline visualization and follows data engineering best practices for reproducibility and maintainability.
### Key Features
- **Pipeline Visualization**: Integrates Kedro-Viz to provide an interactive visual representation of data workflows for better debugging and workflow comprehension.
- **Multilingual Dataset Support**: Handles multilingual datasets, including English, with preprocessing pipelines optimized for transformer models.
- **Transformer-Based Modeling**: Supports state-of-the-art deep learning models, such as BERT, for toxicity prediction.### Tech Stack
- Kedro `v0.19.10`
- TensorFlow for model training
- FastAPI for serving predictions
- MLflow for experiment tracking
- Docker for containerization
- Prometheus and Grafana for monitoringFor more information, see the [Kedro documentation](https://docs.kedro.org).
---
## System Design
![alt text](figures/MLOps.png)
---
## Kedro Pipeline Overview
![alt text](figures/kedro_pipeline.png)
---
## Dataset Overview
This project utilizes datasets from the Jigsaw ["Toxic Comment Classification"](https://www.kaggle.com/competitions/jigsaw-multilingual-toxic-comment-classification/data?select=test.csv) competition. The datasets include labeled comments sourced from platforms like Wikipedia and Civil Comments.
Description:
- Comment text: The primary data containing user-submitted comments;
- Toxic Column: A binary label where `1` indicates toxic comments, and `0` indicates non-toxic comments.After that, you need to download the follow files:
- `jigsaw-toxic-comment-train.csv`;
- `validation.csv`;
- `test.csv`.For all the files, add in the follow path: `data/01_raw/`
---
## Rules and guidelines
To ensure consistency and reproducibility:
1. **.gitignore Compliance**: Do not remove any lines from the provided `.gitignore` file.
2. **Data Engineering Convention**: Follow [Kedro's data engineering conventions](https://docs.kedro.org/en/stable/faq/faq.html#what-is-data-engineering-convention).
3. **Data Handling**: Do not commit raw datasets to the repository.
4. **Credential Security**: Avoid committing credentials or local configurations to the repository. Store these in `conf/local/`.---
## Get started
### Installation
Install the required dependencies:
```bash
pip install -r requirements.txt
```### Running the Pipeline
Execute the Kedro pipeline:
```bash
kedro run
```### Testing the Pipeline
Run tests using `pytest`:
```bash
pytest
```### Deploying the Model API
To serve the champions model, you can build the container `toxic-comment-endpoint` using the `fastapi` package for serving.
1. Build the Docker image:
```bash
docker build -t toxic-comment-endpoint .
```2. Run the Docker container:
```bash
docker run -p 8001:8001 -e PORT=8001 toxic-comment-endpoint
```3. Click on the link `http://0.0.0.0:8001` and you see the follow web application
![alt text](figures/web_api.png)
---
## Monitoring with Prometheus and Grafana
Create a `prometheus.yml` file for monitoring and sending metrics to Grafana:
```yml
global:
scrape_interval: 15sscrape_configs:
- job_name: "fastapi"
metrics_path: "/metrics"
static_configs:
- targets: ["0.0.0.0:8001"]remote_write:
- url: ""
remote_timeout: "30s"
send_exemplars: false
follow_redirects: true
basic_auth:
username: ""
password: ""
```Run Prometheus with the configuration file:
```bash
prometheus --config.file=prometheus.yml
```### Grafana Visualization
![alt text](figures/grafana_dashboard.png)
## Additional Resources
- [Kedro Documentation](https://docs.kedro.org)
- [TensorFlow Documentation](https://www.tensorflow.org)
- [FastAPI Documentation](https://fastapi.tiangolo.com)
- [MLflow Documentation](https://mlflow.org)
- [Prometheus Documentation](https://prometheus.io)
- [Grafana Documentation](https://grafana.com)