https://github.com/danhenriquex/dvc-pipeline

Machine Learning pipeline with DVC

dvc-pipeline machine-learning python torch


Host: GitHub
URL: https://github.com/danhenriquex/dvc-pipeline
Owner: danhenriquex
Created: 2024-08-16T15:38:50.000Z (5 months ago)
Default Branch: main
Last Pushed: 2024-09-08T15:37:56.000Z (4 months ago)
Last Synced: 2024-09-08T17:40:23.077Z (4 months ago)
Topics: dvc-pipeline, machine-learning, python, torch
Language: Python
Homepage:
Size: 49.9 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

🚗 AI Project Management

Learning DVC integration for Machine Learning projects.

🚀 This project demonstrates a full machine learning pipeline using DVC (Data Version Control) and Python. The pipeline includes data preparation, feature engineering, model training, and evaluation.

Overview •
Technologies and Tools Used •
Project Structure •
Getting Started •
Running the Pipeline •
What I Learned

🚧 Machine Learning Project 🚀 Finished 🚧

### Overview

This project shows how to use DVC to manage a machine learning pipeline. It is organized into modular Python scripts, each responsible for one stage of the pipeline. The main goal is a reproducible, scalable workflow for machine learning experiments.

### Technologies and Tools Used

- Python: The programming language used for the entire pipeline.
- DVC (Data Version Control): Used for tracking data, models, and experiments.
- PyTorch: Used for building and training the machine learning model (if applicable).
- Scikit-learn: Used for data preparation and feature engineering (if applicable).
- Pandas: For data manipulation and analysis.
- Git: For version control.
- Google Drive: Used as remote storage for DVC (optional).
- Datasets: MNIST and CIFAR-10.

### Project Structure

```bash
├── prepare_data.py # Script to prepare and clean the data
├── train.py # Script to train the machine learning model
├── make_features.py # Script to create features from the raw data
├── evaluate.py # Script to evaluate the trained model
├── dvc.yaml # DVC pipeline configuration
├── .dvc/ # DVC metadata directory
├── .gitignore # Git ignore file
├── README.md # Project documentation (this file)
└── data/ # Directory containing the data (managed by DVC)
```
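DVC works out the order in which stages run from the dependencies and outputs each stage declares in `dvc.yaml`. The sketch below mimics that idea in plain Python. It is only an illustration: the stage names mirror the scripts above, but every path (`data/raw`, `data/prepared`, `data/features`, `model.pt`, `metrics.json`) is an assumption and is not taken from this repository's `dvc.yaml`.

```python
from graphlib import TopologicalSorter

# Hypothetical stage layout. The commands match the scripts in this project,
# but all data, model, and metrics paths are assumed for illustration.
STAGES = {
    "evaluate": {
        "cmd": "python evaluate.py",
        "deps": ["evaluate.py", "data/features", "model.pt"],
        "outs": ["metrics.json"],
    },
    "train": {
        "cmd": "python train.py",
        "deps": ["train.py", "data/features"],
        "outs": ["model.pt"],
    },
    "make_features": {
        "cmd": "python make_features.py",
        "deps": ["make_features.py", "data/prepared"],
        "outs": ["data/features"],
    },
    "prepare_data": {
        "cmd": "python prepare_data.py",
        "deps": ["prepare_data.py", "data/raw"],
        "outs": ["data/prepared"],
    },
}


def execution_order(stages):
    """Order stages so that each one runs after the stages producing its deps."""
    # Map every declared output to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage's predecessors are the producers of its dependencies.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


ORDER = execution_order(STAGES)
print(ORDER)
```

Even though the stages are declared out of order here, the dependency graph alone is enough to recover the sequence prepare, featurize, train, evaluate.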

### Scripts Overview

- prepare_data.py: Handles data loading, cleaning, and preprocessing.
- make_features.py: Extracts features from the preprocessed data and saves them for model training.
- train.py: Trains the machine learning model using the prepared features.
- evaluate.py: Evaluates the trained model on a test set and reports the performance.
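To make the division of work between the four scripts concrete, here is a minimal sketch with one function per script. It is not the repository's code: it assumes toy in-memory rows instead of MNIST or CIFAR-10, and it swaps in a nearest-centroid classifier written with the standard library in place of a PyTorch model. All function bodies, the `scale` parameter, and the sample data are hypothetical.

```python
from statistics import mean


def prepare_data(rows):
    """Stand-in for prepare_data.py: drop rows that contain missing values."""
    return [row for row in rows if None not in row]


def make_features(rows, scale=255.0):
    """Stand-in for make_features.py: split off the label, scale values to [0, 1]."""
    return [([value / scale for value in row[:-1]], row[-1]) for row in rows]


def train(samples):
    """Stand-in for train.py: fit one mean vector (centroid) per label."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*vectors)]
        for label, vectors in by_label.items()
    }


def predict(model, features):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


def evaluate(model, samples):
    """Stand-in for evaluate.py: accuracy on a held-out set."""
    correct = sum(predict(model, features) == label for features, label in samples)
    return correct / len(samples)


# Toy data (assumed): two values per row plus a label; one row has a missing value.
RAW_TRAIN = [(0, 10, "a"), (10, 0, "a"), (None, 20, "a"), (250, 255, "b"), (255, 250, "b")]
RAW_TEST = [(5, 5, "a"), (240, 245, "b")]

clean = prepare_data(RAW_TRAIN)
model = train(make_features(clean))
accuracy = evaluate(model, make_features(prepare_data(RAW_TEST)))
print(f"kept {len(clean)} training rows, test accuracy {accuracy:.2f}")
```

In the actual project each step is a separate script, so DVC can cache the output of one stage and skip it when only a later stage changes.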

### What I Learned

- DVC: How to use DVC to version control data, track experiments, and manage model files.
- Pipeline Structuring: The importance of organizing a machine learning project into modular scripts to create a clear and maintainable workflow.
- Reproducibility: Ensuring that experiments are reproducible by tracking data, code, and configurations.
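One way to picture how tracking data supports reproducibility: DVC records content hashes of a stage's dependencies and re-runs the stage only when a hash changes. The sketch below imitates that check with the standard library. It is a simplified illustration, not DVC's implementation; `hashes.json` is a hypothetical stand-in for the lock file and does not follow DVC's real file format, and the helper names are made up for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_hashes(deps):
    """Map each dependency path to the hash of its current contents."""
    return {str(path): file_md5(path) for path in deps}


def stage_is_stale(deps, lock_path):
    """True when no hashes are recorded yet or any dependency has changed."""
    lock = Path(lock_path)
    recorded = json.loads(lock.read_text()) if lock.exists() else {}
    return current_hashes(deps) != recorded


def record_hashes(deps, lock_path):
    """Store the current hashes (hypothetical format, not DVC's lock file)."""
    Path(lock_path).write_text(json.dumps(current_hashes(deps)))


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.csv"
    lock = Path(tmp) / "hashes.json"

    data.write_text("x,y\n1,2\n")
    first_run = stage_is_stale([data], lock)   # nothing recorded yet
    record_hashes([data], lock)
    unchanged = stage_is_stale([data], lock)   # same contents, same hash
    data.write_text("x,y\n1,3\n")
    after_edit = stage_is_stale([data], lock)  # contents changed

print(first_run, unchanged, after_edit)
```

Because the decision depends only on file contents, the same inputs always lead to the same "run or skip" answer, which is what makes a tracked pipeline repeatable.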

### Author

---