https://github.com/danhenriquex/dvc-pipeline
Machine Learning pipeline with DVC
dvc-pipeline machine-learning python torch
- Host: GitHub
- URL: https://github.com/danhenriquex/dvc-pipeline
- Owner: danhenriquex
- Created: 2024-08-16T15:38:50.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-09-08T15:37:56.000Z (2 months ago)
- Last Synced: 2024-09-08T17:40:23.077Z (2 months ago)
- Topics: dvc-pipeline, machine-learning, python, torch
- Language: Python
- Homepage:
- Size: 49.9 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
🚗 AI Project Management
Learning DVC integration for Machine Learning projects.
🚀 This project demonstrates a full machine learning pipeline using DVC (Data Version Control) and Python. The pipeline includes data preparation, feature engineering, model training, and evaluation.
Overview • Technologies and Tools Used • Project Structure • Getting Started • Running the Pipeline • What I Learned
🚧 Machine Learning Project 🚀 Finished 🚧

### Overview
This project aims to showcase the use of DVC in managing a machine learning pipeline. The project is organized into modular Python scripts, each responsible for a specific part of the pipeline. The main goal is to create a reproducible and scalable workflow for machine learning experiments.

### Features
- Python: The programming language used for the entire pipeline.
- DVC (Data Version Control): Used for tracking data, models, and experiments.
- PyTorch: Used for building and training the machine learning model (if applicable).
- Scikit-learn: Used for data preparation and feature engineering (if applicable).
- Pandas: For data manipulation and analysis.
- Git: For version control.
- Google Drive: Used as remote storage for DVC (optional).
- Datasets: MNIST and CIFAR10

### Project Structure
```bash
├── prepare_data.py # Script to prepare and clean the data
├── train.py # Script to train the machine learning model
├── make_features.py # Script to create features from the raw data
├── evaluate.py # Script to evaluate the trained model
├── dvc.yaml # DVC pipeline configuration
├── .dvc/ # DVC metadata directory
├── .gitignore # Git ignore file
├── README.md # Project documentation (this file)
└── data/ # Directory containing the data (managed by DVC)
```

### Scripts Overview
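Before the per-script notes, here is a rough sketch of how the four scripts might be ordered as pipeline stages. The stage names and the dependency edges are assumptions inferred from the script names; the repository's actual `dvc.yaml` is not shown here and may differ. The sketch uses Python's standard-library `graphlib` only to illustrate dependency ordering, not DVC itself.

```python
from graphlib import TopologicalSorter

# Hypothetical stage graph: each stage maps to the stages it depends on.
# Names mirror the script names; the real dvc.yaml may declare them differently.
STAGES = {
    "prepare_data": set(),
    "make_features": {"prepare_data"},
    "train": {"make_features"},
    "evaluate": {"train", "make_features"},
}


def execution_order(stages):
    """Return stage names in an order that respects every dependency."""
    return list(TopologicalSorter(stages).static_order())


def commands(stages):
    """Build the command assumed to run each stage, in dependency order."""
    return [f"python {name}.py" for name in execution_order(stages)]


if __name__ == "__main__":
    for cmd in commands(STAGES):
        print(cmd)
```

In a DVC project, `dvc repro` works out this kind of order from the `deps` and `outs` declared for each stage in `dvc.yaml`, so the scripts do not need to be run by hand.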
- prepare_data.py: Handles data loading, cleaning, and preprocessing.
- make_features.py: Extracts features from the preprocessed data and saves them for model training.
- train.py: Trains the machine learning model using the prepared features.
- evaluate.py: Evaluates the trained model on a test set and reports the performance.

### What I learned
- DVC: How to use DVC to version control data, track experiments, and manage model files.
- Pipeline Structuring: The importance of organizing a machine learning project into modular scripts to create a clear and maintainable workflow.
- Reproducibility: Ensuring that experiments are reproducible by tracking data, code, and configurations.

### Author
---
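The reproducibility point under What I learned rests on one idea: DVC records content hashes of each stage's dependencies and outputs in `dvc.lock`, and a stage only needs to rerun when one of them changes. The following is a minimal sketch of that idea using the standard library. It is a simplified illustration, not DVC's implementation, and the helper names (`file_md5`, `snapshot`, `changed_deps`) and the `params.txt` file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Record the current hash of every dependency (a stand-in for a lock file)."""
    return {str(p): file_md5(p) for p in paths}


def changed_deps(paths, recorded):
    """List dependencies whose content no longer matches the recorded hash."""
    return [str(p) for p in paths if recorded.get(str(p)) != file_md5(p)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        params = Path(tmp) / "params.txt"  # hypothetical dependency file
        params.write_text("lr=0.01\n")
        lock = snapshot([params])
        print(changed_deps([params], lock))  # nothing has changed yet
        params.write_text("lr=0.001\n")
        print(changed_deps([params], lock))  # the edited file is reported
```

A stage whose `changed_deps` list is empty can be skipped; any other stage would be rerun and its snapshot refreshed.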