https://github.com/tmthyjames/ds-meetup-ml-repro
Data Science Chattanooga Meetup material for the talk on Reproducible Machine Learning
https://github.com/tmthyjames/ds-meetup-ml-repro
Last synced: about 1 year ago
JSON representation
Data Science Chattanooga Meetup material for the talk on Reproducible Machine Learning
- Host: GitHub
- URL: https://github.com/tmthyjames/ds-meetup-ml-repro
- Owner: tmthyjames
- Created: 2022-10-19T16:44:51.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-11-11T14:20:50.000Z (over 3 years ago)
- Last Synced: 2025-01-12T15:31:30.680Z (over 1 year ago)
- Language: Jupyter Notebook
- Size: 16.9 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# reproml
This repo contains the `reproml` project.
## Just let me run it!
To pull down and reproduce everything and get access to the CLI just run:
if you don't already have `pipenv`, you can install it using `pip`
```commandline
pip install pipenv
```
Then do
```commandline
git clone https://github.com/tmthyjames/ds-meetup-ml-repro.git
cd ds-meetup-ml-repro
pipenv install --dev
```
That's it! You're ready to start reproducing the data and ML pipelines.
## Now you have access to the `reproml` CLI:
To activate the virtual env shell
```commandline
pipenv shell
```
Now you can run the `reproml` commands:
```commandline
❯ reproml --help ─╯
Usage: reproml [OPTIONS] COMMAND [ARGS]...
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
etl
ml
prepro
validate
```
Which comes with sub commands for each phase (ETL, Preprocessing, ML, Validating).
To view the help page for each subcommand run `reproml --help` like so:
```commandline
❯ reproml etl --help ─╯
Usage: reproml etl [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
get-lyrics run the lyrics data jobs.
run-all Run all jobs
```
So to run the ETL just do this:
```commandline
reproml etl get-lyrics
```
And this will populate the `data/raw` folder, then you can run the propro subcommand
to populate the `data/processed` folder:
```commandline
reproml prepro process
```
You can also import these commands into a notebook environment and use as python functions:
```python
import pandas as pd
from reproml.etl.lyrics import get_lyrics
from reproml.preprocess import process_lyrics
raw_path = get_lyrics()
prepro_path = process_lyrics(srcpath=raw_path)
df = pd.read_parquet(prepro_path)
df.head()
```
## DVC (Data Version Control)
To run this with DVC and thus track all the outputs and dependencies:
```commandline
dvc repro
```
Will reproduce all the stages (etl, ml, preprocessing, validation). Here's a DAG to see the full pipeline:
```commandline
❯ dvc dag
+------------+
| get-lyrics |
+------------+
*
*
*
+---------------+
| prepro-lyrics |
+---------------+
*
*
*
+--------------+
| split-lyrics |
+--------------+
*** ***
** **
** **
+-------------+ **
| train-model | **
+-------------+ **
*** ***
** **
** **
+----------------+
| validate-model |
+----------------+
```
# To run individual stages with DVC:
The names of the nodes in the DAG are the names of the stages, so to run `get-lyrics`
for example:
```commandline
dvc repro -s get-lyrics
```
## To contribute or set up for development, here are the development pre-requisites
To work with this repo install the following prerequisites:
* python 3.8+
* pre-commit:
```
brew install pre-commit
```
* pipenv
```
conda install pip
pip install pipenv
```
**Setup for Development**
After prerequisites are installed, run the following commands to clone the repo and configure it for development:
```
cd
# Install pre-commit hooks to local clone
pre-commit install
# Install pipenv environment
pipenv install --dev
# Create IPython Kernel for the virtual environment
pipenv run ipython kernel install --user --name=reproml
```
The `pipenv install` command above creates an isolated virtual environment for this repo with
all dev dependencies installed. There are two main ways of using the virtual environment. To
run a one off command in the environment use `pipenv run`. For example, the following command
will show the location of the python executable for the environment:
```
pipenv run which python
```
Alternatively, for running many commands `pipenv shell` is used to spawn a new shell in the
virtual environment.
```
cd
pipenv shell
```
## Running Tests
To run the unit tests run the following command (optionally adding `--cov` for a coverage report):
```
pytest
```
## Troubleshooting
If you ever have trouble with the python environment,
many problems can be resolved by "rebooting" it
by running these commands in the repo root:
```
pipenv --rm
pipenv install --dev
```
This will resolve common problems with packages not being found, etc.