{"id":16507606,"url":"https://github.com/tmthyjames/ds-meetup-ml-repro","last_synced_at":"2025-06-10T16:07:58.613Z","repository":{"id":61708269,"uuid":"554341052","full_name":"tmthyjames/ds-meetup-ml-repro","owner":"tmthyjames","description":"Data Science Chattanooga Meetup material for the talk on Reproducible Machine Learning","archived":false,"fork":false,"pushed_at":"2022-11-11T14:20:50.000Z","size":17685,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-12T15:31:30.680Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tmthyjames.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-10-19T16:44:51.000Z","updated_at":"2022-11-03T02:03:24.000Z","dependencies_parsed_at":"2023-01-21T14:15:27.116Z","dependency_job_id":null,"html_url":"https://github.com/tmthyjames/ds-meetup-ml-repro","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tmthyjames%2Fds-meetup-ml-repro","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tmthyjames%2Fds-meetup-ml-repro/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tmthyjames%2Fds-meetup-ml-repro/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tmthyjames%2Fds-meetup-ml-repro/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tmthyjames","download_url":"https://codeload.github.com/tmthyjames/d
s-meetup-ml-repro/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241451560,"owners_count":19964900,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-11T15:29:21.663Z","updated_at":"2025-03-02T02:43:40.894Z","avatar_url":"https://github.com/tmthyjames.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# reproml\n\nThis repo contains the `reproml` project.\n\n## Just let me run it!\n\nTo pull down and reproduce everything and get access to the CLI, follow these steps.\n\nIf you don't already have `pipenv`, install it using `pip`:\n```commandline\npip install pipenv\n```\n\nThen run:\n\n```commandline\ngit clone https://github.com/tmthyjames/ds-meetup-ml-repro.git\n\ncd ds-meetup-ml-repro\n\npipenv install --dev\n```\n\nThat's it! 
You're ready to start reproducing the data and ML pipelines.\n\n## Now you have access to the `reproml` CLI:\n\nTo activate the virtual environment shell:\n```commandline\npipenv shell\n```\n\nNow you can run the `reproml` commands:\n\n```commandline\n❯ reproml --help\nUsage: reproml [OPTIONS] COMMAND [ARGS]...\n\nOptions:\n  --version  Show the version and exit.\n  --help     Show this message and exit.\n\nCommands:\n  etl\n  ml\n  prepro\n  validate\n```\n\nThe CLI comes with subcommands for each phase (ETL, preprocessing, ML, validation).\nTo view the help page for a subcommand, run `reproml \u003csubcommand\u003e --help`, like so:\n\n```commandline\n❯ reproml etl --help\nUsage: reproml etl [OPTIONS] COMMAND [ARGS]...\n\nOptions:\n  --help  Show this message and exit.\n\nCommands:\n  get-lyrics  run the lyrics data jobs.\n  run-all     Run all jobs\n```\n\nSo to run the ETL, just do this:\n\n```commandline\nreproml etl get-lyrics\n```\n\nThis populates the `data/raw` folder. You can then run the `prepro` subcommand\nto populate the `data/processed` folder:\n\n```commandline\nreproml prepro process\n```\n\nYou can also import these commands into a notebook environment and use them as Python functions:\n\n```python\nimport pandas as pd\nfrom reproml.etl.lyrics import get_lyrics\nfrom reproml.preprocess import process_lyrics\n\nraw_path = get_lyrics()\nprepro_path = process_lyrics(srcpath=raw_path)\ndf = pd.read_parquet(prepro_path)\n\ndf.head()\n```\n\n## DVC (Data Version Control)\n\nTo run this with DVC and thus track all the outputs and dependencies:\n\n```commandline\ndvc repro\n```\n\nThis reproduces all the stages (ETL, preprocessing, ML, validation). 
Here's a DAG to see the full pipeline:\n\n```commandline\n❯ dvc dag\n               +------------+\n               | get-lyrics |\n               +------------+\n                      *\n                      *\n                      *\n              +---------------+\n              | prepro-lyrics |\n              +---------------+\n                      *\n                      *\n                      *\n              +--------------+\n              | split-lyrics |\n              +--------------+\n              ***            ***\n            **                  **\n          **                      **\n+-------------+                     **\n| train-model |                   **\n+-------------+                 **\n              ***            ***\n                 **        **\n                   **    **\n             +----------------+\n             | validate-model |\n             +----------------+\n```\n\n## To run individual stages with DVC:\n\nThe names of the nodes in the DAG are the names of the stages, so to run `get-lyrics`,\nfor example:\n\n```commandline\ndvc repro -s get-lyrics\n```\n\n## To contribute or set up for development, here are the development prerequisites\n\nTo work with this repo, install the following prerequisites:\n\n* Python 3.8+\n* pre-commit:\n\n```\nbrew install pre-commit\n```\n\n* pipenv\n\n```\nconda install pip\npip install pipenv\n```\n\n**Setup for Development**\n\nAfter the prerequisites are installed, run the following commands to clone the repo and configure it for development:\n\n```\ncd \u003cREPO_ROOT\u003e\n\n# Install pre-commit hooks to local clone\npre-commit install\n\n# Install pipenv environment\npipenv install --dev\n\n# Create IPython Kernel for the virtual environment\npipenv run ipython kernel install --user --name=reproml\n```\n\nThe `pipenv install` command above creates an isolated virtual environment for this repo with\nall dev dependencies installed. 
There are two main ways of using the virtual environment. To\nrun a one-off command in the environment, use `pipenv run`. For example, the following command\nshows the location of the Python executable for the environment:\n\n```\npipenv run which python\n```\n\nAlternatively, to run many commands, use `pipenv shell` to spawn a new shell in the\nvirtual environment:\n\n```\ncd \u003cREPO_ROOT\u003e\npipenv shell\n```\n\n## Running Tests\n\nTo run the unit tests, run the following command (optionally adding `--cov` for a coverage report):\n\n```\npytest\n```\n\n## Troubleshooting\n\nIf you ever have trouble with the Python environment,\nmany problems can be resolved by \"rebooting\" it\nby running these commands in the repo root:\n\n```\npipenv --rm\npipenv install --dev\n```\n\nThis resolves common problems such as packages not being found.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftmthyjames%2Fds-meetup-ml-repro","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftmthyjames%2Fds-meetup-ml-repro","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftmthyjames%2Fds-meetup-ml-repro/lists"}