{"id":16597962,"url":"https://github.com/alexioannides/ml-workflow-automation","last_synced_at":"2025-03-21T13:32:33.577Z","repository":{"id":37230827,"uuid":"170168177","full_name":"AlexIoannides/ml-workflow-automation","owner":"AlexIoannides","description":"Python Machine Learning (ML) project that demonstrates the archetypal ML workflow within a Jupyter notebook, with automated model deployment as a RESTful service on Kubernetes.","archived":false,"fork":false,"pushed_at":"2023-02-15T20:43:52.000Z","size":4316,"stargazers_count":61,"open_issues_count":8,"forks_count":29,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-01T06:51:12.125Z","etag":null,"topics":["classification","data-science","flask","helm","jupyter-notebook","kaggle","kubernetes","machine-learning","mlops","numpy","pandas","python","rest-api","sklearn"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AlexIoannides.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-02-11T17:07:37.000Z","updated_at":"2025-02-10T22:52:13.000Z","dependencies_parsed_at":"2024-01-15T03:41:45.525Z","dependency_job_id":null,"html_url":"https://github.com/AlexIoannides/ml-workflow-automation","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AlexIoannides%2Fml-workflow-automation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AlexIoannides%2Fml-workflow-automation/tags","releases_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/AlexIoannides%2Fml-workflow-automation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AlexIoannides%2Fml-workflow-automation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AlexIoannides","download_url":"https://codeload.github.com/AlexIoannides/ml-workflow-automation/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244141580,"owners_count":20404835,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","data-science","flask","helm","jupyter-notebook","kaggle","kubernetes","machine-learning","mlops","numpy","pandas","python","rest-api","sklearn"],"created_at":"2024-10-12T00:07:12.347Z","updated_at":"2025-03-21T13:32:31.942Z","avatar_url":"https://github.com/AlexIoannides.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Automating the Archetypal Machine Learning Workflow and Model Deployment\n\nThis repository contains a Python-based Machine Learning (ML) project, whose primary aim is to demonstrate the archetypal ML workflow within a Jupyter notebook, together with some proof-of-concept ideas on automating key steps, using the Titanic binary classification dataset hosted on [Kaggle](https://www.kaggle.com). The ML workflow includes: data exploration and visualisation, feature engineering, model training and selection. 
The notebook - `titanic-ml.ipynb` - also yields a persisted prediction pipeline (pickled to the `models` directory) that is used downstream in the model deployment process. Note that we have already downloaded the data from Kaggle, in CSV format, to the `data` directory in this project's root directory.\n\nThe secondary aim of this project is to demonstrate how the model generated as a 'build artefact' of the modelling notebook can be automatically deployed as a managed RESTful prediction service on Kubernetes, without having to write **any** custom code. The full details are contained in the `deploy/deploy-model.ipynb` notebook, where we lean very heavily on the approaches discussed [here](https://github.com/AlexIoannides/kubernetes-ml-ops).\n\n## Managing Project Dependencies using Pipenv\n\nWe use [pipenv](https://docs.pipenv.org) for managing project dependencies and Python environments (i.e. virtual environments). All of the direct package dependencies required to run the code (e.g. NumPy for arrays/tensors and Pandas for DataFrames), as well as all the packages used during development (e.g. flake8 for code linting and IPython for interactive console sessions), are described in the `Pipfile`. Their **precise** downstream dependencies are described in `Pipfile.lock`.\n\n### Installing Pipenv\n\nTo get started with Pipenv, first install it. Assuming that there is a global version of Python available on your system and on the PATH, this can be achieved by running the following command,\n\n```bash\npip3 install pipenv\n```\n\nPipenv is also available to install from many non-Python package managers. 
For example, on OS X it can be installed using the [Homebrew](https://brew.sh) package manager, with the following terminal command,\n\n```bash\nbrew install pipenv\n```\n\nFor more information, including advanced configuration options, see the [official pipenv documentation](https://docs.pipenv.org).\n\n### Installing this Project's Dependencies\n\nMake sure that you're in the project's root directory (the same one in which the `Pipfile` resides), and then run,\n\n```bash\npipenv install --dev\n```\n\nThis will install all of the direct project dependencies as well as the development dependencies (the latter a consequence of the `--dev` flag).\n\n### Running Python, IPython and JupyterLab from the Project's Virtual Environment\n\nIn order to continue development in a Python environment that precisely mimics the one the project was initially developed with, use Pipenv from the command line as follows,\n\n```bash\npipenv run python3\n```\n\nThe `python3` command could just as well be `ipython3` or JupyterLab, for example,\n\n```bash\npipenv run jupyter lab\n```\n\nThis will fire up a JupyterLab session *where the default Python 3 kernel includes all of the direct and development project dependencies*. This is how we advise that the notebooks within this project are used.\n\n### Automatic Loading of Environment Variables\n\nPipenv will automatically pick up and load any environment variables declared in the `.env` file, located in the project's root directory. For example, adding,\n\n```bash\nSPARK_HOME=applications/spark-2.3.1/bin\n```\n\nwill enable access to this variable within any Python program, via a call to `os.environ['SPARK_HOME']`. Note that if any security credentials are placed here, then this file **must** be removed from source control - i.e. 
add `.env` to the `.gitignore` file to prevent potential security risks.\n\n### Pipenv Shells\n\nPrepending `pipenv run` to every command you want to run within the context of your Pipenv-managed virtual environment can get very tedious. This can be avoided by entering into a Pipenv-managed shell,\n\n```bash\npipenv shell\n```\n\nwhich is equivalent to 'activating' the virtual environment. Any command will now be executed within the virtual environment. Use `exit` to leave the shell session.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falexioannides%2Fml-workflow-automation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falexioannides%2Fml-workflow-automation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falexioannides%2Fml-workflow-automation/lists"}