{"id":18777403,"url":"https://github.com/cloudera/cml_amp_mlflow_tracking","last_synced_at":"2025-04-13T10:31:54.889Z","repository":{"id":42566746,"uuid":"337865149","full_name":"cloudera/CML_AMP_MLFlow_Tracking","owner":"cloudera","description":"Experiment tracking with MLFlow.","archived":false,"fork":false,"pushed_at":"2023-10-02T22:16:38.000Z","size":257,"stargazers_count":5,"open_issues_count":0,"forks_count":14,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-04-05T21:56:05.539Z","etag":null,"topics":["experiment-tracking","mlflow-tracking","mlflow-ui"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cloudera.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-02-10T21:59:18.000Z","updated_at":"2024-09-06T07:32:11.000Z","dependencies_parsed_at":"2024-11-07T20:20:40.075Z","dependency_job_id":null,"html_url":"https://github.com/cloudera/CML_AMP_MLFlow_Tracking","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloudera%2FCML_AMP_MLFlow_Tracking","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloudera%2FCML_AMP_MLFlow_Tracking/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloudera%2FCML_AMP_MLFlow_Tracking/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloudera%2FCML_AMP_MLFlow_Tracking/manifests","owner_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cloudera","download_url":"https://codeload.github.com/cloudera/CML_AMP_MLFlow_Tracking/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248698971,"owners_count":21147565,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["experiment-tracking","mlflow-tracking","mlflow-ui"],"created_at":"2024-11-07T20:10:32.881Z","updated_at":"2025-04-13T10:31:49.877Z","avatar_url":"https://github.com/cloudera.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MLflow for experiment tracking\n\n[MLflow](https://www.mlflow.org/) self describes as\n\n\u003e an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.\n\nIn particular MLflow's experiment tracking capabilities offer a low-friction way of tracking model hyperparameters and metrics across many experiments.\nThis repository demonstrates the use of MLflow tracking in a couple of simple machine learning model training scripts inside Cloudera Machine Learning (CML) and Cloudera Data Science Workbench (CDSW).\n(We will refer only to CML in the remainder of this README, but the code should function equally well in either CML or CDSW).\nThe repository is intended as less a tutorial on MLflow, and more an example of running MLflow inside CML.\nThe AMP does not cover the model registry, project, or deployment capabilities of MLflow.\n\nThe rest of this README is structured as follows.\n\n- [Repository 
structure](#repository-structure).\n  A brief orientation to the structure of this repository.\n- [Running training scripts](#running-training-scripts).\n  Instructions and setup for running model training and testing, logging experimental results with MLflow.\n- [Viewing the MLflow UI](#viewing-the-mlflow-ui).\n  Using the MLflow UI to view the training logs.\n\n## Repository structure\n\nThe folder structure of this repo is as follows:\n\n```\n.\n├── cml       # This folder contains scripts that facilitate the project launch on CML\n└── scripts   # Our analysis code\n```\n\nWhen the training scripts have been run (this will happen on project launch if using the CML Applied ML Prototype interface), an additional `mlruns` directory will appear for use by MLflow.\nThis can be redirected to another location (HDFS, for instance).\n\n### cml\n\nThese scripts are specific to Cloudera Machine Learning, and, with the `.project-metadata.yaml` file in the root directory, allow the project to be deployed automatically, following a declarative specification for jobs, model endpoints and applications.\n\n```\ncml\n├── install_dependencies.py # Script to run pip install of Python dependencies\n└── mlflow_ui.py            # Script to launch MLflow UI application.\n```\n\n### scripts\n\nThis is where all our analysis code lives.\nIn a more involved analysis, we could replace these scripts with Jupyter notebooks to run manually, or abstract some re-usable code into a Python library.\n\n```\nscripts\n├── data.py                 # create fake train and test data\n├── train_kneighbors.py     # train a k-nearest neighbors classifier\n└── train_random_forest.py  # train a random forest classifier\n```\n\n## Launching\n\nThere are three ways to launch this project on CML:\n\n1. **From Prototype Catalog** - Navigate to the Prototype Catalog on a CML workspace, select the \"MLflow Tracking\" tile, click \"Launch as Project\", click \"Configure Project\"\n2. 
**As ML Prototype** - In a CML workspace, click \"New Project\", add a Project Name, select \"ML Prototype\" as the Initial Setup option, copy in the [repo URL](https://github.com/cloudera/CML_AMP_MLflow_Tracking.git), click \"Create Project\", click \"Configure Project\"\n3. **Manual Setup** - In a CML workspace, click \"New Project\", add a Project Name, select \"Git\" as the Initial Setup option, copy in the [repo URL](https://github.com/cloudera/CML_AMP_MLflow_Tracking.git), click \"Create Project\". Launch a Python 3 Workbench Session with at least 2 GB of memory and 1 vCPU. Then follow the instructions below, in order.\n\n## Running training scripts\n\nIf this repo is imported as an Applied Machine Learning Prototype in CML, the launch process should handle all the setup for you, and you can skip the Installation step.\nIn case you want to run through it manually, follow the instructions in the Installation section below.\n\n### Installation\n\nThe code was developed against Python 3.6.9, and will likely work on later versions.\nInside a CML Python 3 session, simply run\n\n```\n!pip3 install -r requirements.txt\n```\n\nIn order for Python to pick up the `scripts` directory when running from the command line (see below), we must set an environment variable for the project, setting the `PYTHONPATH` to the root directory of the project.\nUnless you have specifically cloned the project into a different location, this will be `/home/cdsw`.\nSee the [instructions for setting project-level environment variables in CML](https://docs.cloudera.com/machine-learning/cloud/engines/topics/ml-environment-variables.html).\nAlternatively, type `export PYTHONPATH=/home/cdsw` in a session terminal.\n\n\n### Training\n\nInside the `scripts/` directory are three scripts, as described above.\nThe `data.py` script creates a fake dataset for a supervised classification problem.\nWhen working with genuine business data, we'd probably be reading this data from a database or flat file 
storage.\n\nThere are two training scripts.\n\n- `train_kneighbors.py` trains a k-nearest neighbors algorithm, where the number of neighbors to consider is provided as a command line argument.\n- `train_random_forest.py` trains a random forest, and we expose two hyperparameters\u0026mdash;the maximum tree depth and number of trees\u0026mdash;as command line arguments.\n\nEach script is instrumented with MLflow to log the hyperparameters used and the accuracy of the trained model on a train and test set.\n\nTo train the k-nearest neighbors model, start a CML session and run `!python3 scripts/train_kneighbors.py` in the session Python prompt, or without the bang (`!`) in the session terminal.\nThis will train the model with the default (5) nearest neighbors.\nTo run with a different number of neighbors, pass a command line argument like so:\n\n```bash\n!python3 scripts/train_kneighbors.py --n-neighbors 3\n```\n\nIf the code was imported as an Applied Machine Learning Prototype, the declarative project will have set up a job for each training script, and executed each once, using the default hyperparameters.\nFeel free to run the scripts some additional times, passing different hyperparameters.\nThis can be done in any of three ways:\n\n1. By re-running the jobs after changing the default hyperparameter values in the script.\n2. Interactively in a Python session.\n3. 
At the command line in a session terminal, as described above.\n\n## Viewing the MLflow UI\n\nSince our training scripts were instrumented with MLflow, the parameters, metrics, models and additional metadata associated with any training runs will have been logged in the `mlruns` directory.\nWe can investigate the performance metrics for each run using MLflow's UI.\nThe automated setup will have created a CML Application called \"MLflow UI\" that can be visited from the Applications tab of CML, and will look something like this.\n\n![MLflow UI in CML](docs/images/mlflow-ui.png)\n\nWe can now interact with the MLflow UI as if it were running on our local machine to compare model training runs.\n\nYou can start the MLflow UI manually inside a session with\n\n```bash\n!mlflow ui --port $CDSW_READONLY_PORT\n```\n\nWhen launched from a session, the UI will be listed in the nine-dot menu in the upper right corner of the session interface.\nClicking it will open a new browser tab with the UI.\nWhen launched in a session, the UI will block other uses of the session, and will be closed when the session closes.\nIt's not recommended to run two simultaneous copies of the MLflow interface (i.e. both as an Application and inside a session).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcloudera%2Fcml_amp_mlflow_tracking","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcloudera%2Fcml_amp_mlflow_tracking","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcloudera%2Fcml_amp_mlflow_tracking/lists"}